chore: Harden keycast autoscaling for a possible viral spike #196
Draft
dcadenas wants to merge 6 commits into
Rework the SIGTERM handler in keycast/src/main.rs so that scale-down events
on GKE don't drop in-flight NIP-46 RPCs mid-request.
Phase ordering is now:
1. Flip /healthz/ready to 503 and sleep SHUTDOWN_PRE_DRAIN_SECS (default
10s). During this window the kubelet sees the failing probe and the
Service controller removes the pod from the EndpointSlice, so the GCLB
stops routing new HTTP and new NIP-46 subscriptions to us before we
stop accepting.
2. Tell axum to stop accepting and wait up to SHUTDOWN_HTTP_DRAIN_SECS
(default 45s) for in-flight HTTP requests to finish.
3. Tear down the nostr_sdk Client (stops signer.run() and the relay
worker loop, flushes in-flight sign responses back to clients) and
wait up to SHUTDOWN_SIGNER_DRAIN_SECS (default 10s) for tracked
background tasks. This is the key fix: previously client.shutdown()
ran BEFORE the drain waits, so any in-flight NIP-46 request in the
relay queue lost its WebSocket back to the requesting client.
4. Close the DB pool.
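Phase 1 above can be sketched with a shared flag. This is a minimal, std-only illustration, not the code in keycast/src/main.rs: `readiness_status` is a hypothetical stand-in for the /healthz/ready handler, and the blocking `thread::sleep` stands in for the async sleep the real handler would use.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical readiness handler: 503 once shutdown has begun, 200 otherwise.
fn readiness_status(shutting_down: &AtomicBool) -> u16 {
    if shutting_down.load(Ordering::SeqCst) {
        503
    } else {
        200
    }
}

/// Phase 1: flip the flag FIRST, then pause so endpoint removal can propagate
/// before the accept loop is closed. A zero pause returns immediately.
fn pre_drain_pause(shutting_down: &AtomicBool, pause: Duration) {
    shutting_down.store(true, Ordering::SeqCst);
    if !pause.is_zero() {
        // Assumption: blocking sleep used only to keep the sketch dependency-free.
        thread::sleep(pause);
    }
}
```

The ordering is the point: a probe that arrives at any moment during the pause already sees 503, because the store happens before the sleep starts.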
The three windows are env-var-tunable so the iac repo can adjust them
without a rebuild. Sum of defaults (65s) plus DB/tracing teardown headroom
fits inside a terminationGracePeriodSeconds of 75s. The iac-side PR bumps
grace accordingly.
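The env-var parsing described above might look roughly like the following. The struct fields, the `parse_secs` helper, and the `lookup` closure parameter are assumptions made for illustration; only the env var names and the 10s/45s/10s defaults come from this commit message.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ShutdownTimings {
    pre_drain: Duration,
    http_drain: Duration,
    signer_drain: Duration,
}

/// Parse whole seconds. Missing or unparsable input falls back to the default;
/// zero is a valid value and is kept as-is.
fn parse_secs(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// `lookup` abstracts the environment (hypothetical parameter) so tests do not
/// have to mutate process-wide state.
fn parse_shutdown_timings(lookup: impl Fn(&str) -> Option<String>) -> ShutdownTimings {
    ShutdownTimings {
        pre_drain: parse_secs(lookup("SHUTDOWN_PRE_DRAIN_SECS").as_deref(), 10),
        http_drain: parse_secs(lookup("SHUTDOWN_HTTP_DRAIN_SECS").as_deref(), 45),
        signer_drain: parse_secs(lookup("SHUTDOWN_SIGNER_DRAIN_SECS").as_deref(), 10),
    }
}
```

In a binary this would be called as `parse_shutdown_timings(|key| std::env::var(key).ok())`.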
Covered by 6 new unit tests:
- parse_shutdown_timings defaults / overrides / zero-is-valid /
invalid-falls-back-to-default
- pre_drain_pause flips the readiness flag BEFORE sleeping
(tokio::time pause + advance, deterministic)
- pre_drain_pause with zero duration returns immediately
Doc references for the design choices:
- K8s HPA v2 behavior defaults (Percent 100/15s + Pods 4/15s, scaleDown
stabilizationWindowSeconds: 300). HPA tune itself lands in the iac
PR, but the phase budgets here are sized against the iac PR's
terminationGracePeriodSeconds bump to 75s.
- Tokio graceful-shutdown topic
(https://tokio.rs/tokio/topics/shutdown) — three-part model (signal →
cancel → wait). Already followed by existing code via Notify +
TaskTracker; this change tightens phase ordering.
- K8s pods-and-endpoint-termination-flow — endpoint-slice removal is
asynchronous vs SIGTERM, hence the in-process sleep before closing the
accept loop.
Out of scope for this repo: HPA tune, PodDisruptionBudget,
topologySpreadConstraint, terminationGracePeriodSeconds — those are in
divine-iac-coreconfig. Adding tokio test-util to dev-dependencies so we
can test the pre-drain pause deterministically with tokio::time::advance.
…startup
Adds a startup-time sanity check so a misconfigured phase duration is
reported via tracing::warn rather than surfacing as a kubelet SIGKILL
mid-drain on scale-down. Addresses two concrete failure modes flagged
in review:
1. Operator sets SHUTDOWN_HTTP_DRAIN_SECS=120 against a 75s grace
period. Previously silent; now a loud warn-level log at boot
includes the three phase values, the computed total + margin, and
the configured ceiling.
2. Sibling divinevideo/divine-iac-coreconfig PR (tracked in #692)
lands with terminationGracePeriodSeconds below the current keycast
defaults. Same warning surface.
New surface area:
- SHUTDOWN_TEARDOWN_MARGIN (const, 10s): reserved headroom between
the end of the phased drain and SIGKILL for DB pool close + tracing
flush. Subtracted from the ceiling during validation.
- DEFAULT_SHUTDOWN_GRACE_CEILING_SECS (const, 120s): conservative
upper bound covering the issue spec's 60-90s range with headroom.
Override via SHUTDOWN_GRACE_PERIOD_CEILING_SECS.
- ShutdownTimings::total_budget(): pre_drain + http_drain +
signer_drain (no margin).
- validate_shutdown_timings(&ShutdownTimings, Duration) -> Result<(),
String>: returns Err with an operator-actionable message if total +
margin > ceiling. Exact equality is accepted (margin fully fits).
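A minimal sketch of the validation rule, assuming a plain struct with three `Duration` fields. The error text is illustrative only; the real message wording is not shown in this commit.

```rust
use std::time::Duration;

const SHUTDOWN_TEARDOWN_MARGIN: Duration = Duration::from_secs(10);

struct ShutdownTimings {
    pre_drain: Duration,
    http_drain: Duration,
    signer_drain: Duration,
}

impl ShutdownTimings {
    /// Sum of the three phase budgets, without the teardown margin.
    fn total_budget(&self) -> Duration {
        self.pre_drain + self.http_drain + self.signer_drain
    }
}

/// Err when total + margin exceeds the ceiling; exact equality is accepted.
fn validate_shutdown_timings(t: &ShutdownTimings, ceiling: Duration) -> Result<(), String> {
    let needed = t.total_budget() + SHUTDOWN_TEARDOWN_MARGIN;
    if needed > ceiling {
        // Illustrative message; the wording is an assumption.
        return Err(format!(
            "shutdown budget {}s (phases {}s + margin {}s) does not fit inside ceiling {}s",
            needed.as_secs(),
            t.total_budget().as_secs(),
            SHUTDOWN_TEARDOWN_MARGIN.as_secs(),
            ceiling.as_secs()
        ));
    }
    Ok(())
}
```

With the defaults (10s + 45s + 10s = 65s, plus the 10s margin) a 75s ceiling passes and a 74s ceiling fails.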
The call site in async_main warns rather than refusing to boot: the
ceiling is cross-repo-coupled to the iac repo, and an operator may
legitimately need to start before the iac repo catches up. Marked with
TODO(#692) so the warn-only fallback can be tightened once the iac PR
lands and terminationGracePeriodSeconds is pinned at >=75s everywhere.
Default phase values (65s sum + 10s margin = 75s) are unchanged,
matching the iac PR target. Decision: keep defaults rather than shrink
to fit a 60s grace, because the iac PR targets 75s and shrinking would
reduce the HTTP-drain budget available for legitimately long in-flight
sign requests. The validation path is the defense-in-depth for the
smaller-grace case, with SHUTDOWN_* env vars as the tuning knobs.
Covered by 7 new unit tests (31 total in keycast bin, all passing):
- parse_shutdown_grace_ceiling default / override
- ShutdownTimings::total_budget arithmetic
- validate_shutdown_timings ok when under ceiling (75s, 90s)
- validate_shutdown_timings err when over ceiling (60s grace vs 65s
+ 10s margin, from review)
- validate_shutdown_timings err when single field overshoots
(SHUTDOWN_HTTP_DRAIN_SECS=120 case from review)
- validate_shutdown_timings boundary: total + margin equal to the
ceiling accepted, one second above rejected, one second below accepted
cargo clippy --workspace --all-targets --all-features -- -D warnings
-A deprecated and cargo fmt --all -- --check are clean.
Previously the HTTP-drain timeout arm in async_main just logged 'API
server shutdown timed out' and fell through. Dropping the axum task's
JoinHandle at end of async_main does NOT cancel the underlying task
(tokio+std semantics), so axum would keep running concurrently with
phase 3 (relay client teardown) and phase 4 (DB pool close). Pool
close racing a still-live accept loop surfaces as DB errors on any
late-arriving requests, and the kubelet SIGKILL at
terminationGracePeriodSeconds would be the actual stop — defeating
the point of a bounded HTTP drain budget.
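The detach-on-drop behaviour can be observed with plain `std::thread`, used here as an analogue for the tokio task. The helper name is hypothetical. Unlike tokio's `JoinHandle`, the std handle has no `abort()`, so this sketch only shows the failure mode, not the fix.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Spawn a worker, drop its handle immediately, and report whether the worker
/// still ran to completion afterwards.
fn worker_survives_dropped_handle() -> bool {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        // The send succeeds only if the worker outlived the dropped handle.
        let _ = tx.send("still running");
    });
    drop(handle); // Detaches the thread; it is not stopped.
    rx.recv_timeout(Duration::from_secs(5)).is_ok()
}
```

The function returns `true`: dropping the handle gives up the ability to join, but the work itself carries on, which is why the timeout arm needs an explicit abort.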
Extracted the drain into a helper (drain_http_or_abort) that takes the
JoinHandle and budget, awaits the task up to the budget via
tokio::time::timeout with &mut handle (JoinHandle: Unpin + Future, so
timeout borrows rather than consumes it), and on timeout calls
handle.abort() then awaits the cancellation before returning. Returns
a two-variant HttpDrainOutcome so the caller logs an honest message
in both arms instead of the old 'timed out' line that implied axum
had stopped.
The caller in async_main now logs:
'API server shutdown timed out; axum task aborted to prevent DB
pool close from racing late requests'
on the timeout path, which is what actually happens.
Covered by 2 new tokio::test(start_paused = true) unit tests:
- drain_http_or_abort returns Completed when the task finishes inside
the budget (fast path).
- drain_http_or_abort aborts the task when the budget is exceeded,
using a Drop-guarded cancellation flag in the spawned task to
prove the task is actually stopped (not just left pending with
the join result thrown away). This directly reproduces the review
failure mode.
33 unit tests in the keycast bin now all pass. cargo clippy --workspace
--all-targets --all-features -- -D warnings -A deprecated and cargo fmt
--all -- --check are clean.
Phase 4 (DB pool close) previously awaited sqlx's Pool::close() with
no upper bound, contradicting the design contract that documented
SHUTDOWN_TEARDOWN_MARGIN (10s) as the headroom reserved inside
terminationGracePeriodSeconds for post-drain cleanup. sqlx's
Pool::close() waits for every checked-out connection to be returned,
so a connection stuck in a long-running query — e.g. a tenant-cache
preload task mid-query when task_tracker.wait() timed out and
signer_handle.abort() was called but not awaited — could block past
the kubelet SIGKILL at terminationGracePeriodSeconds. That would
swallow the final 'Graceful shutdown complete' log, leaving operators
with no positive signal of clean shutdown.
Extracted close_within_margin(close_fut, margin) -> PoolCloseOutcome
which wraps the future in tokio::time::timeout. Generic over
Future<Output = ()> so the behaviour can be tested deterministically
with tokio::time::pause without standing up a real Postgres pool.
sqlx's Pool::close is cancellation-safe, so dropping the future on
timeout is safe.
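The shape of the helper can be illustrated without tokio. The sketch below swaps the async timeout for a thread plus `recv_timeout`, so it is an analogue under that assumption rather than the PR's implementation; the closure parameter replaces the future, and only the two outcome names are taken from the text above.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
enum PoolCloseOutcome {
    Completed,
    TimedOut,
}

/// Run `close` on a helper thread and wait at most `margin` for it to finish.
fn close_within_margin<F>(close: F, margin: Duration) -> PoolCloseOutcome
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        close();
        let _ = tx.send(());
    });
    match rx.recv_timeout(margin) {
        Ok(()) => PoolCloseOutcome::Completed,
        // Timeout, or the closure panicked and dropped the sender.
        Err(_) => PoolCloseOutcome::TimedOut,
    }
}
```

One difference from the async version is worth noting: here the helper thread keeps running after a timeout, whereas dropping a timed-out future stops polling it. Either way the caller regains control within the margin and can emit its final log line.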
The call site in async_main now logs two distinct lines:
- 'Graceful shutdown complete' on Completed (the happy path).
- 'Graceful shutdown complete with warning: DB pool close exceeded
teardown margin; stuck checked-out connections were dropped' on
TimedOut (still emitted so operators have a positive signal).
Covered by 3 new tokio::test(start_paused = true) unit tests:
- returns Completed when the future finishes inside the margin.
- returns TimedOut when the future blocks past the margin.
- enforces the real bound: measures virtual-time elapsed and asserts
the helper returns within the margin + 100ms even when the inner
future would take 1h. This is the safety property — phase 4
cannot eat into the kubelet's SIGKILL window.
36 unit tests in the keycast bin now all pass. cargo clippy
--workspace --all-targets --all-features -- -D warnings -A deprecated
and cargo fmt --all -- --check are clean.
Regression defense. `parse_duration_secs_env` previously accepted any
u64, so a misconfig like SHUTDOWN_PRE_DRAIN_SECS=18446744073709551600
parsed cleanly. The value then flowed into ShutdownTimings, where
total_budget()'s `+` sum panicked on Duration overflow — aborting
the process at startup with a raw 'overflow when adding durations'
panic BEFORE validate_shutdown_timings could surface the operator-
actionable 'does not fit inside ceiling' warn. Net result: crashloop
instead of loud warning.
Added SHUTDOWN_PER_FIELD_CEILING_SECS const (24*3600 = 86400s = 1 day).
parse_duration_secs_env now rejects any value strictly greater than
the ceiling and falls back to the per-field default with a warn-level
log naming the env var, the bad value, and the ceiling. Three fields
at the ceiling sum to 3 * 86400 = 259200s = 3 days, comfortably inside
u64, so total_budget() can stay infallible.
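The overflow and the per-field ceiling can be shown in a few lines. This is a sketch under assumptions: `parse_duration_secs` takes the raw string directly instead of reading the environment, `eprintln!` stands in for the warn-level log, and `sum_overflows` is a hypothetical helper added only to demonstrate the arithmetic.

```rust
use std::time::Duration;

const SHUTDOWN_PER_FIELD_CEILING_SECS: u64 = 24 * 3600; // 86400s = 1 day

/// Accept values up to and including the ceiling; anything larger or
/// unparsable falls back to the per-field default.
fn parse_duration_secs(name: &str, raw: &str, default_secs: u64) -> Duration {
    match raw.trim().parse::<u64>() {
        Ok(secs) if secs <= SHUTDOWN_PER_FIELD_CEILING_SECS => Duration::from_secs(secs),
        _ => {
            // Stand-in for the warn-level log described above.
            eprintln!(
                "{name}={raw} rejected (ceiling {SHUTDOWN_PER_FIELD_CEILING_SECS}s); using default {default_secs}s"
            );
            Duration::from_secs(default_secs)
        }
    }
}

/// `checked_add` returns None exactly where the `+` operator would panic.
fn sum_overflows(a: Duration, b: Duration) -> bool {
    a.checked_add(b).is_none()
}
```

With the ceiling in place, the largest possible sum is `Duration::from_secs(3 * 86400)`, so the plain `+` in `total_budget()` cannot overflow.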
One day is far above any plausible operational value — real drain
budgets are seconds, not hours — and the ceiling is documented on
the constant so an operator who genuinely wants a larger value knows
where to look. The warn log still emits at boot so the misconfig
doesn't silently revert.
Covered by 4 new unit tests:
- u64::MAX-ish input falls back to default AND downstream
total_budget() does not panic (the actual safety property).
- value strictly above the ceiling falls back to default.
- value exactly at the ceiling is accepted (legitimate opt-in).
- total_budget() with every field at the ceiling computes 3 *
ceiling without overflow and equals Duration::from_secs(3 *
ceiling).
40 unit tests in the keycast bin now all pass. cargo clippy
--workspace --all-targets --all-features -- -D warnings -A deprecated
and cargo fmt --all -- --check are clean.
Phase 3's `client_for_shutdown.shutdown().await` was previously awaited
unbounded in `async_main` — only the trailing `task_tracker.wait()`
was wrapped in `tokio::time::timeout(signer_drain, ...)`. A hung
WebSocket close handshake, a bug in `nostr_sdk::Client::shutdown`, or
a dead relay socket could therefore block phase 3 past
`terminationGracePeriodSeconds` until the kubelet's SIGKILL ended the
process: phase 4 (DB pool close, already bounded by
`SHUTDOWN_TEARDOWN_MARGIN`) would never run, and the final
`Graceful shutdown complete` log would be lost.
This silently busted the design contract enforced by
`validate_shutdown_timings`:
pre_drain + http_drain + signer_drain + SHUTDOWN_TEARDOWN_MARGIN
<= terminationGracePeriodSeconds
Relay teardown was not in that sum.
Extracted the phase into `drain_signer_or_abort(client_shutdown,
tracker_wait, abort_signer, budget)` which wraps the whole phase in
ONE outer `tokio::time::timeout(budget, ...)` rather than per-await
timeouts. One shared outer bound is strictly more correct than two
independent inner timeouts: two independent `signer_drain`-sized
timeouts would make phase 3 worst-case `2 * signer_drain` and quietly
bust the same contract we were trying to restore. The helper awaits
`client_shutdown` then `tracker_wait` sequentially inside the
timeout; on timeout it invokes the caller-supplied `abort_signer`
callback (which in production calls `signer_handle.abort()`) and
returns `AbortedAfterTimeout`.
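The worst-case arithmetic behind this choice can be checked with a small model. Both function names are hypothetical; the model only computes wall time for two sequential steps and does not run any async code.

```rust
use std::time::Duration;

/// Wall time for two sequential steps under ONE outer timeout of `budget`.
fn shared_bound(first: Duration, second: Duration, budget: Duration) -> Duration {
    (first + second).min(budget)
}

/// Wall time when each step gets its own independent `budget`-sized timeout.
fn independent_bounds(first: Duration, second: Duration, budget: Duration) -> Duration {
    first.min(budget) + second.min(budget)
}
```

For two 8s steps against a 10s budget, the shared bound stops at 10s while independent timeouts let the phase run the full 16s. If both steps hang, the independent variant takes 20s, which is the `2 * signer_drain` worst case.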
The helper is generic over both inner futures and the abort callback
so the timeout behaviour can be exercised deterministically with
`tokio::time::pause` in tests without standing up a real
`nostr_sdk::Client` or a real signer `JoinHandle`.
Call site in `async_main`:
    match drain_signer_or_abort(
        client_for_shutdown.shutdown(),
        task_tracker.wait(),
        || signer_handle_for_abort.abort(),
        timings.signer_drain,
    ).await { ... }
replaces the separate `client_for_shutdown.shutdown().await` +
`tokio::time::timeout(signer_drain, task_tracker.wait())` block. The
log line on timeout is updated to reflect the new truth — either
`client.shutdown()` or `task_tracker.wait()` (or both) overran the
budget.
Covered by 5 new `tokio::test(start_paused = true)` unit tests:
- happy path: both futures finish inside budget → Completed, abort
NOT called
- client.shutdown() hangs → AbortedAfterTimeout, abort called
(the reviewer-flagged regression target)
- task_tracker.wait() hangs → AbortedAfterTimeout, abort called
(pre-patch behavior preserved under new outer-timeout structure)
- safety bound: helper returns within `budget` even when both inner
futures would sleep 1h
- shared-budget semantic: 8s + 8s sequential awaits against a 10s
budget must time out at 10s, not extend to 16s (pins the
single-outer-timeout vs two-independent-timeouts choice)
45 unit tests in the keycast bin now all pass (40 previously + 5
new). cargo test --workspace --lib --bins is clean.
cargo clippy --workspace --all-targets --all-features -- -D warnings
-A deprecated and cargo fmt --all -- --check are clean.
Relates to #692.
Plain summary
On a scale-down event, keycast could drop in-flight NIP-46 sign requests before their responses got back to the client. That would cause a reconnect storm right when we are trying to reduce pod count. This PR reorders shutdown into four explicit bounded phases so K8s stops routing traffic to the pod before we stop accepting, HTTP drains first, and only then do we tear down the relay client that the signer workers need. Every phase budget is env-var-tunable and validated at startup against a grace-period ceiling.
Part of the umbrella autoscaling hardening in issue #692. The Kubernetes-side pieces (HPA tune, PDB, topology-spread, `terminationGracePeriodSeconds`) land in a sibling PR against `divinevideo/divine-iac-coreconfig`; this PR is the keycast (Rust) slice.
Motivation
Keycast holds two pieces of per-pod state that matter during shutdown: long-lived NIP-46 WebSocket relay connections, and a bounded queue of in-flight sign requests being processed by relay workers. The old shutdown block did two things in the wrong order:
- It flipped `/healthz/ready` to 503 and immediately told axum to stop accepting. K8s EndpointSlice propagation is asynchronous, so the load balancer kept routing new HTTP and NIP-46 subscriptions to us for several seconds while we rejected them.
- It called `client.shutdown()` before awaiting HTTP drain. In-flight NIP-46 requests in the relay worker queue lost their WebSocket back to the requesting client and could not deliver responses.
On a scale-down that amplifies into a reconnect/retry storm.
Related Issue
What Changed
The shutdown sequence in `async_main` is now four explicit phases, with five defensive additions accumulated across three review cycles.
1. Set `shutting_down = true` (readiness returns 503) and sleep `SHUTDOWN_PRE_DRAIN_SECS` (default 10s). This is the window during which K8s observes the failing probe and removes us from the Service EndpointSlice.
2. Tell axum to stop accepting and wait up to `SHUTDOWN_HTTP_DRAIN_SECS` (default 45s). On timeout the axum task is aborted via `drain_http_or_abort` and awaited to cancellation, so phases 3 and 4 do not race a still-live accept loop; previously the `JoinHandle` was dropped, which does NOT cancel the task under tokio+std semantics.
3. Tear down the `nostr_sdk` `Client` (which stops `signer.run()` and the relay worker loop so in-flight sign responses can still be published on the way out) AND wait for tracked background tasks, both under a single shared `SHUTDOWN_SIGNER_DRAIN_SECS` (default 10s) budget via `drain_signer_or_abort`. Previously `client.shutdown()` was unbounded and only the trailing `task_tracker.wait()` was timeout-wrapped. On timeout the signer `JoinHandle` is aborted.
4. Close the DB pool, bounded by `SHUTDOWN_TEARDOWN_MARGIN` (10s) via `close_within_margin`. sqlx's `Pool::close` waits for every checked-out connection to be returned, so a stuck query could previously block past the kubelet SIGKILL and swallow the final log line. Now the margin is a real bound; on timeout we still emit a warn-level 'Graceful shutdown complete with warning' so operators have a positive signal.
At startup `validate_shutdown_timings` checks that `pre + http + signer + 10s teardown margin <= SHUTDOWN_GRACE_PERIOD_CEILING_SECS` (default 120s). If the configured budget does not fit, a warn-level log names the three phase values, the computed total plus margin, and the ceiling. We warn rather than refuse to boot because the ceiling is cross-repo-coupled to the iac repo. Marked with `TODO(#692)` so the warn-only fallback can tighten once the iac PR pins `terminationGracePeriodSeconds` at >=75s everywhere.
Every `SHUTDOWN_*_SECS` env var is checked against a per-field sanity ceiling of 1 day (`SHUTDOWN_PER_FIELD_CEILING_SECS = 24*3600`). Values above that fall back to the per-field default with a warn-level log. This prevents a typo like `SHUTDOWN_PRE_DRAIN_SECS=18446744073709551600` from overflowing the sum in `ShutdownTimings::total_budget()` and crashlooping before the validate-and-warn path can surface an operator-actionable message.
New env vars (all optional, all have safe defaults):
- `SHUTDOWN_PRE_DRAIN_SECS`: default 10.
- `SHUTDOWN_HTTP_DRAIN_SECS`: default 45.
- `SHUTDOWN_SIGNER_DRAIN_SECS`: default 10.
- `SHUTDOWN_GRACE_PERIOD_CEILING_SECS`: default 120.
Design choice: one shared phase-3 bound vs two independent ones
The cycle-3 review suggested 'wrap `client_for_shutdown.shutdown()` in `tokio::time::timeout(timings.signer_drain, ...)` so it shares the same bound as `task_tracker.wait()`'. Taken literally that means two independent timeouts each sized at `signer_drain`, which in the worst case makes phase 3 run for `2 * signer_drain` and quietly busts the design contract that `validate_shutdown_timings` enforces. Chose the stricter option: one outer `timeout(signer_drain, ...)` over the whole phase via `drain_signer_or_abort`. Same happy-path behavior, strictly tighter worst-case bound, no env-var surface change, no iac contract change.
Docs consulted
- Tokio graceful-shutdown topic: already followed by existing code via `Notify` + `TaskTracker`; this change tightens phase ordering and makes aborts actually happen on timeout.
- K8s `pods-and-endpoint-termination-flow` tutorial: endpoint-slice removal is asynchronous vs SIGTERM, hence the in-process sleep before closing the accept loop.
Cross-repo coupling note (one-way door)
The commit `b74aaba` already on `main` (#174) widened `registered_clients.tenant_id` from `INTEGER` to `BIGINT` via `database/migrations/20260429000000_widen_registered_clients_tenant_id.sql` AND removed the `::BIGINT` casts from the repository queries in `core/src/repositories/registered_client.rs` in the same commit. `cloudbuild.yaml` runs migrations before deploy so the happy path is safe, but if a deploy boots against a DB that has not applied that migration (dev DB restored from a pre-migration snapshot, revision rollback reverting the column but not the code, run-migrations step failing before deploy proceeds) every query in that repository fails at decode time with `mismatched types: Rust type i64 (as SQL type int8) is not compatible with SQL type int4` and OAuth client management breaks entirely. Raised as a one-way-door risk by review. No code in this PR touches that file; this callout is for awareness.
Testing
I ran the keycast-bin unit tests, workspace lib+bin tests, clippy, and fmt. The new code has 27 shutdown-specific unit tests: 7 for env parsing (defaults, overrides, zero, invalid, per-field ceiling rejection/acceptance, u64::MAX regression), 2 for the pre-drain flag-flip ordering, 7 for the grace-period ceiling validation, 2 for the HTTP-drain abort-on-timeout helper, 3 for the DB-pool-close margin helper, 1 for `total_budget` not overflowing at the per-field ceiling, and 5 for the cycle-3 `drain_signer_or_abort` helper.
- `cargo test -p keycast --bin keycast`: 45 passed / 0 failed.
- `cargo test --workspace --lib --bins`: all passed.
- `cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated`: clean.
- `cargo fmt --all -- --check`: clean.
- `cargo test --workspace --verbose`: 11 DB-backed integration tests (`api/tests/atproto_http_test.rs`, `core/tests/auth_event_repository_test.rs`) fail locally with `Failed to run migrations: VersionMissing(20260429030500)`. That migration was removed/renumbered in source but my local `keycast_test` Postgres still has a record of having run it; this reproduces on unmodified `main` and is unrelated to this PR's diff. CI with a fresh DB should be green.
Not tested in this PR: full kill-pod-and-observe-reconnect-spike verification. That is integrated acceptance (keycast + iac) and needs the sibling iac PR to land first so `terminationGracePeriodSeconds` is wide enough for the new phase budget.
Risks
The main risk is a grace-period mismatch with the sibling iac PR. The defaults target the iac PR's 75s goal, and the startup ceiling validation plus per-field sanity ceiling surface misconfig in boot logs rather than letting it silently kill shutdown mid-flight. All five defensive changes (abort-on-timeout for HTTP, pool-close margin, startup ceiling validation, per-field sanity clamp, shared phase-3 bound) are strictly safer than the original phased-shutdown diff.
- `terminationGracePeriodSeconds` below 75s. The validation logs a warn-level event at boot naming the exact mismatch. Env vars let ops tune down without a rebuild.
- Hung `client.shutdown()` or `task_tracker.wait()`: phase 3 is bounded by `signer_drain`; on timeout the signer task is aborted.
- DB pool close is bounded by `SHUTDOWN_TEARDOWN_MARGIN`. On timeout we still emit the final log (with warn level) so operators have a positive signal.
- Oversized env values are rejected by the per-field sanity ceiling before they can overflow `total_budget`.
- `SHUTDOWN_PRE_DRAIN_SECS=0` short-circuits the pre-drain sleep.
- See the `registered_clients.tenant_id` callout above. Not touched by this PR; flagged for awareness.
Follow-ups
Tracked in issue #692:
- … (`PriorityClass: -1`) to absorb cluster-autoscaler node-provisioning latency.
- Tighten `validate_shutdown_timings` from warn to hard-fail once the iac-side `terminationGracePeriodSeconds` is pinned at >=75s everywhere.
Visuals