docs: update atproto auth server runbooks #4
Merged
rabble added a commit that referenced this pull request on Jun 5, 2026
…diness (#12)

* docs: add entryway phase 2 boundary plan

* docs: reconcile atproto rollout runbooks

* feat: change crosspost_enabled default to TRUE for new accounts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add enable_account_link query and store method
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add POST /enable endpoint to re-enable crossposting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add BAD_GATEWAY test for enable endpoint sync failure
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: accept optional crosspost_enabled in opt-in request
Threads crosspost_enabled from the HTTP request body through to the DB upsert, eliminating the race window where a user unchecks "Publish to Bluesky" before the account is created. Field defaults to true so existing callers are unaffected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: verify opt-in with crosspost_enabled=false creates disabled account
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: apply cargo fmt across workspace (baseline cleanup)
The repo's launch-checklist preflight requires `cargo fmt --check`, but main had accumulated 31 unformatted hunks across divine-atbridge, divine-handle-gateway, and divine-localnet-admin. Run `cargo fmt --all` to clear them. Pure formatting, no behavior change. Kept separate from feature work per AGENTS.md (no mixing formatting churn into scoped changes).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(rollout): add per-chunk sub-plans and fold in verified swarm findings
10 detailed per-chunk implementation sub-plans under docs/superpowers/plans/2026-05-30-rollout-chunks/, produced by a 26-agent decomposition swarm and adversarially verified.
Master plan updated with four corrections (3 independently re-verified):
- CRITICAL: PDS_OAUTH_AUTHORIZATION_SERVER and PDS_ENTRYWAY_DID are unset in rsky-pds IAC, so /.well-known/oauth-protected-resource 404s and entryway token-trust is inert regardless of image. Chunk B now wires env vars and precedes A; Chunk D is blocked until PDS_ENTRYWAY_DID exists.
- E: mobile blueskyPublishing flag exists (off); only enableAtprotoPublishing/atprotoPublishing are phantom.
- H: no app-level rate limiting in keycast/rsky; enforce at NGINX edge.
- A: smoke script had no token_endpoint assertion; A2 sweep is a no-op (remaining /api/oauth refs are legitimate generic-oauth/rejected-option).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(smoke): assert real /api/atproto/oauth endpoints (Chunk A1)
The entryway authorization-server metadata advertises the ATProto OAuth flow at /api/atproto/oauth/{authorize,par} (verified against keycast origin/main + deployed bd92361). The script asserted /api/oauth/*, which would fail against the live server. /api/oauth/* remains keycast's generic Nostr/UCAN OAuth.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(rollout): fix ATProto auth-server metadata path in boundary plan (Chunk A2)
The auth-server-metadata test example asserted /api/oauth/* but the example is unambiguously the ATProto profile (scopes_supported: atproto, require_pushed_authorization_requests), so it must be /api/atproto/oauth/*. This is the single genuine A2 reference; other /api/oauth/* refs describe keycast's generic Nostr/UCAN OAuth and are left intact. Master plan correction #4 updated to match.
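The A1/A2 path distinction above can be sketched as a small check. This is only an illustration under stated assumptions: the host `entryway.example` and the helper names `url_path` and `is_atproto_oauth_endpoint` are hypothetical and are not taken from the smoke script or keycast; only the `/api/atproto/oauth/` prefix and the `authorize`/`par` leaves come from the commit message.

```rust
/// Prefix the commit message says the entryway advertises for the ATProto OAuth flow.
const ATPROTO_OAUTH_PREFIX: &str = "/api/atproto/oauth/";

/// Hypothetical helper: return the path of an absolute http(s) URL,
/// without query string or fragment. No URL crate is used.
fn url_path(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let start = rest.find('/')?;
    let path = &rest[start..];
    let end = path
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(path.len());
    Some(&path[..end])
}

/// True only when `url` is exactly `<origin>/api/atproto/oauth/<leaf>`.
fn is_atproto_oauth_endpoint(url: &str, leaf: &str) -> bool {
    match url_path(url) {
        Some(path) => path.strip_prefix(ATPROTO_OAUTH_PREFIX) == Some(leaf),
        None => false,
    }
}

fn main() {
    // Hypothetical metadata values; a real check would read them from
    // the authorization-server metadata document.
    let authorize = "https://entryway.example/api/atproto/oauth/authorize";
    let par = "https://entryway.example/api/atproto/oauth/par";
    let generic = "https://entryway.example/api/oauth/authorize";

    assert!(is_atproto_oauth_endpoint(authorize, "authorize"));
    assert!(is_atproto_oauth_endpoint(par, "par"));
    // The generic Nostr/UCAN OAuth path must not pass the ATProto check.
    assert!(!is_atproto_oauth_endpoint(generic, "authorize"));
}
```

Matching on the full prefix plus leaf, rather than a substring such as `oauth/authorize`, is what keeps the generic `/api/oauth/*` routes from passing by accident.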
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(atbridge): add lease-expiry + failed-backfill watchdog metrics (Chunk G2)
Adds WatchdogMetrics (expired_leases, failed_backfills) to RuntimeHealthState and a read-only count_watchdog_metrics() that counts publish_jobs stuck in_progress past their lease and account_links with publish_backfill_state='failed' (the launch-checklist ops queries), surfaced on /metrics and logged by a spawned poll loop. The Diesel query is offloaded via spawn_blocking. Verified with `cargo check -p divine-atbridge` (passes). The new code is clippy-clean; the workspace `clippy -D warnings` gate is NOT green due to PRE-EXISTING lints in unrelated files (tracked in a separate commit). Integration coverage needs TEST_DATABASE_URL (no DB in this env), so the runtime poll path is type-checked only, not exercised.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: clear two pre-existing clippy -D warnings lints
The launch-checklist preflight requires clippy --all-targets -- -D warnings, but it was already red from merged code (not from this branch's work):
- queries.rs cancel_publish_job: collapsible_if - the outer (Published || Skipped) guard is redundant since the inner arm requires Skipped; collapsed to the single equivalent condition.
- video_service.rs: unnecessary_lazy_evaluations - unwrap_or_else(|| x) -> unwrap_or(x) for a non-lazy value.
Behavior-preserving. Does not clear ALL workspace lints (a separate await_holding_lock in a test remains); scoped to the two trivial fixes adjacent to this branch's changes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(atbridge): make watchdog opt-in, off by default (Chunk G2 follow-up)
The initial G2 commit wired spawn_watchdog unconditionally into the prod spawn() entry point, a live 30s DB-poller on every atbridge startup. That exceeded the intended scope (a passive metric).
Gate it behind a new WATCHDOG_ENABLED config flag (default false), mirroring VIDEO_SERVICE_ENABLED, with WATCHDOG_INTERVAL_SECS (default 30) for cadence. Raw PgConnection::establish per tick matches the existing atbridge norm (runtime.rs, provision_runtime.rs). Verified: cargo check --all-targets passes; cargo clippy --lib -- -D warnings clean. (Pre-existing await_holding_lock lints in test files remain, out of scope.) Runtime poll path still needs TEST_DATABASE_URL to exercise.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(atbridge): silence false-positive clippy lints in DB test harness
Clears the remaining clippy -D warnings failures so the launch-checklist preflight passes for the atbridge/bridge-db crates. All pre-existing in merged test code (PR #7), all false positives:
- await_holding_lock (5 files): tests hold a process-wide marker Mutex<()> across .await to serialize access to the shared test database; the guard carries no data, so this is intentional and safe. Scoped file-level allow.
- too_many_arguments: seed_account mirrors the account_links schema; the arg count is inherent, not a smell.
Behavior unchanged. Does not touch the workspace-wide pre-existing cargo-fmt drift (left out to keep this branch scoped per AGENTS.md).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(atbridge): self-apply bridge DB migrations on startup (B2)
The bridge DB had NO migration automation: no db-migrate Job in IaC, no embedded runner, and Dockerfile CMD never applied the copied migrations/. So after PR #7 the running bridge could reference 004_publish_job_scheduler columns the live schema lacked, silently breaking publishing (reproduced: "column nostr_pubkey does not exist" against a 001-only DB).
Fix: the bridge now applies its owned migrations on startup before touching the DB (divine-bridge-db::run_pending_migrations, called from main).
The SQL is made idempotent (CREATE TABLE/INDEX IF NOT EXISTS, ADD COLUMN IF NOT EXISTS, ALTER COLUMN SET DEFAULT) so it is safe against any existing schema state: fresh, partial, or fully hand-migrated prod. Embedded via include_str! so it works regardless of container CWD, applied in explicit order which also sidesteps the duplicate 004_ directory prefix without renaming files. Test migrations_are_idempotent_across_db_states covers fresh, re-run, and the real prod partial-state (001 only -> adds 004 columns). Full suite: 125 pass, clippy -D warnings clean, fmt clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(runbook): update crosspost blockers - placeholder fixed, secrets still missing
After the first IaC pass: the ENVIRONMENT-placeholder issue (B1b) is fixed; live ExternalSecrets now request correct *-production keys. The remaining blocker (B1a) is that the GCP Secret Manager secrets still don't exist (ESO: "Secret does not exist", confirmed NotFound not IAM). Added the exact complete checklist of all 25 missing secret keys across atbridge/handle-gateway/rsky-pds, pulled live from the cluster.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(runbook): staging is the proven template for prod crosspost promotion
Verified the bridge runs healthy in dv-platform-staging (atbridge/handle-gateway/rsky-pds all 1/1 Running, ExternalSecrets SecretSynced=True). So prod is a deployment/secrets gap, not a code issue. Promotion = recreate the staging secret key set with -production suffix in dv-platform-prod. Also flagged: staging runs an older image (3ba324c) that predates the B2 self-migration, so prod must pin a B2-containing build (048c293+) for startup migration to run, else apply schema manually. No POC bridge deployment exists; staging is the only reference.
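The startup-migration approach from the B2 commit above (embedded SQL, explicit order, idempotent statements) might look roughly like the following. This is a sketch under assumptions, not the divine-bridge-db code: the migration names, table and column names, and the closure-based executor are stand-ins, and inline string literals replace the `include_str!` embedding so the example is self-contained.

```rust
/// Migrations in explicit application order. The real crate reportedly embeds
/// SQL files with `include_str!`; these names and statements are illustrative only.
const MIGRATIONS: &[(&str, &str)] = &[
    (
        "001_example_initial",
        "CREATE TABLE IF NOT EXISTS example_table (id BIGINT PRIMARY KEY);",
    ),
    (
        "004_example_scheduler",
        "ALTER TABLE example_table ADD COLUMN IF NOT EXISTS example_column TEXT;",
    ),
];

/// Apply every migration in order through `exec` (a hypothetical executor,
/// for example a wrapper around a database batch-execute call).
/// Stops at the first failure and reports which migration failed.
fn run_pending_migrations<F>(mut exec: F) -> Result<usize, String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    for &(name, sql) in MIGRATIONS {
        exec(sql).map_err(|e| format!("migration {name} failed: {e}"))?;
    }
    Ok(MIGRATIONS.len())
}

/// Cheap guard: every statement must carry an idempotency marker,
/// so re-running against a partially migrated schema is harmless.
fn looks_idempotent(sql: &str) -> bool {
    sql.split(';')
        .map(str::trim)
        .filter(|stmt| !stmt.is_empty())
        .all(|stmt| stmt.contains("IF NOT EXISTS") || stmt.contains("SET DEFAULT"))
}

fn main() {
    let mut executed: Vec<String> = Vec::new();
    // Run twice to mimic a restart against an already-migrated database.
    for _ in 0..2 {
        let applied = run_pending_migrations(|sql| {
            executed.push(sql.to_string());
            Ok(())
        })
        .expect("stub executor never fails");
        assert_eq!(applied, MIGRATIONS.len());
    }
    assert_eq!(executed.len(), 2 * MIGRATIONS.len());
    assert!(MIGRATIONS.iter().all(|&(_, sql)| looks_idempotent(sql)));
}
```

Listing the migrations in a constant, instead of scanning a directory at runtime, is what makes the order explicit and independent of the container working directory.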
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): ATProto production promotion plan (staging->prod)
Evidence-based promotion plan grounded in the live staging<->prod delta:
- A: create 25 missing *-production secrets in dv-platform-prod (staging is the key-set template; values must be prod-unique)
- B: fix 3 IaC overlay bugs (rsky-pds patches only 3/8 secret keys; labeler PRODUCTION_DID_PLACEHOLDER; atbridge VIDEO_SERVICE_ENABLED unset)
- C: pin 5 prod images off latest (atbridge must include self-migration 048c293)
- D: deploy, verify startup migration + e2e crosspost
- E: wire rsky-pds entryway env for third-party ATProto login
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): correct false "staging works" premise; add Chunk F (prove e2e first)
Data-plane check disproved the promotion premise: staging bridge has NEVER crossposted - account_links=0 (incl pending), publish_jobs=0, published record_mappings=0, and the staging relay wss://relay.staging.dvines.org is currently DOWN. So the e2e path is unproven anywhere; "promotion, not a build" was verified at the pod level only.
- Add Chunk F: fix staging relay + prove ONE real crosspost in staging before touching prod (gives a known-good reference).
- Reorder: F first.
- Task A2: guard against minting fresh PLC/signing keys if prod PDS DB already has repos (orphan risk) - zero pods != empty DB.
- Mark this plan authoritative over the 05-30 plan's overlapping prod chunks.
- Runbook: downgrade "proven template" -> "config template, not proof of function".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(atbridge): POST PLC genesis op to /:did, not / (provisioning root cause)
create_did derived the did:plc but POSTed the genesis operation to the bare root `/` instead of `/:did` as the PLC HTTP API requires.
plc.directory returns 404 on `POST /`, which the bridge wrapped as a 500, so account provisioning has NEVER succeeded (staging account_links=0, zero crossposts ever). Verified empirically against live plc.directory from the sky namespace: `POST /` -> 404, `POST /did:plc:...` -> 400 "Not a valid operation" (correct route, reached the validator). Fix: POST to did_endpoint(&derived_did). Removed now-unused endpoint(). TDD: new test create_did_posts_to_the_derived_did_path pins the route; updated the create_did_* unit tests and the provision_api integration test, which had all mocked the buggy `POST /` and thus masked the bug. 126 tests pass, clippy -D warnings + fmt clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): record PLC root cause; F2 blocked on fix; add Chunk G (error visibility)
- Chunk F2: first staging provision attempt FAILED on the PLC /:did bug (now fixed in code); a staging image from that commit must deploy before e2e works.
- Note staging ATPROTO_PROVISIONING_TOKEN=placeholder.
- Add Chunk G: provisioning errors are swallowed (Display not {:#}), making failures undebuggable in prod; fix with TDD.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(atbridge): derive did:plc from the SIGNED genesis op, not unsigned
derive_did_plc hashed the unsigned operation (sig stripped), but the did:plc spec and the rsky reference compute base32(sha256(SIGNED op))[..24]. Since the PLC directory returns an empty body on create, create_did falls back to the derived DID AND posts the genesis op to POST /:did, so a wrong (unsigned) derivation both mislabels the account and gets rejected by the directory (DID/op mismatch). This compounds with the /:did route fix: provisioning needs BOTH to succeed. Keep unsigned_operation_bytes for signing (you sign the unsigned op); add signed_operation_bytes for derivation.
TDD: derive_did_plc_hashes_the_signed_operation pins base32(sha256(signed dagcbor))[..24] and that the sig affects the DID. 127 tests pass, clippy -D warnings + fmt clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): record 2nd PLC fix + audited next wall (PDS createAccount/activate)
Both PLC bugs (route + signed-op derivation) fixed and TDD-covered. Documented the next likely failure after they deploy: rsky-pds createAccount with a supplied did needs admin auth, maybe password/invite, and creates the account DEACTIVATED (likely needs an activate step). Flagged to verify against the live PDS rather than speculatively patch an unexercised path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): audit walls 2-4 - confirmed activate-account gap + open publish-auth Q
Audited the never-run provision/publish path deeper against the rsky reference:
- Wall 3 (CONFIRMED bug): rsky creates did-import accounts deactivated=true; the bridge never calls activateAccount, so repos won't serve records. Fix is concrete (capture createAccount accessJwt, call activateAccount as that DID) but held until staging confirms the deactivated state empirically.
- Wall 4 (OPEN): publisher writes records with a single shared Bearer token, but repo writes must be authed as the repo DID; needs the auth_verifier behavior or a maintainer's design intent before building.
Deliberately not patching blind; documented for the live round-trip.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(runbook): turnkey build+deploy for the atbridge PLC-fix image (Gate 1)
Paste-ready build/push/overlay-bump/verify sequence with the real staging + prod registry paths, so whoever has registry access can ship the fix and unblock the e2e crosspost test.
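The two PLC fixes recorded above (derive the DID from the signed operation, then POST to `/:did`) can be illustrated with a minimal sketch. Several things here are assumptions: the Rust standard library has no SHA-256, so the hash is passed in as a parameter and the demo uses a stand-in that is NOT SHA-256; the input is treated as already-encoded signed operation bytes (no DAG-CBOR encoding is shown); and the function bodies are illustrative, not the atbridge implementation, even though the names `derive_did_plc` and `did_endpoint` echo the commit message.

```rust
/// RFC 4648 base32 alphabet in lowercase; did:plc uses it without padding.
const ALPHABET: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";

/// Unpadded lowercase base32 encoding.
fn base32_lower(data: &[u8]) -> String {
    let mut out = String::new();
    let mut buffer: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of not-yet-emitted bits in `buffer`
    for &byte in data {
        buffer = (buffer << 8) | u32::from(byte);
        bits += 8;
        while bits >= 5 {
            bits -= 5;
            out.push(ALPHABET[((buffer >> bits) & 31) as usize] as char);
        }
    }
    if bits > 0 {
        // Left-align the trailing bits into a final 5-bit group.
        out.push(ALPHABET[((buffer << (5 - bits)) & 31) as usize] as char);
    }
    out
}

/// did:plc = "did:plc:" + first 24 chars of base32(hash(signed op bytes)).
/// `sha256` is injected; callers are assumed to pass a real SHA-256.
fn derive_did_plc<H>(signed_op_bytes: &[u8], sha256: H) -> String
where
    H: Fn(&[u8]) -> [u8; 32],
{
    let encoded = base32_lower(&sha256(signed_op_bytes));
    format!("did:plc:{}", &encoded[..24])
}

/// Build the `/:did` URL the genesis operation is POSTed to.
fn did_endpoint(base: &str, did: &str) -> String {
    format!("{}/{}", base.trim_end_matches('/'), did)
}

/// Stand-in digest for the demo only. This is NOT SHA-256.
fn stand_in_hash(bytes: &[u8]) -> [u8; 32] {
    let mut digest = [0u8; 32];
    for (i, b) in bytes.iter().enumerate() {
        digest[i % 32] ^= *b;
    }
    digest
}

fn main() {
    let did_a = derive_did_plc(b"op-with-sig-a", stand_in_hash);
    let did_b = derive_did_plc(b"op-with-sig-b", stand_in_hash);
    assert_eq!(did_a.len(), "did:plc:".len() + 24);
    // Changing the signed bytes changes the DID.
    assert_ne!(did_a, did_b);
    println!("{}", did_endpoint("https://plc.directory/", &did_a));
}
```

Because the signature is part of the hashed bytes, two operations that differ only in their signature yield different DIDs, which is the property the `derive_did_plc_hashes_the_signed_operation` test is described as pinning.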
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): live-PDS evidence confirms Wall 3 (24/33 staging repos inactive)
Probed the deployed staging PDS: 33 repos with *.staging.dvines.org handles (bridge HAS provisioned historically, which corrects "never succeeded"), but 24 are active:false, live proof that the activate-account gap (Wall 3) is real and common. Active repos only hold empty smoke-test stubs; no genuine video crosspost has ever been produced. Deployed image 3ba324c still has the `/` PLC bug, so provisioning has been broken since it deployed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Test Plan