
docs: update atproto auth server runbooks#4

Merged
rabble merged 4 commits into main from plan/phase2-atproto-auth-server
Mar 29, 2026

Conversation

@rabble
Member

@rabble rabble commented Mar 28, 2026

Summary

  • update the Phase 2 runbooks for delegated ATProto auth-server discovery, PAR, DPoP-bound token exchange, refresh rotation, and revocation behavior
  • add a dedicated smoke-test runbook for end-to-end delegated auth-server verification
  • record the refresh-and-DPoP completion plan used to finish the Phase 2 protocol slice

Test Plan

  • documentation review against the implemented keycast and rsky worktrees

@rabble rabble merged commit 04e4a4d into main Mar 29, 2026
2 checks passed
rabble added a commit that referenced this pull request Jun 5, 2026
…diness (#12)

* docs: add entryway phase 2 boundary plan

* docs: reconcile atproto rollout runbooks

* feat: change crosspost_enabled default to TRUE for new accounts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add enable_account_link query and store method

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add POST /enable endpoint to re-enable crossposting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add BAD_GATEWAY test for enable endpoint sync failure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: accept optional crosspost_enabled in opt-in request

Threads crosspost_enabled from the HTTP request body through to the DB
upsert, eliminating the race window where a user unchecks "Publish to
Bluesky" before the account is created. Field defaults to true so
existing callers are unaffected.
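
A minimal sketch of the defaulting rule described above, assuming the request body carries the flag as an optional boolean. The `OptInRequest` struct and `resolve_crosspost_enabled` helper are hypothetical names used for illustration; only the `crosspost_enabled` field name comes from the commit message.

```rust
/// Hypothetical shape of the opt-in request body. Only the
/// `crosspost_enabled` field name is taken from the commit message.
#[derive(Debug, Default)]
struct OptInRequest {
    /// `None` models a caller that omitted the field entirely.
    crosspost_enabled: Option<bool>,
}

/// Resolve the value that would be written by the DB upsert:
/// an absent field means "enabled", so existing callers keep their behavior.
fn resolve_crosspost_enabled(req: &OptInRequest) -> bool {
    req.crosspost_enabled.unwrap_or(true)
}
```

Passing the resolved value into the same upsert that creates the account is what closes the race window: there is no second write that a quick "uncheck" could lose to.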

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: verify opt-in with crosspost_enabled=false creates disabled account

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: apply cargo fmt across workspace (baseline cleanup)

The repo's launch-checklist preflight requires `cargo fmt --check`, but main
had accumulated 31 unformatted hunks across divine-atbridge, divine-handle-
gateway, and divine-localnet-admin. Run `cargo fmt --all` to clear them.

Pure formatting, no behavior change. Kept separate from feature work per
AGENTS.md (no mixing formatting churn into scoped changes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(rollout): add per-chunk sub-plans and fold in verified swarm findings

10 detailed per-chunk implementation sub-plans under
docs/superpowers/plans/2026-05-30-rollout-chunks/, produced by a 26-agent
decomposition swarm and adversarially verified.

Master plan updated with four corrections (3 independently re-verified):
- CRITICAL: PDS_OAUTH_AUTHORIZATION_SERVER and PDS_ENTRYWAY_DID are unset in
  rsky-pds IAC, so /.well-known/oauth-protected-resource 404s and entryway
  token-trust is inert regardless of image. Chunk B now wires env vars and
  precedes A; Chunk D is blocked until PDS_ENTRYWAY_DID exists.
- E: mobile blueskyPublishing flag exists (off); only enableAtprotoPublishing
  /atprotoPublishing are phantom.
- H: no app-level rate limiting in keycast/rsky; enforce at NGINX edge.
- A: smoke script had no token_endpoint assertion; A2 sweep is a no-op
  (remaining /api/oauth refs are legitimate generic-oauth/rejected-option).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(smoke): assert real /api/atproto/oauth endpoints (Chunk A1)

The entryway authorization-server metadata advertises the ATProto OAuth
flow at /api/atproto/oauth/{authorize,par} (verified against keycast
origin/main + deployed bd92361). The script asserted /api/oauth/*, which
would fail against the live server. /api/oauth/* remains keycast's generic
Nostr/UCAN OAuth.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(rollout): fix ATProto auth-server metadata path in boundary plan (Chunk A2)

The auth-server-metadata test example asserted /api/oauth/* but the example
is unambiguously the ATProto profile (scopes_supported: atproto,
require_pushed_authorization_requests), so it must be /api/atproto/oauth/*.
This is the single genuine A2 reference; other /api/oauth/* refs describe
keycast's generic Nostr/UCAN OAuth and are left intact. Master plan
correction #4 updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(atbridge): add lease-expiry + failed-backfill watchdog metrics (Chunk G2)

Adds WatchdogMetrics (expired_leases, failed_backfills) to RuntimeHealthState
and a read-only count_watchdog_metrics() that counts publish_jobs stuck
in_progress past their lease and account_links with publish_backfill_state=
'failed' — the launch-checklist ops queries, surfaced on /metrics and logged
by a spawned poll loop. The Diesel query is offloaded via spawn_blocking.

Verified with `cargo check -p divine-atbridge` (passes). The new code is
clippy-clean; the workspace `clippy -D warnings` gate is NOT green due to
PRE-EXISTING lints in unrelated files (tracked in a separate commit).
Integration coverage needs TEST_DATABASE_URL (no DB in this env), so the
runtime poll path is type-checked only, not exercised.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: clear two pre-existing clippy -D warnings lints

The launch-checklist preflight requires clippy --all-targets -- -D warnings,
but it was already red from merged code (not from this branch's work):
- queries.rs cancel_publish_job: collapsible_if — the outer
  (Published || Skipped) guard is redundant since the inner arm requires
  Skipped; collapsed to the single equivalent condition.
- video_service.rs: unnecessary_lazy_evaluations — unwrap_or_else(|| x)
  -> unwrap_or(x) for a non-lazy value.
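
The two lint fixes above can be illustrated with a small before/after sketch. The `JobState` enum, the `can_cancel_*` functions, and the `title_*` helpers are hypothetical stand-ins, not the code in queries.rs or video_service.rs; they only mirror the shape of each lint.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum JobState {
    Pending,
    Published,
    Skipped,
}

/// Before: the outer guard is redundant because the inner check already
/// requires `Skipped` (this is the shape clippy reports as collapsible_if).
fn can_cancel_nested(state: JobState) -> bool {
    if state == JobState::Published || state == JobState::Skipped {
        if state == JobState::Skipped {
            return true;
        }
    }
    false
}

/// After: the single equivalent condition.
fn can_cancel_collapsed(state: JobState) -> bool {
    state == JobState::Skipped
}

/// Before: a closure wrapping a constant (unnecessary_lazy_evaluations).
fn title_lazy(title: Option<&str>) -> &str {
    title.unwrap_or_else(|| "untitled")
}

/// After: the constant is passed directly, since nothing is computed lazily.
fn title_eager(title: Option<&str>) -> &str {
    title.unwrap_or("untitled")
}
```

In both pairs the two versions return the same result for every input, which is why the commit can describe the change as behavior-preserving.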

Behavior-preserving. Does not clear ALL workspace lints (a separate
await_holding_lock in a test remains); scoped to the two trivial fixes
adjacent to this branch's changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(atbridge): make watchdog opt-in, off by default (Chunk G2 follow-up)

The initial G2 commit wired spawn_watchdog unconditionally into the prod
spawn() entry point — a live 30s DB-poller on every atbridge startup. That
exceeded the intended scope (a passive metric). Gate it behind a new
WATCHDOG_ENABLED config flag (default false), mirroring VIDEO_SERVICE_ENABLED,
with WATCHDOG_INTERVAL_SECS (default 30) for cadence. Raw PgConnection::establish
per tick matches the existing atbridge norm (runtime.rs, provision_runtime.rs).
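
A rough sketch of how an opt-in flag with these defaults might be read, assuming plain environment variables. The `WatchdogConfig` struct, both function names, and the set of accepted truthy strings ("1" and "true") are assumptions for illustration; only the variable names and defaults (false, 30) come from the commit message.

```rust
use std::time::Duration;

/// Hypothetical config holder; not the bridge's actual config type.
#[derive(Debug, PartialEq)]
struct WatchdogConfig {
    enabled: bool,
    interval: Duration,
}

/// Build the config from any key lookup, so the parsing can be exercised
/// without mutating the process environment.
fn watchdog_config_from(lookup: impl Fn(&str) -> Option<String>) -> WatchdogConfig {
    // Off by default: the poller only starts when explicitly enabled.
    // Accepting "1" and "true" is an assumption of this sketch.
    let enabled = lookup("WATCHDOG_ENABLED")
        .map(|v| matches!(v.trim().to_ascii_lowercase().as_str(), "1" | "true"))
        .unwrap_or(false);

    // Fall back to 30 seconds when the variable is unset or unparsable.
    let interval_secs = lookup("WATCHDOG_INTERVAL_SECS")
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(30);

    WatchdogConfig {
        enabled,
        interval: Duration::from_secs(interval_secs),
    }
}

/// Convenience wrapper that reads the real process environment.
fn watchdog_config_from_env() -> WatchdogConfig {
    watchdog_config_from(|key| std::env::var(key).ok())
}
```

A caller would check `enabled` before spawning the poll loop, so a deployment that sets neither variable never opens the extra DB connection.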

Verified: cargo check --all-targets passes; cargo clippy --lib -- -D warnings
clean. (Pre-existing await_holding_lock lints in test files remain, out of
scope.) Runtime poll path still needs TEST_DATABASE_URL to exercise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(atbridge): silence false-positive clippy lints in DB test harness

Clears the remaining clippy -D warnings failures so the launch-checklist
preflight passes for the atbridge/bridge-db crates. All pre-existing in
merged test code (PR #7), all false positives:
- await_holding_lock (5 files): tests hold a process-wide marker Mutex<()>
  across .await to serialize access to the shared test database; the guard
  carries no data, so this is intentional and safe. Scoped file-level allow.
- too_many_arguments: seed_account mirrors the account_links schema; the arg
  count is inherent, not a smell.

Behavior unchanged. Does not touch the workspace-wide pre-existing
cargo-fmt drift (left out to keep this branch scoped per AGENTS.md).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(atbridge): self-apply bridge DB migrations on startup (B2)

The bridge DB had NO migration automation: no db-migrate Job in IaC, no
embedded runner, and Dockerfile CMD never applied the copied migrations/. So
after PR #7 the running bridge could reference 004_publish_job_scheduler
columns the live schema lacked, silently breaking publishing (reproduced:
"column nostr_pubkey does not exist" against a 001-only DB).

Fix: the bridge now applies its owned migrations on startup before touching
the DB (divine-bridge-db::run_pending_migrations, called from main). The SQL
is made idempotent (CREATE TABLE/INDEX IF NOT EXISTS, ADD COLUMN IF NOT
EXISTS, ALTER COLUMN SET DEFAULT) so it is safe against any existing schema
state — fresh, partial, or fully hand-migrated prod. Embedded via include_str!
so it works regardless of container CWD, applied in explicit order which also
sidesteps the duplicate 004_ directory prefix without renaming files.

Test migrations_are_idempotent_across_db_states covers fresh, re-run, and the
real prod partial-state (001 only -> adds 004 columns). Full suite: 125 pass,
clippy -D warnings clean, fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(runbook): update crosspost blockers — placeholder fixed, secrets still missing

After the first IaC pass: the ENVIRONMENT-placeholder issue (B1b) is fixed — live
ExternalSecrets now request correct *-production keys. The remaining blocker (B1a)
is that the GCP Secret Manager secrets still don't exist (ESO: "Secret does not
exist", confirmed NotFound not IAM). Added the exact complete checklist of all 25
missing secret keys across atbridge/handle-gateway/rsky-pds, pulled live from the
cluster.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(runbook): staging is the proven template for prod crosspost promotion

Verified the bridge runs healthy in dv-platform-staging (atbridge/handle-gateway/
rsky-pds all 1/1 Running, ExternalSecrets SecretSynced=True). So prod is a
deployment/secrets gap, not a code issue. Promotion = recreate the staging secret
key set with -production suffix in dv-platform-prod. Also flagged: staging runs an
older image (3ba324c) that predates the B2 self-migration, so prod must pin a
B2-containing build (048c293+) for startup migration to run, else apply schema
manually. No POC bridge deployment exists — staging is the only reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): ATProto production promotion plan (staging->prod)

Evidence-based promotion plan grounded in the live staging<->prod delta:
- A: create 25 missing *-production secrets in dv-platform-prod (staging is the
  key-set template; values must be prod-unique)
- B: fix 3 IaC overlay bugs (rsky-pds patches only 3/8 secret keys; labeler
  PRODUCTION_DID_PLACEHOLDER; atbridge VIDEO_SERVICE_ENABLED unset)
- C: pin 5 prod images off latest (atbridge must include self-migration 048c293)
- D: deploy, verify startup migration + e2e crosspost
- E: wire rsky-pds entryway env for third-party ATProto login

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): correct false "staging works" premise; add Chunk F (prove e2e first)

Data-plane check disproved the promotion premise: staging bridge has NEVER
crossposted — account_links=0 (incl pending), publish_jobs=0, published
record_mappings=0, and the staging relay wss://relay.staging.dvines.org is
currently DOWN. So the e2e path is unproven anywhere; "promotion, not a build"
was verified at the pod level only.

- Add Chunk F: fix staging relay + prove ONE real crosspost in staging before
  touching prod (gives a known-good reference).
- Reorder: F first.
- Task A2: guard against minting fresh PLC/signing keys if prod PDS DB already
  has repos (orphan risk) — zero pods != empty DB.
- Mark this plan authoritative over the 05-30 plan's overlapping prod chunks.
- Runbook: downgrade "proven template" -> "config template, not proof of function".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(atbridge): POST PLC genesis op to /:did, not / (provisioning root cause)

create_did derived the did:plc but POSTed the genesis operation to the bare
root `/` instead of `/:did` as the PLC HTTP API requires. plc.directory returns
404 on `POST /`, which the bridge wrapped as a 500 — so account provisioning
has NEVER succeeded (staging account_links=0, zero crossposts ever). Verified
empirically against live plc.directory from the sky namespace: `POST /` -> 404,
`POST /did:plc:...` -> 400 "Not a valid operation" (correct route, reached the
validator).

Fix: POST to did_endpoint(&derived_did). Removed now-unused endpoint(). TDD:
new test create_did_posts_to_the_derived_did_path pins the route; updated the
create_did_* unit tests and the provision_api integration test, which had all
mocked the buggy `POST /` and thus masked the bug. 126 tests pass, clippy -D
warnings + fmt clean.
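
For illustration, the route change amounts to building `<base>/<did>` instead of `<base>/`. The free-function signature below is an assumption (the commit only shows the call `did_endpoint(&derived_did)`, which suggests a method on a client type), and the DID in the usage note is a placeholder.

```rust
/// Build the PLC directory URL for a DID: `<base>/<did>`.
/// Sketch only: the real `did_endpoint` likely lives on a client struct
/// that already knows its base URL.
fn did_endpoint(base_url: &str, did: &str) -> String {
    // Trim a trailing slash so "https://host/" and "https://host" both work.
    format!("{}/{}", base_url.trim_end_matches('/'), did)
}
```

With a base of `https://plc.directory` and a placeholder DID `did:plc:example`, this yields `https://plc.directory/did:plc:example`, which is the path the genesis operation is POSTed to.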

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): record PLC root cause; F2 blocked on fix; add Chunk G (error visibility)

- Chunk F2: first staging provision attempt FAILED on the PLC /:did bug (now
  fixed in code); a staging image from that commit must deploy before e2e works.
- Note staging ATPROTO_PROVISIONING_TOKEN=placeholder.
- Add Chunk G: provisioning errors are swallowed (Display not {:#}), making
  failures undebuggable in prod — fix with TDD.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(atbridge): derive did:plc from the SIGNED genesis op, not unsigned

derive_did_plc hashed the unsigned operation (sig stripped), but the did:plc
spec and the rsky reference compute base32(sha256(SIGNED op))[..24]. Since the
PLC directory returns an empty body on create, create_did falls back to the
derived DID AND posts the genesis op to POST /:did — so a wrong (unsigned)
derivation both mislabels the account and gets rejected by the directory
(DID/op mismatch). This compounds with the /:did route fix: provisioning needs
BOTH to succeed.

Keep unsigned_operation_bytes for signing (you sign the unsigned op); add
signed_operation_bytes for derivation. TDD: derive_did_plc_hashes_the_signed_operation
pins base32(sha256(signed dagcbor))[..24] and that the sig affects the DID.
127 tests pass, clippy -D warnings + fmt clean.
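
The encoding half of `base32(sha256(signed op))[..24]` can be sketched with the standard library alone. This is an illustration, not the bridge's `derive_did_plc`: it takes an already-computed SHA-256 digest of the signed DAG-CBOR operation as input (hashing and CBOR encoding are out of scope here), and the function names are hypothetical.

```rust
/// RFC 4648 base32 alphabet, lowercased.
const BASE32_LOWER: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";

/// Encode bytes as lowercase base32 without padding.
fn base32_lower_nopad(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() * 8 + 4) / 5);
    let mut buffer: u32 = 0; // pending bits, most significant first
    let mut bits: u32 = 0; // number of pending bits in `buffer`
    for &b in bytes {
        buffer = (buffer << 8) | u32::from(b);
        bits += 8;
        while bits >= 5 {
            bits -= 5;
            let idx = (buffer >> bits) & 0x1f;
            out.push(BASE32_LOWER[idx as usize] as char);
        }
        // Keep only the bits not yet emitted so the buffer cannot overflow.
        buffer &= (1 << bits) - 1;
    }
    if bits > 0 {
        // Left-align the final partial group, zero-filling the low bits.
        let idx = (buffer << (5 - bits)) & 0x1f;
        out.push(BASE32_LOWER[idx as usize] as char);
    }
    out
}

/// Form a did:plc identifier from the SHA-256 digest of the SIGNED
/// genesis operation: "did:plc:" plus the first 24 base32 characters.
fn did_plc_from_digest(signed_op_sha256: &[u8; 32]) -> String {
    // 32 bytes always encode to 52 characters, so slicing to 24 is safe.
    let encoded = base32_lower_nopad(signed_op_sha256);
    format!("did:plc:{}", &encoded[..24])
}
```

Because the digest covers the signature bytes, hashing the unsigned operation produces a different identifier, which is the mismatch this commit fixes.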

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): record 2nd PLC fix + audited next wall (PDS createAccount/activate)

Both PLC bugs (route + signed-op derivation) fixed and TDD-covered. Documented
the next likely failure after they deploy — rsky-pds createAccount with a
supplied did needs admin auth, maybe password/invite, and creates the account
DEACTIVATED (likely needs an activate step). Flagged to verify against the live
PDS rather than speculatively patch an unexercised path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): audit walls 2-4 — confirmed activate-account gap + open publish-auth Q

Audited the never-run provision/publish path deeper against the rsky reference:
- Wall 3 (CONFIRMED bug): rsky creates did-import accounts deactivated=true; the
  bridge never calls activateAccount, so repos won't serve records. Fix is
  concrete (capture createAccount accessJwt, call activateAccount as that DID) but
  held until staging confirms the deactivated state empirically.
- Wall 4 (OPEN): publisher writes records with a single shared Bearer token, but
  repo writes must be authed as the repo DID — needs the auth_verifier behavior or
  a maintainer's design intent before building.
Deliberately not patching blind; documented for the live round-trip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(runbook): turnkey build+deploy for the atbridge PLC-fix image (Gate 1)

Paste-ready build/push/overlay-bump/verify sequence with the real staging +
prod registry paths, so whoever has registry access can ship the fix and
unblock the e2e crosspost test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): live-PDS evidence confirms Wall 3 (24/33 staging repos inactive)

Probed the deployed staging PDS: 33 repos with *.staging.dvines.org handles
(bridge HAS provisioned historically — corrects "never succeeded"), but 24 are
active:false — live proof the activate-account gap (Wall 3) is real and common.
Active repos only hold empty smoke-test stubs; no genuine video crosspost has
ever been produced. Deployed image 3ba324c still has the `/` PLC bug, so
provisioning has been broken since it deployed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
