Skip to content

feat(tern): shard-aware fail-closed drift guard on materialize#596

Open
Kiran01bm wants to merge 3 commits into
mainfrom
kiran01bm/dg-4-drift-comparator
Open

feat(tern): shard-aware fail-closed drift guard on materialize#596
Kiran01bm wants to merge 3 commits into
mainfrom
kiran01bm/dg-4-drift-comparator

Conversation

@Kiran01bm

Copy link
Copy Markdown
Collaborator

Summary

Restores the materialize-time drift guard that was removed in #535, this time shard-aware. A non-primary deployment that never planned locally must not silently apply the primary's reviewed DDL against a schema that has drifted; this re-adds the fail-closed check, now compatible with per-shard apply fan-out.

What

  • Recompute this deployment's own diff against its live schema at materialize time and refuse unless it exactly matches the dispatched (reviewed) DDL, compared as an exact multiset keyed by (namespace, shard, table, operation, canonical DDL).
  • A shard-scoped dispatch is compared against the re-plan restricted to its shard; a whole-deployment dispatch against the non-sharded changes. VSchema parity is checked per namespace for whole-deployment applies.
  • DDL that cannot be parsed, a nil change, or a malformed multi-shard dispatch all fail closed.

Why

#535 deleted the guard because it rejected shard-scoped applies, which were incompatible with per-shard fan-out. The interim reconciliation (re-plan at apply/resume) is not fail-closed: it silently completes drifted tables or applies locally recomputed DDL that no reviewer saw, breaking the "what gets applied is what was reviewed" boundary on every non-primary deployment. Making the comparator shard-aware lets the guard return.

    materialize remote plan (non-primary, no local plan)
                      |
                      v
      re-plan THIS deployment vs its live schema
      (restricted to the dispatch's shard scope)
                      |
           exact multiset match vs reviewed DDL?
             |- yes -> materialize -> apply
             |- no  -> FAIL CLOSED (never persisted, never applied)

#535 removed the materialize-time drift guard because it rejected
shard-scoped applies, leaving non-primary deployments to silently apply
unreviewed DDL. Restore it shard-aware: compare the dispatched reviewed
DDL against this deployment's re-plan restricted to the dispatch's shard,
failing closed on any mismatch.
Copilot AI review requested due to automatic review settings June 30, 2026 04:43

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reintroduces a fail-closed “materialize-time” drift guard for non-primary deployments in Tern, updated to support per-shard apply fan-out by comparing recomputed local plans against the dispatched (reviewed) DDL on a shard-aware basis.

Changes:

  • Add shard-aware drift comparison utilities (exact multiset match keyed by namespace/shard/table/op/canonical DDL) plus VSchema parity checks for whole-deployment dispatches.
  • Invoke the drift guard during apply-request plan materialization (non-primary path) to prevent applying reviewed DDL against drifted live schema.
  • Add unit tests covering match, canonicalization tolerance, shard-scoped behavior, malformed shard scope, VSchema parity, and recompute errors.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
pkg/tern/local_plan_drift.go Implements shard-aware drift guard (multiset comparison + optional VSchema parity).
pkg/tern/local_plan_drift_guard_test.go Adds unit tests for drift guard behavior across sharded/non-sharded and VSchema cases.
pkg/tern/local_client.go Calls drift guard during plan materialization from dispatch requests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/tern/local_plan_drift.go
canonicalDDLForDrift relied on helpers that canonicalize only the first
parsed statement, so a multi-statement payload could mask drift on the
trailing statements. Enforce exactly one statement.
@Kiran01bm Kiran01bm marked this pull request as ready for review June 30, 2026 06:44
@Kiran01bm Kiran01bm requested review from aparajon and morgo as code owners June 30, 2026 06:44
@aparajon

Copy link
Copy Markdown
Collaborator

Review summary

Verified the property #535 was worried about: the shard key is symmetric across the comparison. Both sides of the multiset read the shard from the same engine.SchemaChange.Shard.Name the engine emits — the dispatch's TargetShards is rebuilt from that exact field (local_client.go shard extraction → ShardPlan.ShardTask.ShardtaskTargetShards), and the re-plan reads it directly in driftMultisetFromPlanResult. There's no independent re-derivation, so a legitimate shard-scoped apply won't permanently fail closed. Also confirmed: this genuinely fixes the gap #535 opened — the old hard-reject of shard-scoped applies is replaced by a shard-restricted re-plan comparison, so those applies are now checked rather than refused; the namespace-level routing (group-finalizer) bypasses the guard entirely, so it isn't mis-compared as shard work; and the fail-closed direction is right (unparseable DDL, nil change, multi-statement payload, empty/whitespace shard, and >1 target shard all refuse to materialize). No tests were removed or weakened, and the multi-statement-DDL concern is addressed on head via statement.Classify.

A few non-blocking items:

  1. Tell the operator how to recover (worth landing). The fail-closed error states what drifted but not what to do next. A blocked operator currently hits a wall. Suggest the message point at the recovery path — e.g. "re-plan against the current schema to refresh the reviewed plan, then re-apply" — so there's a defined next step.

  2. Document the engine contract the symmetry rests on. The guard's correctness for sharded engines presumes the engine emits a stable Shard.Name equal to the value the fan-out keyed its operations on. That assumption already governs the existing shard plumbing, but a one-line note on driftChangeKey would make it explicit for future readers.

  3. VSchema parity compares two representations across a boundary. The re-plan side keys on Metadata["vschema_changed"] == "true" while the dispatch side keys on the proto CHANGE_TYPE_VSCHEMA. Consistent today, and it diverges in the fail-closed direction, so it's safe — just a soft coupling worth a comment.

  4. Forward-compat for shard-drift reconciliation. This guard couples "what may be applied" to "what re-planning live-vs-desired reproduces." That's fully compatible with reconciling shard drift by re-planning each drifted shard against the canonical/desired schema and reviewing the resulting per-shard repair — the guard makes that flow safe rather than blocking it. The one thing to keep in mind: if a future reconciliation path ever applies repair DDL derived some other way (e.g. shard-to-shard comparison, or replaying a stored repair) that a live-vs-desired re-plan wouldn't recompute, it would need to route through a normal reviewed plan or get an explicit reconciliation mode in the guard. Worth noting so the boundary stays intentional.

Verdict: LGTM with nits — the shard-key symmetry holds, it restores the reviewed-DDL boundary it set out to, and it doesn't constrain a future shard-drift reconciliation feature. Item 1 is the only one I'd push to land with the merge.

🤖 This review was generated by Claude Code (claude-opus-4-8) with maintainer approval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants