Skip to content

feat: per-shard sharded Strata applies (Plan-stage fan-out + per-shard DDL)#535

Merged
morgo merged 13 commits into
block:mainfrom
morgo:morgo/sharded-apply-control-plane-fanout
Jun 26, 2026
Merged

feat: per-shard sharded Strata applies (Plan-stage fan-out + per-shard DDL)#535
morgo merged 13 commits into
block:mainfrom
morgo:morgo/sharded-apply-control-plane-fanout

Conversation

@morgo

@morgo morgo commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Replaces #531 (which fanned out on the data plane). Sharding is a Plan / control-plane concern: Plan declares what each shard needs (which won't always be the same — drift, or a partially-applied canary), the control plane fans that into per-shard operations and sequences the rollout, and the data plane executes one shard per dispatch.

A sharded Strata apply previously failed on the data plane with expected exactly one target shard, got 0, because a sharded plan dispatched over gRPC was sent whole. This PR makes the whole path work and simplifies the surrounding machinery.

What changed

1. Plan persists per-shard DDL

  • proto ShardPlan.changes carries each shard's own table changes across the gRPC boundary; storage.ShardPlan.Changes persists them in plan_data JSON.
  • LocalClient.Plan records each per-shard SchemaChange's changes; protoShardPlansToStorage reads them back. Because that per-shard data is remote/untrusted, both protoShardPlansToStorage and buildShardedApplyOperationGroups validate and fail closed before use — rejecting a nil/empty change, empty table or DDL, an unsupported change type, or a namespace that disagrees with the shard's namespace — rather than persisting or building from corrupt input.
  • A keyspace whose shards diverge (drift / partially-applied canary) is represented per shard rather than collapsed or rejected. The companion GAP change drops the strataengine's divergent-shard rejection.

2. needs_change removed — a shard is changing iff it has Changes

  • Drops the separate membership flag (proto + storage). It was a plan-time/apply-time TOCTOU: the flag was saved but the DDL was not. Now the reviewed per-shard DDL is persisted and is exactly what applies.

3. Fan-out gates on the plan, not a client capability

  • Removed ShardedApplyFanoutSupport (+ both impls and clientSupportsShardedApplyFanout). Fan-out gates solely on canBuildShardedOperationGroups (plan.Shards). Only the instance-local Strata engine emits per-shard ShardPlans, so an externally-authoritative engine (PlanetScale) — whose plans never carry per-shard changes — is never fanned out, regardless of transport.

4. Data plane honors the scoped per-shard dispatch

  • LocalClient.Apply builds tasks from the operation's own DdlChanges, tagged with TargetShards, so the engine gets exactly one target shard. A shard-scoped dispatch must carry valid, non-empty ddl_changes (scopedDispatchDDLChanges, fail closed) — and only table DDL (create/alter/drop): a VSchema change is keyspace-wide and rejected here. The single target shard is trimmed and validated (dispatchTargetShard), failing closed on >1 shard or an empty/whitespace shard. The non-sharded path is unchanged.

5. Task-less VSchema group_finalizer dispatched over gRPC

  • GRPCClient.ResumeApplyOperation now dispatches a finalizer op as a VSchema-only apply (no DDL, no target shards, carrying vschema.json) — the remote applies it via its task-less VSchema-only path — instead of stranding it on ErrNoTasksForApplyOperation.
  • Because a finalizer carries no task rows, the operator's task-derived operation→parent projection can never move its operation row. The drive therefore reflects the remote terminal state onto its own operation row (persistFinalizerOperationTerminalState: MarkCompleted/MarkFailed), mirroring the local driveGroupFinalizer; a retryable failure is left non-terminal for re-drive. Without this both terminal completion and failure were lost, stranding the operation in pending.

6. Drift guard removed (simplification)

  • Deleted verifyMaterializedPlanMatchesLiveSchema and its drift-multiset/vschema-parity helpers. It re-planned-and-compared on only the non-primary materialize path; the primary path already trusts its stored plan and reconciles via replanAndFilterTasks at apply/resume. The dispatch carries the reviewed plan verbatim, so the materialized plan is the reviewed plan, and drift is reconciled uniformly on every path (replan-and-filter + engine validation). This also removed the special case that rejected shard-scoped applies.
  • That replan-and-filter reconciliation is shard-aware: it keys each task's remaining DDL by (shard, table) (replanShardTableDDL), since a sharded re-plan repeats the same table across shards. Keying by table name alone would conflate per-shard tasks; a non-sharded engine emits an empty shard name, so it degrades to table-only and matches prior behavior.

Tests

  • scopedDispatchDDLChanges (scope honored / fails closed, incl. VSchema rejected); dispatchTargetShard (trim + validate single shard); buildShardedApplyOperationGroups per-shard DDL + skips shards without changes + fails closed on a malformed change; protoShardPlansToStorage carries per-shard changes + fails closed on nil/empty/unsupported-type/namespace-mismatch; replanShardTableDDL keys per (shard, table) and degrades to table-only when non-sharded; LocalClient.Plan records per-shard changes; GRPCClient.ResumeApplyOperation dispatches a finalizer as VSchema-only and reflects its terminal state (completed/failed) onto the operation row.
  • All pkg/tern, pkg/api, pkg/serve, pkg/storage suites pass.

Not in this PR (follow-ups)

  • Presentation: per-shard changes are on the plan/response, but rendering divergent shards in the PR-comment / approval view (collapse-when-uniform, expand-when-divergent) needs a UX decision.
  • Canary filtering: the operation model + ordering gate support sequencing a shard subset; the policy to do so is future work.
  • e2e for VSchema-over-gRPC: the finalizer dispatch is unit-tested for its request shape; the full remote VSchema apply + progress polling needs the k8s multideploy e2e to validate.

🤖 Generated with Claude Code

A sharded engine (e.g. GAP's Strata engine) requires exactly one target shard
per work operation. The control plane already fans a sharded plan out into one
apply_operation per shard (buildShardedApplyOperationGroups) and dispatches each
scoped to its shard, but only when the apply client advertises
ShardedApplyFanoutSupport — which GRPCClient did not, so a sharded plan
dispatched to a gRPC data plane was sent whole and rejected with "expected
exactly one target shard, got 0".

Fan-out belongs on the control plane: it owns rollout sequencing (canaries,
per-shard ordering) and per-shard drift reconciliation, which a data plane that
fans out for itself cannot do. So:

- GRPCClient.SupportsShardedApplyFanout reports true. The capability is
  unconditional and safe: apply-create only fans out when the stored plan
  carries changing shards, which non-sharded engines never produce, so a
  non-sharded plan is dispatched whole exactly as before.
- LocalClient.Apply honors the per-shard scoped dispatch: it builds tasks from
  the operation's own req.DdlChanges (falling back to the whole plan for a local
  apply) and tags them with req.TargetShards when the dispatch targets a single
  shard, so the engine receives exactly one target shard. No data-plane fan-out
  and no operator claim-mode change — the data plane keeps its inline,
  apply-level drive, one shard per dispatch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@morgo morgo requested a review from aparajon as a code owner June 25, 2026 21:04
Copilot AI review requested due to automatic review settings June 25, 2026 21:04
@morgo morgo requested a review from Kiran01bm as a code owner June 25, 2026 21:04

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR moves sharded-apply fan-out to the control plane for gRPC data-plane executes, by having the gRPC client advertise sharded fan-out support and updating the data-plane LocalClient.Apply path to honor per-operation scoped dispatch (DDL subset + single target shard).

Changes:

  • Add GRPCClient.SupportsShardedApplyFanout() so control-plane apply creation can fan sharded plans out into per-shard operations when using gRPC.
  • Update LocalClient.Apply to build tasks from ApplyRequest.DdlChanges when present (scoped dispatch), and tag tasks with a single dispatched shard from ApplyRequest.TargetShards.
  • Add focused tests for the new capability advertisement and the scoped-DDL vs plan-fallback behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
pkg/tern/sharded_apply_dispatch_test.go Adds tests asserting gRPC fan-out capability and that LocalClient honors dispatched DDL scope.
pkg/tern/local_client.go Builds tasks from dispatched DdlChanges and tags tasks with a single dispatched shard.
pkg/tern/grpc_client.go Advertises sharded-apply fan-out capability for the gRPC client.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/tern/local_client.go Outdated
Comment thread pkg/tern/local_client.go Outdated
Comment thread pkg/tern/grpc_client.go Outdated
Plan is where sharding is declarative: a keyspace's shards won't always need
the same change (drift, or a partially-applied canary rollout). Previously the
plan collapsed every changing shard to one namespace-level DDL set and the
fan-out applied that uniform DDL to each shard. Carry each shard's own changes
end to end instead.

- proto: ShardPlan gains `repeated TableChange changes`, so per-shard DDL
  crosses the gRPC boundary to the control plane that fans out.
- storage.ShardPlan gains `Changes []TableChange` (persisted in plan_data JSON;
  empty means "use the namespace-level changes", so old plans and the uniform
  case are unchanged).
- LocalClient.Plan records each per-shard SchemaChange's table changes onto its
  ShardPlan (storage + response); protoShardPlansToStorage reads them back.
- buildShardedApplyOperationGroups builds each shard's tasks from that shard's
  own changes, falling back to the namespace-level changes when a shard carries
  none.

The data-plane apply path already drives each operation from its own
DdlChanges, so divergent per-shard DDL flows through to execution unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@morgo morgo changed the title feat(tern): control-plane sharded apply fan-out over gRPC feat: per-shard sharded Strata applies (control-plane fan-out + per-shard DDL) Jun 25, 2026
morgo and others added 3 commits June 25, 2026 15:45
LocalClient.Apply honored req.DdlChanges for every dispatch, which changed the
non-sharded path (it had used the stored plan) and regressed TestK8s_Progress:
a non-sharded mysql apply reported the wrong table because it built tasks from
the request rather than the plan.

Gate the per-shard behavior on the dispatch actually being sharded: only when
the request carries target shards does Apply drive the operation's own
DdlChanges and tag tasks with the shard. A whole-deployment or non-sharded
apply uses the stored plan unchanged — byte-for-byte the prior behavior. A
dispatch with more than one target shard is malformed (the fan-out emits one
shard per operation) and now fails closed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nges

ShardPlan.needs_change was a separate membership flag carried over from the
pre-per-shard design, where the plan recorded that a shard needed a change but
not the change itself — a plan-time/apply-time TOCTOU (the saved flag could
disagree with a re-derived DDL at apply time). Now that each shard carries its
own changes, membership is implied: a shard is changing iff len(changes) > 0.

- proto: ShardPlan drops needs_change (field 3 reserved).
- storage.ShardPlan drops NeedsChange.
- changingShardsByNamespace filters on len(Changes) > 0.
- buildShardedApplyOperationGroups drops the now-unreachable namespace-level
  fallback (every changing shard carries its own changes) and its taskChanges
  parameter.
- LocalClient.Plan / protoShardPlansToStorage stop setting NeedsChange.

A shard with no changes simply produces no operation. Tests updated to mark
changing shards by their changes rather than a flag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.

Files not reviewed (1)
  • pkg/proto/ternv1/tern.pb.go: Generated file

Comment thread pkg/api/proto_helpers.go Outdated
Comment thread pkg/storage/types.go
Comment thread pkg/api/plan_handlers.go
Comment thread pkg/tern/local_client.go
morgo and others added 3 commits June 25, 2026 16:58
…nalizers over gRPC

Addresses the review feedback on the per-shard apply work by removing redundant
machinery and closing the one real gap. No existing users depend on the removed
paths, so this refactors freely rather than preserving compatibility shims.

1. Remove the materialize-time drift guard
   verifyMaterializedPlanMatchesLiveSchema re-planned against live schema and
   fanned closed on any difference, but only on the non-primary materialize
   path — the primary path already trusts its stored plan and reconciles via
   replanAndFilterTasks at apply/resume. The dispatch carries the reviewed plan's
   DDL verbatim, so the materialized plan IS the reviewed plan; drift between
   review and apply is reconciled the same way on every path (replan-and-filter
   plus engine validation at execution). This also removes the special-case that
   rejected shard-scoped applies.

2. Gate fan-out on the plan, not a client capability
   Drop ShardedApplyFanoutSupport (+ GRPCClient/LocalClient impls and
   clientSupportsShardedApplyFanout). Fan-out is gated solely on
   canBuildShardedOperationGroups (plan.Shards). Only an instance-local sharded
   engine (Strata) emits per-shard ShardPlans, so an externally-authoritative
   engine (PlanetScale) — whose plans never carry per-shard changes — is never
   fanned out, regardless of transport. This removes the unconditional
   GRPCClient.SupportsShardedApplyFanout() the review flagged.

3. Dispatch task-less group_finalizer operations over gRPC
   GRPCClient.ResumeApplyOperation treated a finalizer's empty task set as an
   error, stranding VSchema-changing sharded applies on a remote data plane. It
   now dispatches the namespace VSchema as a VSchema-only apply (no DDL, no
   target shards, carrying the vschema.json), which the data plane applies via
   its task-less VSchema-only path — the remote counterpart to
   LocalClient.driveGroupFinalizer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two review fixes for the per-shard dispatch path:

- protoShardPlansToStorage now returns an error on a nil entry in
  ShardPlan.changes instead of silently skipping it. It is data-plane input over
  gRPC, so a nil change is malformed and must not yield an incomplete plan.
- A shard-scoped apply (TargetShards set) now requires valid, non-empty
  ddl_changes (scopedDispatchDDLChanges): non-nil entries with a non-empty
  table/DDL and a supported change type. It no longer falls back to the whole
  stored plan, which could have applied unrelated tables on the targeted shard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@morgo morgo requested a review from Copilot June 26, 2026 01:18
@morgo morgo changed the title feat: per-shard sharded Strata applies (control-plane fan-out + per-shard DDL) feat: per-shard sharded Strata applies (Plan-stage fan-out + per-shard DDL) Jun 26, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 3 comments.

Files not reviewed (1)
  • pkg/proto/ternv1/tern.pb.go: Generated file

Comment thread pkg/api/plan_handlers.go
Comment thread pkg/api/plan_handlers.go
Comment thread pkg/api/proto_helpers.go
…alidation

- Resume/recovery re-plan now keys remaining DDL by (shard, table) instead of
  table name alone. A sharded re-plan repeats the same table across shards, so
  table-only keying conflated per-shard tasks (one shard's diff could keep
  another shard's task active or overwrite its DDL). Non-sharded engines emit an
  empty shard name, so keying degrades to table-only and matches prior behavior.
- protoShardPlansToStorage fails closed on remote/untrusted per-shard changes:
  nil entry, empty table/DDL, unsupported change type, or namespace mismatch.
- buildShardedApplyOperationGroups validates each shard's changes before building
  operation keys / engine requests, rejecting empty or mismatched fields.
- Tests for the per-shard key, non-sharded degradation, and both fail-closed paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.

Files not reviewed (1)
  • pkg/proto/ternv1/tern.pb.go: Generated file

Comment thread pkg/tern/grpc_client.go
Comment thread pkg/tern/local_client.go
Comment thread pkg/tern/local_client.go
…shard dispatch

- A task-less group_finalizer in a multi-operation apply carries no task rows,
  so the operator's task-derived operation→parent projection can never move its
  operation row off pending: both terminal completion AND terminal failure were
  silently lost, stranding the operation and causing endless operator retries.
  The remote drive now reflects the remote terminal state onto its own operation
  row (persistFinalizerOperationTerminalState: MarkCompleted/MarkFailed), and a
  retryable failure is left non-terminal for re-drive — mirroring the local
  driveGroupFinalizer. Gated by scope.finalizerOperationScope().
- scopedDispatchDDLChanges now rejects CHANGE_TYPE_VSCHEMA: a shard-scoped
  dispatch carries only table DDL (create/alter/drop); a VSchema update is
  keyspace-wide and applied by the finalizer, never shard-tagged.
- Extracted dispatchTargetShard, which trims and validates the single target
  shard and fails closed on an empty/whitespace shard (an invalid scope the
  engine would reject or mis-execute); consolidates the prior >1-shard check.
- Tests: finalizer terminal completed/failed reflected onto the operation row;
  vschema rejected by scoped dispatch; dispatchTargetShard trim/validation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 9 comments.

Files not reviewed (1)
  • pkg/proto/ternv1/tern.pb.go: Generated file

Comment thread pkg/tern/local_control_resume.go Outdated
Comment thread pkg/tern/local_control_resume.go Outdated
Comment thread pkg/tern/local_control_resume_test.go Outdated
Comment thread pkg/tern/local_control_resume_test.go Outdated
Comment thread pkg/tern/local_client.go
Comment thread pkg/api/proto_helpers.go
Comment thread pkg/storage/mysqlstore/plans_test.go
Comment thread pkg/storage/mysqlstore/plans_test.go Outdated
Comment thread pkg/tern/local_control_resume.go Outdated
morgo and others added 2 commits June 25, 2026 20:43
…-control-plane-fanout

# Conflicts:
#	pkg/proto/ternv1/tern.pb.go
- The resume/recovery re-plan key now includes namespace: shardTableKey is
  (namespace, shard, table) and replanShardTableDDL / tableStillNeedsChange /
  replanAndFilterTasks key by the full tuple. A plan is keyed by (Namespace,
  Shard), so in a multi-keyspace Vitess apply the same shard+table can exist in
  two namespaces; keying by (shard, table) alone would reconcile a task against
  another keyspace's diff. Non-sharded engines emit an empty shard, so it
  degrades to (namespace, table).
- scopedDispatchDDLChanges and protoShardPlansToStorage now trim namespace/table/
  DDL before validating and storing, instead of validating a trimmed copy but
  persisting the untrimmed value. These feed operation keys and task rows, so
  leaked surrounding whitespace caused surprising keys and reconciliation/progress
  mismatches. scopedDispatchDDLChanges also fails closed on an empty namespace
  (authoritative scope for a shard-scoped dispatch).
- Tests: re-plan key distinguishes same shard+table across keyspaces; scoped
  dispatch trims values and rejects empty namespace; the mysqlstore plans
  round-trip now carries and asserts per-shard ShardPlan.Changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • pkg/proto/ternv1/tern.pb.go: Generated file

Comment thread pkg/api/proto_helpers.go
Comment thread pkg/proto/tern.proto Outdated
Per reviewer: this proto has no released consumers yet, so the needs_change
removal can be a clean breaking change rather than carrying a reserved marker.
Drop `reserved 3 / "needs_change"` and renumber `changes` 4 -> 3. Regenerated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@morgo morgo merged commit 8b46d19 into block:main Jun 26, 2026
32 checks passed
@morgo morgo deleted the morgo/sharded-apply-control-plane-fanout branch June 26, 2026 11:48
morgo added a commit to morgo/schemabot that referenced this pull request Jun 26, 2026
The plan comment was keyspace-level: a divergent sharded plan (shards needing
different changes — drift, or a partially-applied canary) collapsed to one DDL
block, so an operator couldn't see what would apply where. The per-shard plan
data block#535 persists never reached the comment — apitypes.PlanResponse carried
only the namespace-level Changes.

- apitypes.PlanResponse gains Shards ([]ShardPlanResponse), and
  planResponseFromProto carries them from the proto plan response.
- KeyspaceChangeData gains per-shard Shards; buildPlanCommentData threads them.
- writeKeyspaceChanges groups a sharded keyspace's shards by change signature:
  uniform shows the DDL once; divergent renders "what applies where" — one DDL
  block per distinct change set, labelled with the shards it applies to.

No gap change needed: the per-shard data already flows from gap's Strata engine
through the proto plan response; only the control-plane rendering changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
morgo added a commit that referenced this pull request Jun 26, 2026
…oup-by-variant) (#546)

* feat(webhook): dedicated sharded-apply comment (per-shard, group-by-variant)

A sharded apply fans out across the shards of one keyspace within a single
deployment, but the comment renderer reused the multi-deployment layout, whose
unit is the deployment. Every per-shard operation shares the one deployment
("cake"), so all shards rendered as identical "cake" rows, and — because the
per-deployment Details map is keyed by deployment name — the per-shard detail
collided and a failed shard's error was dropped (it showed a pending sibling's
body instead). That is why a failed rollout surfaced no error.

Add a dedicated sharded layout:
- pkg/webhook/templates/sharded_apply.go: RenderShardedApplyComment. Shards are
  grouped by change signature — a uniform apply renders one Shard|Status table
  with the DDL once; divergent shards (different tables, or the same table
  computing to different DDL because shards drifted) render one group per
  distinct change set, each tying its shards' statuses to the DDL it runs. The
  first failed shard's error is lifted to the top and shown in its row.
- pkg/webhook/sharded_apply.go: detect a sharded apply (work ops keyed
  namespace/shard/table, one deployment), and build the comment input from the
  operation rows + tasks. Per-shard status is derived through pkg/presentation
  with the shard name as the operation identity, so ordering labels reference
  shards ("waiting for -40", "halted — -40 failed") with no presentation change.
  formatApply{Status,Summary}Comment route sharded applies here.

Finalizer (VSchema) operations are out of scope for the shard view for now;
their outcome is still reflected in the aggregate headline state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Address Copilot: handle failed_retryable in the sharded comment

The sharded layout only recognised the terminal Failed state, so an apply that
SchemaBot is auto-retrying (failed_retryable) lost its error and its next-action
guidance. Mirror the single-deployment renderer:
- isShardFailureState covers Failed and FailedRetryable; the first-failure
  callout and the per-shard status cell both surface the error for either.
- writeShardedFooter adds the FailedRetryable case with the "stop retrying"
  command, matching writeApplyFooter.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(plan): show per-shard changes in the sharded plan comment

The plan comment was keyspace-level: a divergent sharded plan (shards needing
different changes — drift, or a partially-applied canary) collapsed to one DDL
block, so an operator couldn't see what would apply where. The per-shard plan
data #535 persists never reached the comment — apitypes.PlanResponse carried
only the namespace-level Changes.

- apitypes.PlanResponse gains Shards ([]ShardPlanResponse), and
  planResponseFromProto carries them from the proto plan response.
- KeyspaceChangeData gains per-shard Shards; buildPlanCommentData threads them.
- writeKeyspaceChanges groups a sharded keyspace's shards by change signature:
  uniform shows the DDL once; divergent renders "what applies where" — one DDL
  block per distinct change set, labelled with the shards it applies to.

No gap change needed: the per-shard data already flows from gap's Strata engine
through the proto plan response; only the control-plane rendering changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Address Copilot: join all task DDLs per shard cell

buildShardedApplyData took only the first task's DDL for an operation. An
operation can carry more than one task for its (namespace, shard, table) when a
shard plan yields multiple statements for the same table, so this dropped
statements and corrupted the change signature used to group shards. Join every
non-empty task DDL in task order instead. Covered by
TestBuildShardedApplyData_JoinsMultiTaskDDL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Address Copilot: single-keyspace guard + count per-shard plan DDL

- isShardedApply now also requires every shard work operation to share one
  namespace. A multi-keyspace apply (still one deployment) would otherwise route
  to the sharded layout and mislabel the keyspace / mix unrelated shard groups;
  it now falls back to the existing layout.
- countChanges counts a keyspace's per-shard statements when the collapsed
  namespace-level Statements are absent (keyspaceStatementCount), so a sharded
  plan whose only DDL is per-shard is not miscounted as "no changes" and gets
  correct summary counts.
- Tests for multi-keyspace fallback and per-shard-only plan counting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(sharded): per-shard unsafe in plan; surface failed-shard error in apply

Apply comment: a remote failure records its error on the operation's task, and
the operator may not stamp it onto the operation row — so the sharded apply
comment showed a failed shard with no error (forcing a trip to logs).
aggregateShardState now falls back to the first task error when the operation
row carries none, so a failed shard always shows why.

Plan comment: an unsafe change confined to one shard (e.g. a column drop on a
single drifted shard) is now flagged with that shard. shardedUnsafeChanges
derives unsafe changes from the per-shard plan (grouped by table+reason → shards)
when present, since the collapsed namespace-level Changes can omit a shard's
change; the unsafe-changes warning names the shard(s).

Tests: apply falls back to task error; plan surfaces a per-shard unsafe change
scoped to its shard (both the builder and the rendered warning).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Address Copilot + add control-plane shard-scope guard

Copilot fixes:
- parseShardOperationKey rejects keys with extra segments (strict 3-part split),
  so "ns/-40/table/extra" is no longer misclassified as shard work.
- buildShardedApplyData sorts each operation's tasks by id before joining DDL,
  so the joined DDL and the change signature are deterministic (tasks arrive
  created_at DESC). In practice an operation has one task — multiple statements
  for one table are combined into one ALTER upstream — so this is defensive.
- planResponseFromProto drops shard plans with no changes, and buildPlanCommentData
  skips per-shard entries with no DDL, so an empty shard never renders a blank
  section or inflates the change count (a shard is changing iff it has changes).

Control-plane fail-closed guard (requested):
- dispatchPendingApply refuses to dispatch a shard work operation
  ("namespace/shard/table") that resolves no target shard, with a clear error
  naming the skew, instead of sending an empty TargetShards that the data plane
  rejects opaquely as "expected exactly one target shard, got 0".

Also correct the per-shard unsafe test/preview: a same-table change is one
combined ALTER, not multiple statements (multi-statement-per-table is unsupported).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(templates): add Sharded Apply previews to TEMPLATES.md

The sharded apply/plan comments weren't represented in the generated
TEMPLATES.md catalog. Add preview renderings (pkg/webhook/templates/
preview_sharded.go) for the divergent plan, the per-shard unsafe-change plan,
and the in-progress / failed / divergent apply comments — per-shard statuses
derived through pkg/presentation so they match production — wire a
comment_sharded_all aggregate preview type, and add a "Sharded Apply" section to
update-templates.sh. Regenerated TEMPLATES.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants