Support rollback for ManifestWorkReplicaSet #164

youngbupark · 2025-11-12T05:42:00Z

Support rollback for ManifestWorkReplicaSet

openshift-ci · 2025-11-12T05:42:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: youngbupark
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

youngbupark · 2025-11-12T05:47:38Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+
+## Design Details
+
+The `abort` operation cancels the progressing rollout and rolls it back to the stable revision automatically without updating `.spec.manifestWorkTemplate`, whereas `rollback` requires explicit manual action by updating `.spec.manifestWorkTemplate`.


I am following the same abort and rollback concepts as Argo Rollouts. See https://argo-rollouts.readthedocs.io/en/stable/getting-started/#4-aborting-a-rollout

youngbupark · 2025-11-12T05:48:39Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+
+### ManifestWorkReplicaSet API Object
+
+To support abort and rollback feature, we will have the following API changes:


I wonder whether I should split this enhancement into separate proposals for the rollback and abort features.

youngbupark · 2025-11-12T05:50:38Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+
+## Summary
+
+This enhancement proposes adding rollback capabilities to ManifestWorkReplicaSet (MWRS) to enable safe recovery from failed multi-cluster deployments. The solution leverages Kubernetes ControllerRevision resources to maintain a historical record of MWRS template changes, similar to how StatefulSet and DaemonSet controllers track revision history. 


I will update the plugin enhancement proposal (#160) after this proposal is approved. I would rename rollback to abort in the plugin hook name.

haoqing0110 · 2025-11-12T09:02:24Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+        progressive:
+          ...
+        # (NEW) abortOnFailure: if true, automatically aborts and rolls back on failure; otherwise, rollout ends on failure
+        abortOnFailure: true


Will abort follow the same rolloutStrategy, for example automatic rollback all at once or Progressive?

In the first iteration, we will automatically abort on all clusters at once for simplicity. Please let me know if you think abort still needs to follow rolloutStrategy.

Aborting on all clusters at once will make things easier as a start, I think; we just need to clarify the behavior in the doc.

I would define it as failureStrategy, with abortAll or rollbackAll as the strategy name. Try to avoid boolean type.

OK, good. Let me update it.
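
As a rough illustration of the enum-style field suggested in this thread, the sketch below models `failureStrategy` as a string type instead of a boolean. The type name, the `None` default, the exact casing of the values, and the validation helper are all assumptions made for illustration; they are not the API the proposal settled on.

```go
package main

import "fmt"

// FailureStrategyType is a hypothetical enum that would replace the
// boolean abortOnFailure field discussed in this thread.
type FailureStrategyType string

const (
	// FailureStrategyNone (assumed default): the rollout simply ends on failure.
	FailureStrategyNone FailureStrategyType = "None"
	// FailureStrategyAbortAll: abort and restore the stable revision on all
	// clusters at once.
	FailureStrategyAbortAll FailureStrategyType = "AbortAll"
)

// ValidateFailureStrategy maps an empty value to the assumed default and
// rejects anything outside the known set.
func ValidateFailureStrategy(s FailureStrategyType) (FailureStrategyType, error) {
	switch s {
	case "":
		return FailureStrategyNone, nil
	case FailureStrategyNone, FailureStrategyAbortAll:
		return s, nil
	default:
		return "", fmt.Errorf("unsupported failureStrategy %q", s)
	}
}

func main() {
	for _, in := range []FailureStrategyType{"", "AbortAll", "true"} {
		got, err := ValidateFailureStrategy(in)
		fmt.Printf("input=%q strategy=%q err=%v\n", in, got, err)
	}
}
```

A string enum leaves room for further strategies later (for example a delayed or progressive abort) without another API change, which a boolean cannot do.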

haoqing0110 · 2025-11-12T09:13:08Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+| PlacementVerified | AsExpected, PlacementDecisionNotFound, PlacementDecisionEmpty, NotAsExpected | Indicates if Placement is valid |
+| ManifestworkApplied | AsExpected, NotAsExpected, Processing | A ManifestWork has been created in each cluster defined by PlacementDecision |
+| PlacementRollOut | Progressing, Complete, `(NEW)` RolloutDegraded | Indicates if RollOut Strategy is complete |
+| `(NEW)` Progressing | NewRevisionCreated, FoundNewRevision, NewManifestWorkAvailable, RolloutAborted, ProgressDeadlineExceeded | The rollout is progressing. Progress for a rollout is considered when a new manifest is created or adopted, when new manifestwork rolls out |


Can the new Progressing condition be merged with PlacementRollOut, letting NewRevisionCreated, FoundNewRevision, NewManifestWorkAvailable, RolloutAborted, and ProgressDeadlineExceeded replace the reason Progressing?

PlacementRolledOut = True should be the indicator of Ready status, based on this comment.

If we merge the Progressing condition into PlacementRolledOut, we lose track of whether the MWRS is progressing unless we check the Reason. When the PlacementRolledOut Reason is RolloutDegraded, we would presumably set PlacementRolledOut = False, but then we cannot tell whether it is still progressing or has failed.
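
To make the argument for keeping two conditions concrete, here is a minimal sketch of how a client could classify an MWRS from both of them. The `Condition` struct is a trimmed stand-in for `metav1.Condition`, and the True/False semantics assigned to `Progressing` are an assumption borrowed from how Deployments report progress, not something this proposal has fixed.

```go
package main

import "fmt"

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// findCondition returns the condition of the given type, or nil if absent.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// RolloutState derives a coarse state from the two conditions.
// Assumption: Progressing=False means the rollout stopped without completing.
func RolloutState(conds []Condition) string {
	rolledOut := findCondition(conds, "PlacementRolledOut")
	progressing := findCondition(conds, "Progressing")
	switch {
	case rolledOut != nil && rolledOut.Status == "True":
		return "Ready"
	case progressing != nil && progressing.Status == "True":
		return "Progressing"
	case progressing != nil && progressing.Status == "False":
		return "Failed"
	default:
		return "Unknown"
	}
}

func main() {
	stillProgressing := []Condition{
		{Type: "PlacementRolledOut", Status: "False", Reason: "RolloutDegraded"},
		{Type: "Progressing", Status: "True", Reason: "NewManifestWorkAvailable"},
	}
	deadlineExceeded := []Condition{
		{Type: "PlacementRolledOut", Status: "False", Reason: "RolloutDegraded"},
		{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"},
	}
	fmt.Println(RolloutState(stillProgressing))
	fmt.Println(RolloutState(deadlineExceeded))
}
```

Both inputs carry the same `PlacementRolledOut` condition; only the separate `Progressing` condition tells them apart, which is the case the merged design would lose.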


haoqing0110 · 2025-11-12T09:25:52Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+
+`Abort` is used to **cancel** the current rollout. When aborting the current rollout, spec is not changed, but `status.abort` is set to `true`, which will set `status.abortedTime`. then loading the older revision and then apply `.spec.manifestWorkTemplate` of the old revision to all clusters at once.
+
+On the other hand, `rollback` is the explicit action. If user wants to roll back to the older revision, the user (or cli) can find the revision from the list of ControllerRevision resources and apply the older manifestTemplate to update the `.spec.manifestWorkTemplate`.


Does this mean the user cannot use the Revision for a manual rollback, and instead needs to copy the template and update .spec.manifestWorkTemplate?

@haoqing0110 kubectl rollout undo for Deployment/StatefulSet/DaemonSet first finds the older revision and then updates the pod template in the resource. The core controllers do not change spec themselves because of the potential race condition. Ideally, clusteradm should gain the ability to roll back in the future for a better UX.
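
The client-side flow described in this reply can be sketched roughly as follows. The types are simplified stand-ins (a real ControllerRevision stores its payload as raw serialized data, and the MWRS template is a structured object), and `RollbackTo` is a hypothetical helper, not an existing clusteradm or OCM function.

```go
package main

import "fmt"

// ControllerRevision is a simplified stand-in for the apps/v1 resource.
type ControllerRevision struct {
	Name     string
	Revision int64
	Data     string // serialized manifestWorkTemplate (simplified to a string)
}

// ManifestWorkReplicaSet is a simplified stand-in holding only the template.
type ManifestWorkReplicaSet struct {
	Spec struct{ ManifestWorkTemplate string }
}

// RollbackTo copies the template stored in the requested revision into the
// spec. The controller then rolls it out like any other spec change.
func RollbackTo(mwrs *ManifestWorkReplicaSet, history []ControllerRevision, revision int64) error {
	for _, rev := range history {
		if rev.Revision == revision {
			mwrs.Spec.ManifestWorkTemplate = rev.Data
			return nil
		}
	}
	return fmt.Errorf("revision %d not found in history", revision)
}

func main() {
	history := []ControllerRevision{
		{Name: "demo-1", Revision: 1, Data: "image: app:v1"},
		{Name: "demo-2", Revision: 2, Data: "image: app:v2"},
	}
	mwrs := &ManifestWorkReplicaSet{}
	mwrs.Spec.ManifestWorkTemplate = "image: app:v2"

	if err := RollbackTo(mwrs, history, 1); err != nil {
		fmt.Println("rollback failed:", err)
		return
	}
	fmt.Println(mwrs.Spec.ManifestWorkTemplate)
}
```

Because the change goes through `.spec`, the only writer of the spec is the user or CLI, which avoids the race the reply mentions.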

annelaucg · 2025-11-13T20:01:35Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+### Non-Goals
+
+- Manual rollback via kubectl plugin commands (rollback will be performed by updating `.spec.manifestWorkTemplate`)
+- Rollback of CRD versions (automatic rollback should be used with caution when CRDs are involved)


Does this mean CRDs deployed by MWRS will not be included in the rollback?

It will support any resource; however, we will need to add a caution to the documentation in the future. I will delete this part.

annelaucg · 2025-11-13T20:03:41Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+   - **Mitigation**: Using the automatic rollback feature (`abortOnFailure`) should be opt-in and used with caution. Users should be aware of the implications when rolling back CRD changes.
+
+2. **Storage Growth Risk** - Maintaining revision history could lead to unbounded storage growth.
+   - **Mitigation**: The `revisionHistoryLimit` field allows users to control the maximum number of revisions retained. The controller will automatically prune old revisions.


Is there any maximum that the user can define?

I haven't thought about the number. It might be 10?
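
For reference, pruning against `revisionHistoryLimit` could look roughly like the sketch below. The `Revision` type, the `inUse` set (standing in for the current and update revisions), and the choice not to count in-use revisions against the limit are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// Revision is a simplified stand-in for a ControllerRevision.
type Revision struct {
	Name     string
	Revision int64
}

// RevisionsToPrune returns the names of the oldest revisions that exceed the
// limit. Revisions listed in inUse are never pruned and are not counted.
func RevisionsToPrune(history []Revision, limit int, inUse map[string]bool) []string {
	if limit < 0 {
		limit = 0
	}
	var candidates []Revision
	for _, r := range history {
		if !inUse[r.Name] {
			candidates = append(candidates, r)
		}
	}
	if len(candidates) <= limit {
		return nil
	}
	// Oldest first, so the slice prefix is what gets deleted.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Revision < candidates[j].Revision
	})
	var names []string
	for _, r := range candidates[:len(candidates)-limit] {
		names = append(names, r.Name)
	}
	return names
}

func main() {
	history := []Revision{{"rev-3", 3}, {"rev-1", 1}, {"rev-4", 4}, {"rev-2", 2}}
	inUse := map[string]bool{"rev-4": true}
	fmt.Println(RevisionsToPrune(history, 2, inUse)) // prints [rev-1]
}
```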


annelaucg · 2025-11-13T20:16:55Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+| PlacementVerified | AsExpected, PlacementDecisionNotFound, PlacementDecisionEmpty, NotAsExpected | Indicates if Placement is valid |
+| ManifestworkApplied | AsExpected, NotAsExpected, Processing | A ManifestWork has been created in each cluster defined by PlacementDecision |
+| PlacementRollOut | Progressing, Complete, `(NEW)` RolloutDegraded | Indicates if RollOut Strategy is complete |
+| `(NEW)` Progressing | NewRevisionCreated, FoundNewRevision, NewManifestWorkAvailable, RolloutAborted, ProgressDeadlineExceeded | The rollout is progressing. Progress for a rollout is considered when a new manifest is created or adopted, when new manifestwork rolls out |


PlacementRolledOut = True should be the indicator of Ready status, based on this comment.

If we merge the Progressing condition into PlacementRolledOut, we lose track of whether the MWRS is progressing unless we check the Reason. When the PlacementRolledOut Reason is RolloutDegraded, we would presumably set PlacementRolledOut = False, but then we cannot tell whether it is still progressing or has failed.


annelaucg · 2025-11-13T20:18:10Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+
+### `Abort` vs `Rollback`
+
+`Abort` is used to **cancel** the current rollout. When aborting the current rollout, spec is not changed, but `status.abort` is set to `true`, which will set `status.abortedTime`. then loading the older revision and then apply `.spec.manifestWorkTemplate` of the old revision to all clusters at once.


nit: MWRS.spec is not changed

annelaucg · 2025-11-13T20:23:12Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+    1. If history creation failed:
+        1. If it is because of name collision:
+            1. If the collided history is same as `ManifestWorkReplicaSet`'s `.spec.ManifestWorkTemplate` desired state, there is the already created the history
+            1. Otherwise, bump ManifestWorkReplicaSet `.status.collisionCount` by 1, 


Is the intent to bump .status.collisionCount so that the ControllerRevision gets created in the next reconcile loop?
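
One way to read the collision step is that the collision count is mixed into the hash that names the ControllerRevision, so bumping it yields a different name on the next reconcile. The sketch below illustrates that idea only; `RevisionName`, the FNV-1a hash, and the naming scheme are assumptions, and the proposal's own hash calculation section is authoritative.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

// RevisionName derives a revision name from the serialized template and the
// collision count. A non-zero count is mixed into the hash so that a name
// collision can be resolved by bumping the count and retrying.
func RevisionName(owner string, template []byte, collisionCount int32) string {
	h := fnv.New32a()
	h.Write(template)
	if collisionCount > 0 {
		h.Write([]byte(strconv.FormatInt(int64(collisionCount), 10)))
	}
	return fmt.Sprintf("%s-%x", owner, h.Sum32())
}

func main() {
	template := []byte(`{"workload":{"manifests":[]}}`)
	fmt.Println(RevisionName("demo", template, 0))
	// After a collision with a different template, the count is bumped and
	// the next reconcile computes a new name for the same template.
	fmt.Println(RevisionName("demo", template, 1))
}
```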

annelaucg · 2025-11-13T20:25:36Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+        1. [Determine cluster rollout status (Failed, Success)](https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/hub/controllers/manifestworkreplicasetcontroller/manifestworkreplicaset_deploy_reconcile.go#L96)
+    1. Create Rollout handler
+    1. Calculate RolloutResult (rollout/timeout/removed candidate clusters)
+    1. `(NEW)` If rolloutResult includes timeout clusters and `.spec.placementRefs[*].rolloutStrategy.abortOnFailure` is true


In Argo Rollouts, do timed-out pods also trigger an abort/rollout action?

No. Argo Rollouts triggers an abort only when an analysis step fails. Abort on timeout is an opt-in feature in Argo Rollouts: progressDeadlineAbort: true must be set explicitly. The same applies here: automatic abort will be an opt-in feature, not opt-out.

qiujian16

Thanks. I'd like to have two examples of what the MWRS status looks like during rollback:

- Upgrade and abort: how the status (conditions, reasons, summary, etc.) changes in each state, and how the user can tell whether a rollback is done and at which revision.
- Manual rollback: how the user sets the rollback and checks whether it has finished.

qiujian16 · 2025-11-17T03:20:08Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+        progressive:
+          ...
+        # (NEW) abortOnFailure: if true, automatically aborts and rolls back on failure; otherwise, rollout ends on failure
+        abortOnFailure: true


I would define it as failureStrategy, with abortAll or rollbackAll as the strategy name. Try to avoid boolean type.


qiujian16 · 2025-11-17T03:29:24Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+
+### Open Questions
+
+1. Should we have the delay before starting aborting a rollout ?


Sounds reasonable, but we could consider this as an additional abort strategy in the beta phase.

haoqing0110 · 2025-11-20T05:30:16Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+        1. [Determine cluster rollout status (Failed, Success)](https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/hub/controllers/manifestworkreplicasetcontroller/manifestworkreplicaset_deploy_reconcile.go#L96)
+    1. Create Rollout handler
+    1. Calculate RolloutResult (rollout/timeout/removed candidate clusters)
+    1. `(NEW)` If rolloutResult includes timeout clusters and `.spec.placementRefs[*].rolloutStrategy.abortOnFailure` is true


What's the condition to trigger automatic abort? How about the failed clusters?

When minSuccessTime is exceeded for the current rollout group, it will start the abort process.

Do you mean the progressDeadline is exceeded? minSuccessTime is used as a soak time before rolling out to the next cluster.
If you mean progressDeadline, there is also the maxFailures field to consider; the doc may need to clarify how abort is triggered when progressDeadline and maxFailures are defined and when they are not.

@haoqing0110 I added a new step, Evaluate abort condition. Please review it to see if it makes sense.
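
The Evaluate abort condition step itself is not quoted in this thread, so the following is only one plausible reading of the discussion: abort when an abort strategy is selected and the number of clusters that timed out (progressDeadline exceeded) or failed is greater than maxFailures. `RolloutResult`, `ShouldAbort`, the `"AbortAll"` value, and the decision to count both timed-out and failed clusters are all assumptions.

```go
package main

import "fmt"

// RolloutResult is a hypothetical summary of the current rollout.
type RolloutResult struct {
	TimedOut int // clusters that exceeded progressDeadline
	Failed   int // clusters reported as failed
}

// ShouldAbort reports whether an automatic abort should start.
// Assumption: maxFailures is an absolute count and defaults to 0, so any
// timed-out or failed cluster triggers the abort when it is not set.
func ShouldAbort(strategy string, result RolloutResult, maxFailures int) bool {
	if strategy != "AbortAll" {
		return false // automatic abort is opt-in
	}
	return result.TimedOut+result.Failed > maxFailures
}

func main() {
	result := RolloutResult{TimedOut: 1, Failed: 1}
	fmt.Println(ShouldAbort("AbortAll", result, 0)) // true: 2 > 0
	fmt.Println(ShouldAbort("AbortAll", result, 2)) // false: 2 is within the budget
	fmt.Println(ShouldAbort("None", result, 0))     // false: abort not opted in
}
```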

youngbupark · 2025-11-29T02:53:12Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+        1. if `.summary.updated` == `.summary.desiredTotal`,
+            1. Set `.status.updateRevision` to `.status.currentRevision`
+
+### Status transition


@qiujian16 This section shows the transition of the status conditions for each situation (I think it is more efficient than showing the full status resource example). It doesn't include the cluster counts (summary) since they are covered above, but please let me know if you need additional info here.

youngbupark · 2025-11-29T02:53:58Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+        1. [Determine cluster rollout status (Failed, Success)](https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/hub/controllers/manifestworkreplicasetcontroller/manifestworkreplicaset_deploy_reconcile.go#L96)
+    1. Create Rollout handler
+    1. Calculate RolloutResult (rollout/timeout/removed candidate clusters)
+    1. `(NEW)` Evaluate abort condition,


@haoqing0110 I added the abort condition evaluation step to show how to decide whether an abort is required.

youngbupark · 2025-11-29T02:55:00Z

enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md

+  ...
+```
+
+#### Calculate the hash


@qiujian16 I added the hash calculation details. PTAL.

youngbupark added 6 commits October 30, 2025 22:42

wip

846b535

wip

1f8a33d

add more

e8edda0

updated

6267d2c

add author

cf319bd

update title

b8f068e

openshift-ci bot added the do-not-merge/work-in-progress label Nov 12, 2025

youngbupark marked this pull request as ready for review November 12, 2025 05:43

openshift-ci bot removed the do-not-merge/work-in-progress label Nov 12, 2025

openshift-ci bot requested review from deads2k and qiujian16 November 12, 2025 05:43

youngbupark commented Nov 12, 2025


haoqing0110 reviewed Nov 12, 2025


annelaucg reviewed Nov 13, 2025


youngbupark added 7 commits November 15, 2025 18:49

fix feedback

a06171b

add desiredTotal

36a0aa2

revise

11472d5

reivse

acb87b3

wip

eab1e25

revise

a7d6f3d

revise more

65abfaa

qiujian16 reviewed Nov 17, 2025


updated

308d071

annelaucg mentioned this pull request Nov 19, 2025

[WIP] Ready & Progressing status in MWRS #163

Closed

haoqing0110 reviewed Nov 20, 2025


youngbupark mentioned this pull request Nov 21, 2025

ManifestWorkReplicaSet Rollout Plugin #160

Open

fix

402f64f

youngbupark added 3 commits November 28, 2025 02:01

wip

34de00f

addressed all

a75a60d

wip

a3610d2

youngbupark commented Nov 29, 2025



