-
Notifications
You must be signed in to change notification settings - Fork 43
Support rollback for ManifestWorkReplicaSet #164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Support rollback for ManifestWorkReplicaSet #164
Conversation
youngbupark
commented
Nov 12, 2025
- Support rollback for ManifestWorkReplicaSet
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: youngbupark The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
|
||
| ## Design Details | ||
|
|
||
| The `abort` operation cancels the progressing rollout and rolls it back to the stable revision automatically without updating `.spec.manifestWorkTemplate`, whereas `rollback` requires explicit manual action by updating `.spec.manifestWorkTemplate`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am following the same concept of abort and rollback of argo rollout. See this - https://argo-rollouts.readthedocs.io/en/stable/getting-started/#4-aborting-a-rollout
|
|
||
| ### ManifestWorkReplicaSet API Object | ||
|
|
||
| To support abort and rollback feature, we will have the following API changes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if I need to split the enhancement for rollback and abort features.
|
|
||
| ## Summary | ||
|
|
||
| This enhancement proposes adding rollback capabilities to ManifestWorkReplicaSet (MWRS) to enable safe recovery from failed multi-cluster deployments. The solution leverages Kubernetes ControllerRevision resources to maintain a historical record of MWRS template changes, similar to how StatefulSet and DaemonSet controllers track revision history. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update the plugin enhancment proposal after this proposal is approved - #160 I would rename rollback to abort in plugin hook name.
| progressive: | ||
| ... | ||
| # (NEW) abortOnFailure: if true, automatically aborts and rolls back on failure; otherwise, rollout ends on failure | ||
| abortOnFailure: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will abort follow the same rolloutStrategy? Like auto rollback all at once, Progressive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the first iteration, we will auto abort to all clusters for the simplicity. Please let me know if you think we still need to follow rolloutStrategy for abort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abort to all clusters will make things easier as a start I think, just need to clarify the behavior in the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would define it as failureStrategy, with abortAll or rollbackAll as the strategy name. Try to avoid boolean type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok good. let me update it.
| | PlacementVerified | AsExpected, PlacementDecisionNotFound, PlacementDecisionEmpty, NotAsExpected | Indicates if Placement is valid | | ||
| | ManifestworkApplied | AsExpected, NotAsExpected, Processing | A ManifestWork has been created in each cluster defined by PlacementDecision | | ||
| | PlacementRollOut | Progressing, Complete, `(NEW)` RolloutDegraded | Indicates if RollOut Strategy is complete | | ||
| | `(NEW)` Progressing | NewRevisionCreated, FoundNewRevision, NewManifestWorkAvailable, RolloutAborted, ProgressDeadlineExceeded | The rollout is progressing. Progress for a rollout is considered when a new manifest is created or adopted, when new manifestwork rolls out | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the new condition Progressing be merged with PlacementRollOut? Let NewRevisionCreated, FoundNewRevision, NewManifestWorkAvailable, RolloutAborted, ProgressDeadlineExceeded replace reason
Progressing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PlacementRolledOut = True should be the indicator of Ready status, based on this comment.
If we merge the condition Progressing into PlacementRolledOut, we will lose track of whether the MWRS is Progressing, unless we check the Reason. When PlacementRolledOut Reason is RolloutDegraded, assuming we will set PlacementRolled = False. But this way, we cannot tell if it is Progressing still or Failed.
enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md
Outdated
Show resolved
Hide resolved
|
|
||
| `Abort` is used to **cancel** the current rollout. When aborting the current rollout, spec is not changed, but `status.abort` is set to `true`, which will set `status.abortedTime`. then loading the older revision and then apply `.spec.manifestWorkTemplate` of the old revision to all clusters at once. | ||
|
|
||
| On the other hand, `rollback` is the explicit action. If user wants to roll back to the older revision, the user (or cli) can find the revision from the list of ControllerRevision resources and apply the older manifestTemplate to update the `.spec.manifestWorkTemplate`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the user can not use Revision to manual rollout? but need to copy and update the .spec.manifestWorkTemplate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haoqing0110 kubectl undo for deployment/stateful/daemonset first finds the older revision and update its pod template in the resource. core controllers don't have the functionality to change spec due to the potential race condition. Ideally, clusteradm should have the ability to rollback in the future for the better UX.
| ### Non-Goals | ||
|
|
||
| - Manual rollback via kubectl plugin commands (rollback will be performed by updating `.spec.manifestWorkTemplate`) | ||
| - Rollback of CRD versions (automatic rollback should be used with caution when CRDs are involved) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean CRDs deployed by MWRS will not be included in the rollback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will support any resources, however, we will need to add the caution in the document in the future. I would delete this part.
| - **Mitigation**: Using the automatic rollback feature (`abortOnFailure`) should be opt-in and used with caution. Users should be aware of the implications when rolling back CRD changes. | ||
|
|
||
| 2. **Storage Growth Risk** - Maintaining revision history could lead to unbounded storage growth. | ||
| - **Mitigation**: The `revisionHistoryLimit` field allows users to control the maximum number of revisions retained. The controller will automatically prune old revisions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any maximum that the user can define?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't thought about the number. it might be 10 ?
enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md
Show resolved
Hide resolved
| | PlacementVerified | AsExpected, PlacementDecisionNotFound, PlacementDecisionEmpty, NotAsExpected | Indicates if Placement is valid | | ||
| | ManifestworkApplied | AsExpected, NotAsExpected, Processing | A ManifestWork has been created in each cluster defined by PlacementDecision | | ||
| | PlacementRollOut | Progressing, Complete, `(NEW)` RolloutDegraded | Indicates if RollOut Strategy is complete | | ||
| | `(NEW)` Progressing | NewRevisionCreated, FoundNewRevision, NewManifestWorkAvailable, RolloutAborted, ProgressDeadlineExceeded | The rollout is progressing. Progress for a rollout is considered when a new manifest is created or adopted, when new manifestwork rolls out | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PlacementRolledOut = True should be the indicator of Ready status, based on this comment.
If we merge the condition Progressing into PlacementRolledOut, we will lose track of whether the MWRS is Progressing, unless we check the Reason. When PlacementRolledOut Reason is RolloutDegraded, assuming we will set PlacementRolled = False. But this way, we cannot tell if it is Progressing still or Failed.
enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md
Show resolved
Hide resolved
|
|
||
| ### `Abort` vs `Rollback` | ||
|
|
||
| `Abort` is used to **cancel** the current rollout. When aborting the current rollout, spec is not changed, but `status.abort` is set to `true`, which will set `status.abortedTime`. then loading the older revision and then apply `.spec.manifestWorkTemplate` of the old revision to all clusters at once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: MWRS.spec is not changed
| 1. If history creation failed: | ||
| 1. If it is because of name collision: | ||
| 1. If the collided history is same as `ManifestWorkReplicaSet`'s `.spec.ManifestWorkTemplate` desired state, there is the already created the history | ||
| 1. Otherwise, bump ManifestWorkReplicaSet `.status.collisionCount` by 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the intent to bump the .status.collisionCount and in the next reconcile loop, the ControllerRevision will get created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes right.
| 1. [Determine cluster rollout status (Failed, Success)](https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/hub/controllers/manifestworkreplicasetcontroller/manifestworkreplicaset_deploy_reconcile.go#L96) | ||
| 1. Create Rollout handler | ||
| 1. Calculate RolloutResult (rollout/timeout/removed candidate clusters) | ||
| 1. `(NEW)` If rolloutResult includes timeout clusters and `.spec.placementRefs[*].rolloutStrategy.abortOnFailure` is true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In argo rollout, do timed out pods also trigger abort/rollout action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No argo rollout will trigger abort action only if analysis step is failed. Timeout abort is opt-in feature in argo rollout. We need to set progressDeadlineAbort: true explicitly. Same here, automatic abort will be opt-in feature, not opt-out.
qiujian16
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I'd like to have two examples on how mwrs status looks like during rollback:
- upgrade and abort, how will the status changes (conditions/reasons/summary etc) among each state. And how user can know whether a rollback is done in which revision?
- manual rollback, how user set rollback and check if rollback is finished.
| progressive: | ||
| ... | ||
| # (NEW) abortOnFailure: if true, automatically aborts and rolls back on failure; otherwise, rollout ends on failure | ||
| abortOnFailure: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would define it as failureStrategy, with abortAll or rollbackAll as the strategy name. Try to avoid boolean type.
enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md
Show resolved
Hide resolved
enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md
Outdated
Show resolved
Hide resolved
enhancements/sig-architecture/229-manifestworkreplicaset-rollback/README.md
Outdated
Show resolved
Hide resolved
|
|
||
| ### Open Questions | ||
|
|
||
| 1. Should we have the delay before starting aborting a rollout ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds reasonable, but we could consider this as an additional abort strategy in beta phase.
| 1. [Determine cluster rollout status (Failed, Success)](https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/hub/controllers/manifestworkreplicasetcontroller/manifestworkreplicaset_deploy_reconcile.go#L96) | ||
| 1. Create Rollout handler | ||
| 1. Calculate RolloutResult (rollout/timeout/removed candidate clusters) | ||
| 1. `(NEW)` If rolloutResult includes timeout clusters and `.spec.placementRefs[*].rolloutStrategy.abortOnFailure` is true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the condition to trigger automatic abort? How about the failed clusters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When minSuccessTime is exceeded for the current rollout group, it will start the abort process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the progressDeadline is exceeded? minSuccessTime is used as a soak time before rollout to next cluster.
If you mean the progressDeadline, there's also a maxFailures field to consider, might need to clarify how the abort will be triggered when progressDeadline & maxFailures are defined and not defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haoqing0110 I added new step Evaluate abort condition Please review it to see if it makes sense.
| 1. if `.summary.updated` == `.summary.desiredTotal`, | ||
| 1. Set `.status.updateRevision` to `.status.currentRevision` | ||
|
|
||
| ### Status transition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiujian16 This section show the transition of status condition for each situation. (I think it is more efficient than show the full status resource example) this doesn't include cluster count (summary) since it is covered above. but please let me know if you need additional info here.
| 1. [Determine cluster rollout status (Failed, Success)](https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/hub/controllers/manifestworkreplicasetcontroller/manifestworkreplicaset_deploy_reconcile.go#L96) | ||
| 1. Create Rollout handler | ||
| 1. Calculate RolloutResult (rollout/timeout/removed candidate clusters) | ||
| 1. `(NEW)` Evaluate abort condition, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haoqing0110 I added the abort condition evaluation step to show how to decide whether it is abort required or not.
| ... | ||
| ``` | ||
|
|
||
| #### Calculate the hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiujian16 I added the hash calculation details. PTAL.