Skip to content

Conversation

@youngbupark
Copy link

  • This is the initial draft for supporting rollout plugin in ManifestWorkReplicaSet Work Controller
  • Note: rollback will be added when we propose MWRS automatic rollback enhancement.

@openshift-ci
Copy link

openshift-ci bot commented Oct 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: youngbupark
Once this PR has been reviewed and has the lgtm label, please assign qiujian16 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The following service defines the contract between Work Controller and the plugin. Each call must be idempotent, stateless, and time-bounded (≤30 s) to ensure consistent controller reconciliation. Plugin server must implement the following APIs. The helpers to implement server and clients will be implemented in [ocm/sdk-go](https://github.com/open-cluster-management-io/sdk-go) repository.

```proto
// RolloutPluginService is the service for the rollout plugin.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the initial commit of gRPC server proto - open-cluster-management-io/sdk-go#154

Note: The implementation can change as we develop.

@youngbupark youngbupark changed the title KEP: ManifestWorkReplicaSet Rollout Plugin ManifestWorkReplicaSet Rollout Plugin Oct 28, 2025
@youngbupark youngbupark marked this pull request as ready for review October 29, 2025 02:00
@openshift-ci openshift-ci bot requested review from deads2k and qiujian16 October 29, 2025 02:00
// RolloutPluginService is the service for the rollout plugin.
service RolloutPluginService {
// Initialize initializes the plugin.
rpc Initialize(InitializeRequest) returns (InitializeResponse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will need some clarification on error handling. What happens when a specific call fails? How would mwrs consumer to know and debug.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I will add error handling section.

// If the validation is completed successfully, the plugin should return a OK result.
// If the validation is still in progress, the plugin should return a INPROGRESS result.
// If the validation is failed, the plugin should return a FAILED result.
rpc ValidateRollout(RolloutPluginRequest) returns (ValidateResponse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will this be called in mwrs reconciler? I think a flow on when these APIs will be called in mwrs controller will be helpful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc includes mermaid sequence diagram to describe how each hook will be called. Please check this out.

// BeginRollout is called before the manifestwork resource is applied.
// It is used to prepare the rollout.
rpc BeginRollout(RolloutPluginRequest) returns (google.protobuf.Empty);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean any spec change on manifestwork will trigger this? What if placement changes but mw spec does not change in mwrs?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this BeginRollout will be called before creating manifestwork in each cluster (please see the mermaid sequence diagram) BeginRollout will be called whenever MWRS creates or update manifestwork in each cluster namespace.

| False | False or not set | SUCCEEDED | Succeeded | Work has been successfully applied |
| Unknown/Not set | Any | N/A | Progressing | Conservative fallback: treat as still progressing |

The following state machine shows the expected transitions between rollout statuses:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this only enabled when rollout strategy in mwrs is set?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes correct.

# Plugin configuration
plugin:
# the image name of plugin sidecar
image: quay.io/open-cluster-management/my-rollout-plugin:v0.0.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if setting image only is enough if the plugin also need to wire with cloud service or mesh which mean the plugin also need some customized args or secret.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. hmm. in this case, it might be better to run plugin separately than sidecar, similar to envoy extension filter. WDYT? then we can also support multiple plugins.


## Alternatives

- Implement plugin server as a standalone service: In order to run plugin server securely, we need to enable TLS connection. Using a sidecar, we can avoid the complex network and security configuration. No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the other hand, plugin can be easily customized.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to run standalone services. Do you prefer running plugin service as a standalone service? @qiujian16

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, the plugin API might need to be authenticated also if it wires to a cloud service, or the placement needs to use its own service account to call the plugin.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I updated the design.

- rolloutStatus: The current [cluster rollout status](https://github.com/open-cluster-management-io/sdk-go/blob/main/pkg/apis/cluster/v1alpha1/rollout.go#L23-L39) (e.g., ToApply, Progressing, Succeeded, Failed, TimeOut, Skip).
- manifestRevisionName: The name of the manifest revision applied to the cluster.

### Configure custom plugin for work controller
Copy link
Member

@haoqing0110 haoqing0110 Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once a plugin is enabled, what's the behavior of a normal mwrs rollout which does not need to call any plugin?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once a plugin is enabled, what's the behavior of a normal mwrs rollout which does not need to call any plugin?

Plugin will be applied to all mwrs. @haoqing0110 do you think it is better to make it opt-in ?

workDriver: kube
# plugin configuration
plugins:
- name: my-rollout
Copy link
Author

@youngbupark youngbupark Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiujian16 @haoqing0110 rather than using sidecar model, I agree to use standalone service. here is new cluster-manager resource model to register the plugin. new one supports multiple plugins and users can select its plugin if they need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

// ProgressRollout is called after the manifestwork is applied.
// Whenever the feedbacks are updated, this method will be called.
// The plugin can execute the rollout logic based on the feedback status changes.
rpc ProgressRollout(RolloutPluginRequest) returns (google.protobuf.Empty);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@annelaucg I removed Rollback specific operation as we discussed yesterday. it is much simpler.

### Using plugin in MWRS
Plugin is an opt-in feature. User can use the registered plugin by setting `.spec.placementRefs[*].rolloutStrategy.plugin`. The reconciler makes sure that plugin is available and show its availability in `PluginLoaded` status condition.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@haoqing0110 the feedback makes sense. Plugin must be an opt-in feature. to minimize the impact on the existing resource, we should not use plugin by default although it is configured in clustermanager. Otherwise, it can change the behavior of mwrs resource which user may not expect.

Copy link
Author

@youngbupark youngbupark Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiujian16 / based on @haoqing0110 feedback, I proposed new MWRS resource change to make plugin opt-in even if cluster-manager resource has registered plugins.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, note the work-controller will not be allowed to access clustermanager CR. So plugin setting will need to be set on work-controller's flag, or setting in a configmap by cluster manager, and work-controller will mount the configmap which makes sense to me also.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update it.

workDriver: kube
# plugin configuration
plugins:
- name: my-rollout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

### Using plugin in MWRS
Plugin is an opt-in feature. User can use the registered plugin by setting `.spec.placementRefs[*].rolloutStrategy.plugin`. The reconciler makes sure that plugin is available and show its availability in `PluginLoaded` status condition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, note the work-controller will not be allowed to access clustermanager CR. So plugin setting will need to be set on work-controller's flag, or setting in a configmap by cluster manager, and work-controller will mount the configmap which makes sense to me also.

# optional. secretRef is the reference for ca
secretRef:
name: my-rollout-ca
namespace: default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

namespace might not be needed. The secret has to be put in the open-cluster-management-hub ns.

Work->>Work: Create rollout handler
Work->>Work: Find rollout/removed/timeout candidate clusters (RolloutResult)
alt timeout clusters exists and .spec.placementRefs[*].rolloutStrategy.abortOnFailure is true
Note over Work, PluginServer: Start automatic abort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to set abort when Degraded or ValidateFailed.

So this will also need to be set during the ProgressRollout func and the ValidateRolloutFunc is that right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this will be set by MWRS controller. Please see the proposal in #164

rolloutStrategy:
type: Progressive
# plugin is optional.
plugin: my-rollout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need any failsafe if the plugin that the user adds causes issues? Or is that dependent on the user?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please elaborate on failsafe ? In general, user should be able to recognize it. using plugin is opt-in feature. MWRS should show the error in status if Plugin is throwing error or unavailable. It might be better to stop the reconciler loop rather than self-resolving the problem.


* `BeginRollout()`: Called before applying the ManifestWork to a target cluste to roll out new revision, allowing the plugin to perform any necessary preparations.
* `ProgressRollout()`: Called during every reconciliation loop to report the current rollout status to the plugin.
* `ValidateRolloutCompletion()`: Called after the `Progressing` condition on the target cluster's ManifestWork becomes False. This hook enables post-rollout testing before the status is set to Succeeded. For example, the plugin could use this call to create a new ManifestWork that runs a Kubernetes Job for verification. **Note that this hook must be called within the `ProgressDeadline` timeout.**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the rollout status is "Validating", what's the LastTransitionTime of it? LastTransitionTime is used to calculate timeout clusters.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LastTransitionTime will be the time when Progressing (ManifestWork) == False which is the same as the current behavior.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make it clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants