WIP: [Feat] Support StormService pause rollout in upgrade #1765
* Update stormservice golang client
* Improve the test coverage
* Refactor the API to support manual resume
* improve the canary features
* Leave e2e test to future PRs
* fix lint and verify issues
* Polish the canary status
* Simplify the canary status fields
* Final patch

Signed-off-by: Jiaxin Shan <[email protected]>

Update CRDs + Yamls

Signed-off-by: Jiaxin Shan <[email protected]>

Remove RoleCanaryCounts and TotalCanaryPods

Signed-off-by: Jiaxin Shan <[email protected]>

Simplify the PauseStep by removing the waiting period

Signed-off-by: Jiaxin Shan <[email protected]>
Signed-off-by: Jiaxin Shan <[email protected]>
Summary of Changes

Hello @Jeffwan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust canary deployment mechanism for StormService resources, providing fine-grained control over application upgrades. It allows users to define a sequence of steps for gradually shifting traffic to new versions and to strategically pause the rollout for validation or manual checks. This feature significantly enhances the safety and flexibility of deploying changes to StormService-managed applications.
Code Review
This pull request introduces a significant feature: canary deployment support for StormService upgrades. It adds new API types for defining canary strategies (CanaryUpdateStrategy, CanaryStep, PauseStep) and tracking their status (CanaryStatus). The controller logic is extensively updated to handle the canary lifecycle, including initialization, step processing (both weight-based and pauses), and completion, with distinct handling for replica and pooled modes. The changes are well-structured and accompanied by new unit and integration tests. My review focuses on a critical discrepancy in the generated CRD files, a potential logic issue in the canary replica calculation, and several opportunities to improve clarity in the API documentation and logging to ensure the feature is robust and easy for users to understand.
      roleCanaryCounts:
        additionalProperties:
          format: int32
          type: integer
        type: object
      totalCanaryPods:
        format: int32
        type: integer
    type: object
There's a discrepancy between the CRD definition in this Helm chart and the Go API type. The fields roleCanaryCounts and totalCanaryPods are defined here but do not exist in the CanaryStatus struct in api/orchestration/v1alpha1/stormservice_types.go. This will cause apply/unmarshal errors for clients using the Go types and can lead to unexpected behavior. Please remove these fields from the CRD to align it with the source of truth in the API definition.
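Whether a given client actually fails on the extra fields depends on how it decodes: Go's `encoding/json` ignores unknown keys by default and only reports them when strict decoding is enabled. The sketch below illustrates that strict path. The `CanaryStatus` fields shown here (`currentStep`, `phase`) are hypothetical stand-ins, not the real struct from `stormservice_types.go`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// CanaryStatus is a hypothetical stand-in for the real API type;
// the field names are assumptions made for illustration only.
type CanaryStatus struct {
	CurrentStep int32  `json:"currentStep,omitempty"`
	Phase       string `json:"phase,omitempty"`
}

// decodeStrict rejects any JSON key that has no matching struct field.
func decodeStrict(data []byte) (CanaryStatus, error) {
	var s CanaryStatus
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&s)
	return s, err
}

func main() {
	// A payload shaped like the stale CRD schema, carrying a removed field.
	stale := []byte(`{"currentStep":1,"totalCanaryPods":3}`)
	if _, err := decodeStrict(stale); err != nil {
		fmt.Println("strict decode failed:", err)
	}
}
```

A check like this could run in a unit test against sample objects, so that schema drift between the CRD YAML and the Go types is caught before release.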
    if desired >= currentUpdated {
        // going UP: bounded by surge
        upper := currentUpdated + maxSurge
        if upper > total {
            upper = total
        }
        if desired < upper {
            achievable = desired
        } else {
            achievable = upper
        }
The logic for calculating the upper bound of achievable canary replicas when scaling up seems incorrect. The line upper := currentUpdated + maxSurge suggests that maxSurge can be added directly to the number of updated replicas. However, maxSurge is a limit on the total number of replicas (spec.replicas + maxSurge), not an incremental budget for new replicas. This could lead to violating the total replica count constraints. The correct calculation should be based on how many old RoleSets can be replaced while respecting both maxSurge and maxUnavailable on the total set of active RoleSets. A similar, more correct calculation is already used in canaryRollingUpdate to determine the surge count. Consider refactoring this logic to be more accurate and consistent.
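To make the concern concrete, here is a minimal sketch of a scale-up bound that treats `maxSurge` as a cap on the total number of active RoleSets rather than as a per-step increment. It is not the controller's code: the function names, the parameters, and the omission of `maxUnavailable` are all assumptions made for illustration.

```go
package main

import "fmt"

// surgeBudget returns how many additional RoleSets may be created right now
// without the active total exceeding specReplicas + maxSurge.
// All names here are hypothetical.
func surgeBudget(specReplicas, maxSurge, activeTotal int32) int32 {
	budget := specReplicas + maxSurge - activeTotal
	if budget < 0 {
		return 0
	}
	return budget
}

// achievableUpdated clamps the desired number of updated RoleSets to what the
// remaining surge budget allows in one reconcile pass when scaling up.
func achievableUpdated(desired, currentUpdated, specReplicas, maxSurge, activeTotal int32) int32 {
	if desired < currentUpdated {
		// The going-down branch is outside this sketch; leave the count unchanged.
		return currentUpdated
	}
	upper := currentUpdated + surgeBudget(specReplicas, maxSurge, activeTotal)
	if upper > specReplicas {
		upper = specReplicas
	}
	if desired < upper {
		return desired
	}
	return upper
}

func main() {
	// 10 desired replicas, maxSurge 2, 11 RoleSets already active (8 old + 3 new).
	// Only one more RoleSet fits under the cap of 12, so the bound is 4, not 5.
	fmt.Println(achievableUpdated(6, 3, 10, 2, 11)) // prints 4
}
```

With the same inputs, `currentUpdated + maxSurge` would yield 5, which would push the active total to 13 and exceed `spec.replicas + maxSurge`.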
    // Duration field is accepted but not implemented.
    // At this moment, all pauses are manual and require removing the pause condition to resume.
    //   - pause: {}      # this is accepted
    //   - pause:         # api accepted but not implemented.
    //       duration: "60s"
The comment here clearly states that the Duration field is not implemented and all pauses are manual. This is great for developers reading the Go code. However, this important information is not propagated to the CRD definition, which can be misleading for users of the API. It would be beneficial to add a +kubebuilder:doc comment to ensure this limitation is reflected in the CRD's description field for PauseStep.Duration.
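As a rough illustration of that suggestion: controller-gen generally copies a field's Go doc comment into the `description` of the generated CRD schema, so placing the caveat directly on the field should be enough to surface it. The type below is a hypothetical rendering; the real field type and tags in `stormservice_types.go` may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PauseStep is a hypothetical rendering of the API type, for illustration only.
type PauseStep struct {
	// Duration is accepted by the API but not implemented yet.
	// All pauses are currently manual: the rollout resumes only when the
	// pause condition is removed. Setting this field has no effect.
	// +optional
	Duration *string `json:"duration,omitempty"`
}

// render serializes a step the way it would appear in a manifest (as JSON).
func render(p PauseStep) string {
	out, err := json.Marshal(p)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	d := "60s"
	fmt.Println(render(PauseStep{}))             // {}
	fmt.Println(render(PauseStep{Duration: &d})) // {"duration":"60s"}
}
```

Both forms are accepted, matching the original comment; only the doc comment changes what users see in the CRD description.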
    // applyPooledModeCanaryWeight distributes new version across affected roles based on weight
    // Now with affected-role detection: only roles that changed will be updated
    func (r *StormServiceReconciler) applyPooledModeCanaryWeight(ctx context.Context, stormService *orchestrationv1alpha1.StormService, weight, totalReplicas int32, current *orchestrationv1alpha1.StormService, currentCR, updateCR *apps.ControllerRevision) error {
The logic for pooled mode canary updates seems to treat setWeight as a gate rather than a percentage. The applyPooledModeCanaryWeight function updates all pods of affected roles at once, regardless of the weight percentage. This behavior is different from replica mode and might be surprising to users. It would be beneficial to add a comment to the CanaryUpdateStrategy or CanaryStep API type definitions in stormservice_types.go to clarify how setWeight behaves in pooled mode (i.e., that it acts as a trigger to update the next set of changed roles, rather than a percentage-based rollout).
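The difference between the two interpretations can be sketched as follows. This is an illustration of the behavior described in the comment, not the controller's implementation: the function names are hypothetical, and rounding up in replica mode is an assumption.

```go
package main

import "fmt"

// replicaModeTarget treats weight as a percentage of the total replicas.
// Rounding up is an assumption made for this sketch.
func replicaModeTarget(weight, totalReplicas int32) int32 {
	if weight <= 0 {
		return 0
	}
	if weight >= 100 {
		return totalReplicas
	}
	return (weight*totalReplicas + 99) / 100
}

// pooledModeTarget treats any positive weight as a gate: every pod of an
// affected role is updated at once, regardless of the percentage.
func pooledModeTarget(weight, rolePods int32) int32 {
	if weight <= 0 {
		return 0
	}
	return rolePods
}

func main() {
	fmt.Println(replicaModeTarget(25, 8)) // prints 2
	fmt.Println(pooledModeTarget(25, 8))  // prints 8
}
```

A `setWeight: 25` step therefore means very different things in the two modes, which is the kind of behavior an API doc comment should spell out.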
    baseCurrent := int32(len(currentRevisionSets))
    baseUpdated := int32(len(updatedRevisionSets))
    expectCurrentReplica, expectUpdatedReplica := calculateReplicas(expectReplica, baseCurrent, baseUpdated)
    klog.Infof("scaling out stormservice %s/%s, current revision %s, updated revision %s, currentReplica %d, updatedReplica %d, expectCurrentReplica: %d, expectUpdatedReplica: %d", stormService.Namespace, stormService.Name, currentRevision, updatedRevision, len(currentRevisionSets), len(updatedRevisionSets), expectCurrentReplica, expectUpdatedReplica)
The log message uses currentReplica and updatedReplica as keys, which could be confused with the fields in StormServiceStatus. Since the values being logged are the counts of RoleSets from the live state (len(currentRevisionSets)), it would be clearer to use more specific key names in the log, such as currentRevisionSetCount and updatedRevisionSetCount, to avoid ambiguity.
Suggested change:

    - klog.Infof("scaling out stormservice %s/%s, current revision %s, updated revision %s, currentReplica %d, updatedReplica %d, expectCurrentReplica: %d, expectUpdatedReplica: %d", stormService.Namespace, stormService.Name, currentRevision, updatedRevision, len(currentRevisionSets), len(updatedRevisionSets), expectCurrentReplica, expectUpdatedReplica)
    + klog.Infof("scaling out stormservice %s/%s, current revision %s, updated revision %s, currentRevisionSetCount %d, updatedRevisionSetCount %d, expectCurrentReplica: %d, expectUpdatedReplica: %d", stormService.Namespace, stormService.Name, currentRevision, updatedRevision, len(currentRevisionSets), len(updatedRevisionSets), expectCurrentReplica, expectUpdatedReplica)