Collaborator

@Jeffwan Jeffwan commented Nov 19, 2025

Pull Request Description

[Please provide a clear and concise description of your changes here]

Related Issues

Resolves: #[Insert issue number(s)]

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

* Update stormservice golang client
* Improve the test coverage
* Refactor the API to support manual resume
* improve the canary features
* Leave e2e test to future PRs
* fix lint and verify issues
* Polish the canary status
* Simplify the canary status fields
* Final patch

Signed-off-by: Jiaxin Shan <[email protected]>

Update CRDs + Yamls

Signed-off-by: Jiaxin Shan <[email protected]>

Remove RoleCanaryCounts and TotalCanaryPods

Signed-off-by: Jiaxin Shan <[email protected]>

Simplify the PauseStep by removing the waiting period

Signed-off-by: Jiaxin Shan <[email protected]>
@gemini-code-assist
Contributor

Summary of Changes

Hello @Jeffwan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust canary deployment mechanism for StormService resources, providing fine-grained control over application upgrades. It allows users to define a sequence of steps for gradually shifting traffic to new versions and to strategically pause the rollout for validation or manual checks. This feature significantly enhances the safety and flexibility of deploying changes to StormService-managed applications.

Highlights

  • Canary Deployment Strategy: Introduced a new CanaryUpdateStrategy for StormService resources, enabling gradual rollouts with defined steps for traffic shifting and pauses.
  • Pause Rollout Support: Added functionality to pause canary deployments for manual intervention during an upgrade. The API also accepts a pause duration, but timed pauses are not implemented yet, so every pause must be resumed manually.
  • API and CRD Extensions: Extended the StormService API with new types like CanaryStatus, CanaryStep, and PauseStep, along with corresponding CRD schema updates and deepcopy implementations.
  • Controller Logic Integration: Integrated the canary deployment logic into the StormService controller's reconciliation loop, including handling of scaling, rollout constraints (maxUnavailable/maxSurge), and status updates during canary phases.
  • Enhanced Testing and Samples: Added comprehensive unit and integration tests for the new canary features, along with sample YAML configurations for both pooled and replica mode canary deployments.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant feature: canary deployment support for StormService upgrades. It adds new API types for defining canary strategies (CanaryUpdateStrategy, CanaryStep, PauseStep) and tracking their status (CanaryStatus). The controller logic is extensively updated to handle the canary lifecycle, including initialization, step processing (both weight-based and pauses), and completion, with distinct handling for replica and pooled modes. The changes are well-structured and accompanied by new unit and integration tests. My review focuses on a critical discrepancy in the generated CRD files, a potential logic issue in the canary replica calculation, and several opportunities to improve clarity in the API documentation and logging to ensure the feature is robust and easy for users to understand.

Comment on lines +4178 to +4186
    roleCanaryCounts:
      additionalProperties:
        format: int32
        type: integer
      type: object
    totalCanaryPods:
      format: int32
      type: integer
  type: object

critical

There's a discrepancy between the CRD definition in this Helm chart and the Go API type. The fields roleCanaryCounts and totalCanaryPods are defined here but do not exist in the CanaryStatus struct in api/orchestration/v1alpha1/stormservice_types.go. This will cause apply/unmarshal errors for clients using the Go types and can lead to unexpected behavior. Please remove these fields from the CRD to align it with the source of truth in the API definition.
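
Whether the mismatch surfaces as an error depends on how a client decodes the object. The following is a minimal sketch using only `encoding/json`; the `CanaryStatus` stand-in and its `CurrentStep` field are assumptions for illustration, not the real type from `stormservice_types.go`. A strict decoder rejects the extra CRD fields, while a lenient one drops them silently, which is the "unexpected behavior" case.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// CanaryStatus is a stand-in for the real API type.
// CurrentStep is a hypothetical field used only for this sketch.
type CanaryStatus struct {
	CurrentStep int32 `json:"currentStep,omitempty"`
}

// decodeStatus decodes raw JSON into CanaryStatus.
// In strict mode, fields that the Go type does not declare cause an error.
func decodeStatus(raw []byte, strict bool) (CanaryStatus, error) {
	var s CanaryStatus
	dec := json.NewDecoder(bytes.NewReader(raw))
	if strict {
		dec.DisallowUnknownFields()
	}
	err := dec.Decode(&s)
	return s, err
}

func main() {
	// The payload carries the two fields that exist only in the CRD schema.
	raw := []byte(`{"currentStep":1,"totalCanaryPods":3,"roleCanaryCounts":{"prefill":2}}`)

	_, err := decodeStatus(raw, true)
	fmt.Println("strict decode error:", err)

	s, err := decodeStatus(raw, false)
	fmt.Println("lenient decode:", s, err)
}
```

Either way the values of `roleCanaryCounts` and `totalCanaryPods` never reach Go code, so keeping them in the schema has no upside.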

Comment on lines +689 to +699
if desired >= currentUpdated {
	// going UP: bounded by surge
	upper := currentUpdated + maxSurge
	if upper > total {
		upper = total
	}
	if desired < upper {
		achievable = desired
	} else {
		achievable = upper
	}

high

The logic for calculating the upper bound of achievable canary replicas when scaling up seems incorrect. The line upper := currentUpdated + maxSurge suggests that maxSurge can be added directly to the number of updated replicas. However, maxSurge is a limit on the total number of replicas (spec.replicas + maxSurge), not an incremental budget for new replicas. This could lead to violating the total replica count constraints. The correct calculation should be based on how many old RoleSets can be replaced while respecting both maxSurge and maxUnavailable on the total set of active RoleSets. A similar, more correct calculation is already used in canaryRollingUpdate to determine the surge count. Consider refactoring this logic to be more accurate and consistent.
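
One way to read this suggestion is sketched below. It is not the PR's code and not the `canaryRollingUpdate` implementation: `achievableUpdated` and its parameters are hypothetical, and it only models the surge side (it assumes `maxSurge` caps the number of live replicas at `total + maxSurge`, and leaves `maxUnavailable` out).

```go
package main

import "fmt"

// achievableUpdated is a hypothetical helper for illustration only.
// It caps the next updated-replica target so that the number of live
// replicas (old + updated) never exceeds total+maxSurge.
func achievableUpdated(desired, currentUpdated, currentOld, total, maxSurge int32) int32 {
	if desired <= currentUpdated {
		// The going-down branch is outside the quoted excerpt;
		// this sketch returns the target unchanged.
		return desired
	}
	live := currentUpdated + currentOld
	// Room left under the surge ceiling for creating new replicas right now.
	room := total + maxSurge - live
	if room < 0 {
		room = 0
	}
	upper := currentUpdated + room
	if upper > total {
		upper = total
	}
	if desired < upper {
		return desired
	}
	return upper
}

func main() {
	// 10 replicas wanted, 2 updated and 8 old are live, maxSurge is 2.
	fmt.Println(achievableUpdated(5, 2, 8, 10, 2))
	// Surge already consumed: 2 updated and 10 old are live.
	fmt.Println(achievableUpdated(5, 2, 10, 10, 2))
}
```

The difference from the excerpt is that the budget shrinks as old replicas linger: once the live count reaches `total + maxSurge`, no further updated replicas are created until old ones are removed.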

Comment on lines +235 to +239
// Duration field is accepted but not implemented.
// At this moment, all pauses are manual and require removing the pause condition to resume.
//   - pause: {}        # this is accepted
//   - pause:           # api accepted but not implemented.
//       duration: "60s"

medium

The comment here clearly states that the Duration field is not implemented and all pauses are manual. This is great for developers reading the Go code. However, this important information is not propagated to the CRD definition, which can be misleading for users of the API. It would be beneficial to put this note in the Go doc comment attached to the field, since the generated CRD takes its description text from that comment, so that the limitation is reflected in the CRD's description field for PauseStep.Duration.
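
A minimal sketch of the behavior described in the quoted comment, with the limitation written as a field doc comment. The struct shapes, the `*string` type for `Duration`, and the `resumed` flag are assumptions for illustration; only the names `pause`, `duration`, and `setWeight` come from the PR.

```go
package main

import "fmt"

// PauseStep is a simplified stand-in for the API type named in the review.
type PauseStep struct {
	// Duration is accepted by the API but not implemented.
	// All pauses are manual and must be resumed explicitly.
	Duration *string
}

// CanaryStep is a simplified stand-in: a step either sets a weight or pauses.
type CanaryStep struct {
	SetWeight *int32
	Pause     *PauseStep
}

// holdsRollout reports whether a step blocks progress until a manual resume.
// Duration is deliberately ignored, matching the "manual only" behavior.
func holdsRollout(step CanaryStep, resumed bool) bool {
	return step.Pause != nil && !resumed
}

func main() {
	d := "60s"
	timed := CanaryStep{Pause: &PauseStep{Duration: &d}}
	fmt.Println(holdsRollout(timed, false)) // still held: the duration does not expire
	fmt.Println(holdsRollout(timed, true))  // released only after a manual resume
}
```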

Comment on lines +386 to +388
// applyPooledModeCanaryWeight distributes new version across affected roles based on weight
// Now with affected-role detection: only roles that changed will be updated
func (r *StormServiceReconciler) applyPooledModeCanaryWeight(ctx context.Context, stormService *orchestrationv1alpha1.StormService, weight, totalReplicas int32, current *orchestrationv1alpha1.StormService, currentCR, updateCR *apps.ControllerRevision) error {

medium

The logic for pooled mode canary updates seems to treat setWeight as a gate rather than a percentage. The applyPooledModeCanaryWeight function updates all pods of affected roles at once, regardless of the weight percentage. This behavior is different from replica mode and might be surprising to users. It would be beneficial to add a comment to the CanaryUpdateStrategy or CanaryStep API type definitions in stormservice_types.go to clarify how setWeight behaves in pooled mode (i.e., that it acts as a trigger to update the next set of changed roles, rather than a percentage-based rollout).
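
The contrast between the two modes can be sketched as follows. Everything here is hypothetical: the `Mode` type, the `updatedTarget` helper, and the round-up rule for replica mode are assumptions, since the review does not quote the actual replica-mode arithmetic.

```go
package main

import "fmt"

// Mode is a hypothetical enum for this sketch; the real API may model it differently.
type Mode int

const (
	ReplicaMode Mode = iota
	PooledMode
)

// updatedTarget returns how many units should run the new revision
// for a setWeight step with the given weight (0-100).
func updatedTarget(mode Mode, weight, total int32) int32 {
	if weight <= 0 {
		return 0
	}
	if weight > 100 {
		weight = 100
	}
	if mode == PooledMode {
		// Gate semantics: any positive weight updates every affected unit at once.
		return total
	}
	// Percentage semantics, rounding up (an assumption of this sketch).
	return (weight*total + 99) / 100
}

func main() {
	fmt.Println(updatedTarget(ReplicaMode, 25, 10))
	fmt.Println(updatedTarget(PooledMode, 25, 10))
}
```

With a weight of 25 and 10 units, the replica-mode branch targets a fraction of them while the pooled-mode branch targets all of them, which is the surprise the comment asks to document.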

baseCurrent := int32(len(currentRevisionSets))
baseUpdated := int32(len(updatedRevisionSets))
expectCurrentReplica, expectUpdatedReplica := calculateReplicas(expectReplica, baseCurrent, baseUpdated)
klog.Infof("scaling out stormservice %s/%s, current revision %s, updated revision %s, currentReplica %d, updatedReplica %d, expectCurrentReplica: %d, expectUpdatedReplica: %d", stormService.Namespace, stormService.Name, currentRevision, updatedRevision, len(currentRevisionSets), len(updatedRevisionSets), expectCurrentReplica, expectUpdatedReplica)

medium

The log message uses currentReplica and updatedReplica as keys, which could be confused with the fields in StormServiceStatus. Since the values being logged are the counts of RoleSets from the live state (len(currentRevisionSets)), it would be clearer to use more specific key names in the log, such as currentRevisionSetCount and updatedRevisionSetCount, to avoid ambiguity.

Suggested change
- klog.Infof("scaling out stormservice %s/%s, current revision %s, updated revision %s, currentReplica %d, updatedReplica %d, expectCurrentReplica: %d, expectUpdatedReplica: %d", stormService.Namespace, stormService.Name, currentRevision, updatedRevision, len(currentRevisionSets), len(updatedRevisionSets), expectCurrentReplica, expectUpdatedReplica)
+ klog.Infof("scaling out stormservice %s/%s, current revision %s, updated revision %s, currentRevisionSetCount %d, updatedRevisionSetCount %d, expectCurrentReplica: %d, expectUpdatedReplica: %d", stormService.Namespace, stormService.Name, currentRevision, updatedRevision, len(currentRevisionSets), len(updatedRevisionSets), expectCurrentReplica, expectUpdatedReplica)
