Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HOSTEDCP-2169: Add aro scheduler #5417

Merged
merged 1 commit into from
Jan 29, 2025

Conversation

Patryk-Stefanski
Copy link
Contributor

@Patryk-Stefanski Patryk-Stefanski commented Jan 17, 2025

What this PR does / why we need it:

The reason for implementing this scheduler is to help Azure Red Hat OpenShift (ARO) scale to 500 nodes. With this scheduler, ARO can dynamically control the CPU and memory requests for all pods and the Kube API Server (KAS) GoMem limits based on the number of nodes a HostedCluster has. This capability is essential for efficiently scaling a HostedCluster to 500 nodes.

Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story:
Fixes # HOSTEDCP-2169

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 17, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Jan 17, 2025

@Patryk-Stefanski: This pull request references HOSTEDCP-2169 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story:
Fixes #

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from csrwng and hasueki January 17, 2025 11:14
@Patryk-Stefanski Patryk-Stefanski marked this pull request as draft January 17, 2025 11:14
@openshift-ci openshift-ci bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Jan 17, 2025
support/util/scheduler.go Outdated Show resolved Hide resolved
support/util/scheduler.go Outdated Show resolved Hide resolved
@Patryk-Stefanski Patryk-Stefanski force-pushed the HOSTEDCP-2169 branch 2 times, most recently from f07a91e to 51f3c10 Compare January 21, 2025 15:21
@Patryk-Stefanski Patryk-Stefanski marked this pull request as ready for review January 21, 2025 19:46
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 21, 2025
@openshift-ci openshift-ci bot requested a review from enxebre January 21, 2025 19:47
@Patryk-Stefanski Patryk-Stefanski force-pushed the HOSTEDCP-2169 branch 2 times, most recently from 8583818 to 0ffcb3a Compare January 22, 2025 15:51
@Patryk-Stefanski
Copy link
Contributor Author

/retest-required

@Patryk-Stefanski
Copy link
Contributor Author

/test e2e-aws

support/util/scheduler.go Outdated Show resolved Hide resolved
@enxebre
Copy link
Member

enxebre commented Jan 23, 2025

can you please include in the PR desc or godoc the purpose and outcome of this new controller? e.g.
"The controller/scheduler reconciles HostedClusters applyng the corresponding annotations based on the size label assigned to the given HostedCluster. Based on those annotations the hostedCluster controller can adjust compute consumption choices and influence scheduling via affinity/antiaffinity selector/tolerations." (if that's accurate indeed)

@Patryk-Stefanski
Copy link
Contributor Author

/test e2e-aks-4-18

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jan 23, 2025

@Patryk-Stefanski: This pull request references HOSTEDCP-2169 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

What this PR does / why we need it:

The reason for implementing this scheduler is to help Azure Red Hat OpenShift (ARO) scale to 500 nodes. With this scheduler, ARO can dynamically control the CPU and memory requests for all pods and the Kube API Server (KAS) GoMem limits based on the number of nodes a HostedCluster has. This capability is essential for efficiently scaling a HostedCluster to 500 nodes.

Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story:
Fixes # HOSTEDCP-2169

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the area/documentation Indicates the PR includes changes for documentation label Jan 24, 2025
@muraee
Copy link
Contributor

muraee commented Jan 24, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 24, 2025
@Patryk-Stefanski
Copy link
Contributor Author

/retest-required

"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func TestAzureSchedulerReconcile(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is within the scheduler/azure, so TestReconcile seems should be sufficient

sizingConfig: sizingConfig,
expectError: false,
expectRequeue: false,
expectAnnotations: map[string]string{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason to not validate the rest of the effects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ill add test cases for the rest of the effects to TestSetHostedClusterSchedulingAnnotations

@enxebre
Copy link
Member

enxebre commented Jan 27, 2025

can we can an e2e similar to TestCreateClusterRequestServingIsolation?

@enxebre
Copy link
Member

enxebre commented Jan 27, 2025

lgtm other than #5417 (comment)
I'd defer to @csrwng for his review and tag

@Patryk-Stefanski
Copy link
Contributor Author

can we can an e2e similar to TestCreateClusterRequestServingIsolation?

Yep I was planning on creating a follow up task to e2e this.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2025
Copy link
Contributor

@csrwng csrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit on the doc, otherwise lgtm


- AKS cluster with cluster-autoscaler enabled and using Standard\_D4s\_v4 VMs for this example. (--enable-cluster-autoscaler flag when installing AKS cluster, with --min-count 2 --max-count 6)
- Hypershift operator with size tagging enabled. (--enable-size-tagging flag when installing hypershift operator)
- HostedCluster with the `ClusterSizingConfiguration` resource created. (Default resource is created by the hypershift operator)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little confusing. What is a HostedCluster with ClusterSizingConfiguration?
I would call them out as separate prereqs.
It would also be good to specify the name of the HostedCluster and NodePool that you're using throughout your example.

increase: 0s
```

3. Scale nodepool up to 3 nodes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this there is no mention of nodepool pstefans-3

@csrwng
Copy link
Contributor

csrwng commented Jan 27, 2025

/approve

Copy link
Contributor

openshift-ci bot commented Jan 27, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, Patryk-Stefanski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2025
Copy link
Contributor

openshift-ci bot commented Jan 27, 2025

@Patryk-Stefanski: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn f343ade link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@devguyio
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 29, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 55c2de5 into openshift:main Jan 29, 2025
13 of 14 checks passed
@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

Distgit: hypershift
This PR has been included in build ose-hypershift-container-v4.19.0-202501291707.p0.g55c2de5.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants