GCP: Cluster Autoscaler only discovers first MIG when multiple MIGs share prefix in different zones #7772

Open
Debraj-git opened this issue Jan 27, 2025 · 5 comments

Debraj-git commented Jan 27, 2025

Description:
We've identified an issue with Cluster Autoscaler (CAS) where it fails to discover and manage all MIGs when multiple MIGs share the same name prefix but exist in different zones.

Version:
Cluster Autoscaler: v1.27.5
Chart version: cluster-autoscaler-9.21.1
Cloud Provider: GCP (GKE)

Current Behavior:

When using auto-discovery with a prefix pattern, CAS discovers and manages only the first MIG it finds.
Nodes in additional MIGs with the same prefix in other zones are reported with "no node group config".
Once a zone is cached for a MIG prefix, CAS never discovers matching MIGs in other zones.

Expected Behavior:
CAS should discover and manage all MIGs that match the specified prefix pattern, regardless of their zone.
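
To make the expected behavior concrete, here is a minimal sketch in Go. It is not the cluster-autoscaler implementation: the `mig` type, the `discoverByPrefix` function, and the zone and MIG names are all hypothetical and exist only for illustration. The point is that every MIG whose name matches the prefix is returned, whichever zone it lives in.

```go
package main

import (
	"fmt"
	"regexp"
)

// mig is a hypothetical stand-in for a managed instance group reference.
// It is NOT the type used by the cluster-autoscaler GCE provider.
type mig struct {
	Zone string
	Name string
}

// discoverByPrefix returns every MIG whose name starts with prefix,
// across all zones, instead of stopping at the first match.
func discoverByPrefix(all []mig, prefix string) []mig {
	// QuoteMeta keeps characters in the prefix from being read as regexp syntax.
	re := regexp.MustCompile("^" + regexp.QuoteMeta(prefix) + ".+")
	var matched []mig
	for _, m := range all {
		if re.MatchString(m.Name) {
			matched = append(matched, m)
		}
	}
	return matched
}

func main() {
	// Zones and names below are illustrative placeholders.
	all := []mig{
		{Zone: "zone-a", Name: "pool-spot-aaaa-grp"},
		{Zone: "zone-b", Name: "pool-spot-bbbb-grp"},
		{Zone: "zone-b", Name: "pool-ondemand-cccc-grp"},
	}
	for _, m := range discoverByPrefix(all, "pool-spot-") {
		fmt.Printf("%s/%s\n", m.Zone, m.Name)
	}
}
```

With this input, both `pool-spot-` MIGs are returned, one from each zone, and the `pool-ondemand-` MIG is skipped.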

Reproduction Steps:
Have a GKE cluster with node pools spanning multiple zones.
Each node pool creates 2 MIGs with the same prefix in different zones.

Configure CAS with auto-discovery:

      containers:
      - command:
        - ./cluster-autoscaler
        - --cloud-provider=gce
        - --namespace=kube-system
        - --node-group-auto-discovery=mig:namePrefix=gke-gke-cluster-oa5a-enpla9up21-spot-,min=2,max=6
        - --balance-similar-node-groups=true
        - --expander=priority
        - --logtostderr=true
        - --max-node-provision-time=5m
        - --min-replica-count=0
        - --scale-down-delay-after-add=15m
        - --scale-down-delay-after-delete=5m
        - --scale-down-delay-after-failure=3m
        - --scale-down-enabled=true
        - --scale-down-unneeded-time=5m
        - --scale-down-utilization-threshold=0.7
        - --scan-interval=1m
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=true
        - --stderrthreshold=info
        - --v=4

[Image: MIGs with the same prefix across 2 zones]

Logs:
I0127 05:19:47.609819       1 autoscaling_gce_client.go:504] found managed instance group gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-grp matching regexp ^gke-gke-cluster-oa5a-enpla9up21-spot.+
I0127 05:19:47.761732       1 mig_info_provider.go:185] Regenerating MIG instances cache for production-platform-402407/us-east4-b/gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-grp
// Only processes nodes from one MIG, ignores others
I0127 05:19:47.834646       1 pre_filtering_processor.go:67] Skipping gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-g3de - node group min size reached (current: 2, min: 2)
I0127 05:19:47.834663       1 pre_filtering_processor.go:57] Node gke-gke-cluster-oa5a-enpla9up21-spot--ba82cb9b-6v9m should not be processed by cluster autoscaler (no node group config)  //this node belongs to a different MIG in the same nodepool
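
As a quick sanity check on the names in these logs, the sketch below applies the logged regexp to the one MIG name and the two node names that appear above. This is only an illustration: CAS matches the pattern against MIG names, not node names, and `matchesPattern` is a hypothetical helper. The node names are used here only to show that both nodes carry the same prefix, so name matching alone would not obviously exclude the second MIG, assuming its name shares its node's prefix.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesPattern reports whether name matches the given regexp pattern.
// Hypothetical helper for illustration; not part of cluster-autoscaler.
func matchesPattern(pattern, name string) bool {
	return regexp.MustCompile(pattern).MatchString(name)
}

func main() {
	// Pattern and names are copied verbatim from the log lines above.
	pattern := `^gke-gke-cluster-oa5a-enpla9up21-spot.+`
	names := []string{
		"gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-grp",  // discovered MIG
		"gke-gke-cluster-oa5a-enpla9up21-spot--8a413476-g3de", // node in that MIG
		"gke-gke-cluster-oa5a-enpla9up21-spot--ba82cb9b-6v9m", // node reported with no node group config
	}
	for _, n := range names {
		fmt.Printf("%-5t %s\n", matchesPattern(pattern, n), n)
	}
}
```

All three names match the pattern.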

@rpsadarangani

/label kind-bug

@k8s-ci-robot
Contributor

@rpsadarangani: The label(s) /label kind-bug cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor, ci-short, ci-extended, ci-full. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label kind-bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rpsadarangani

/area provider-gcp

@k8s-ci-robot
Contributor

@rpsadarangani: The label(s) area/provider-gcp cannot be applied, because the repository doesn't have them.

In response to this:

/area provider-gcp


@adrianmoisey
Member

/area cluster-autoscaler
