bulk scale-up in azure creates only one node per iteration sometimes #1984

Closed
palmerabollo opened this issue May 3, 2019 · 3 comments · Fixed by #2152

palmerabollo commented May 3, 2019

I think that cluster-autoscaler (CA) 1.3.x in Azure has problems dealing with affinity rules.

I use the following deployment to deploy a "pause" pod with two rules:

  • nodeAffinity: pods must run on a node in the agentpool named "genmlow"
  • podAntiAffinity: no two of these pods may be scheduled on the same node

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  labels:
    app: pause
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: poolName
                    operator: In
                    values:
                      - genmlow
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - pause
              topologyKey: kubernetes.io/hostname
      containers:
      - image: "karlherler/pause:1.0"
        name: pause

The agentpool "genmlow" uses Standard_DS2_v2 machines (8GB) in a virtual machine scale set.

When I scale the number of replicas to 10 (kubectl scale deployment pause --replicas=10), the cluster autoscaler (version 1.3.9, k8s 1.11.8) creates only one node per iteration, as if it were ignoring the affinity rules. See the cluster-autoscaler logs below, where the node count goes from 0->1->2->...->N.

I0503 14:03:19.299146       1 azure_manager.go:261] Refreshed ASG list, next refresh after 2019-05-03 14:04:19.2991386 +0000 UTC m=+948.211672501
I0503 14:03:19.993383       1 scale_up.go:249] Pod default/pause-66cf84dcdb-2khzb is unschedulable
I0503 14:03:19.993412       1 scale_up.go:249] Pod default/pause-66cf84dcdb-l7587 is unschedulable
I0503 14:03:19.993418       1 scale_up.go:249] Pod default/pause-66cf84dcdb-t5mb8 is unschedulable
I0503 14:03:19.993422       1 scale_up.go:249] Pod default/pause-66cf84dcdb-xp2kn is unschedulable
I0503 14:03:19.993426       1 scale_up.go:249] Pod default/pause-66cf84dcdb-rpskf is unschedulable
I0503 14:03:19.993429       1 scale_up.go:249] Pod default/pause-66cf84dcdb-kkxc5 is unschedulable
I0503 14:03:19.993433       1 scale_up.go:249] Pod default/pause-66cf84dcdb-lbprj is unschedulable
I0503 14:03:19.993437       1 scale_up.go:249] Pod default/pause-66cf84dcdb-lmwmf is unschedulable
I0503 14:03:19.993441       1 scale_up.go:249] Pod default/pause-66cf84dcdb-c8njm is unschedulable
I0503 14:03:19.993446       1 scale_up.go:249] Pod default/pause-66cf84dcdb-gg6xh is unschedulable
...
I0503 14:03:20.071931       1 utils.go:187] Pod pause-66cf84dcdb-kkxc5 can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
I0503 14:03:20.072229       1 utils.go:187] Pod pause-66cf84dcdb-lbprj can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
I0503 14:03:20.072529       1 utils.go:187] Pod pause-66cf84dcdb-lmwmf can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
I0503 14:03:20.073242       1 utils.go:187] Pod pause-66cf84dcdb-c8njm can't be scheduled on k8s-genl-24772259-vmss. Used cached predicate check results
...
I0503 14:03:20.076758       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:03:20.076770       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:03:20.076783       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 0->1 (max: 1000)}]
I0503 14:03:20.076796       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 1
...
I0503 14:06:13.334377       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:06:13.334411       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:06:13.334470       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 1->2 (max: 1000)}]
I0503 14:06:13.334503       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 2
...
I0503 14:09:02.059191       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:09:02.059243       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:09:02.059310       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 2->3 (max: 1000)}]
I0503 14:09:02.059350       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 3
...
I0503 14:11:50.214206       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:11:50.214228       1 scale_up.go:382] Estimated 1 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:11:50.214245       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 3->4 (max: 1000)}]
I0503 14:11:50.214262       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 4
...
...

However, it only behaves this way when the pod has no resource requests. If I add the following requests:

  resources:
    requests:
      memory: 5Gi

Everything works as expected: the cluster autoscaler creates the 10 virtual machines in a single batch (0->10). I guess this is because the autoscaler now knows that it cannot fit two pods on a single node (5Gi + 5Gi > 8GB), even though it is still ignoring the affinity rules.

I0503 14:31:36.574678       1 scale_up.go:378] Best option to resize: k8s-genmlow-24772259-vmss
I0503 14:31:36.574722       1 scale_up.go:382] Estimated 10 nodes needed in k8s-genmlow-24772259-vmss
I0503 14:31:36.574752       1 scale_up.go:461] Final scale-up plan: [{k8s-genmlow-24772259-vmss 0->10 (max: 1000)}]
I0503 14:31:36.574786       1 scale_up.go:531] Scale-up: setting group k8s-genmlow-24772259-vmss size to 10
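
For reference, a sketch of the containers section of the deployment above with that request added (assuming it is placed under the pause container):

      containers:
      - image: "karlherler/pause:1.0"
        name: pause
        resources:
          requests:
            memory: 5Gi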

It looks like a bug to me. The same setup on AWS (with cluster autoscaler 1.2.x instead of 1.3.x as the only difference) works fine: the CA creates the 10 virtual machines regardless of whether you specify the container memory requests.

MaciekPytel (Contributor) commented

It's a known issue with pod affinity / anti-affinity: #257 (comment). The details are in the issue I linked, but in general pod affinity and (especially) anti-affinity don't work well with CA. It can cause CA to add nodes only one by one, as you observe, and it completely breaks CA performance on large clusters.
It's not easy to fix, because it's caused by pod affinity being implemented in a way that is conceptually incompatible with how the autoscaler works. Fixing it would require a significant refactor of either the scheduler or the autoscaler, neither of which is likely to happen soon.

palmerabollo (Author) commented May 6, 2019

Thanks @MaciekPytel. What I don't understand is why it works well on AWS. Shouldn't that logic be shared across all cloud provider implementations?

feiskyer (Member) commented Jul 1, 2019

/assign
