
Configure capacity of the worker nodes #877

Open
palade opened this issue Sep 27, 2019 · 19 comments
Labels
kind/design: Categorizes issue or PR as related to design.
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/backlog: Higher priority than priority/awaiting-more-evidence.

Comments

@palade

palade commented Sep 27, 2019

Would it be possible to set the capacity of the worker nodes when the cluster is created?

@palade palade added the kind/support Categorizes issue or PR as a support question. label Sep 27, 2019
@aojea
Contributor

aojea commented Sep 27, 2019

can you elaborate a bit more?
what's your use case?

@palade
Author

palade commented Sep 27, 2019

@aojea Doing some scheduler work and would like to consider the CPU and memory capacities of each node. I could use labels for this, but I was wondering if it is possible to do this when the cluster is set up? Also, if labels are the only option, would it be possible to tag each node with particular labels from the initialisation script?

@aojea
Contributor

aojea commented Sep 27, 2019

Well, that seems interesting. @BenTheElder what do you think?
Basically the worker nodes are docker containers, so we should be able to use docker resource constraints to limit them: https://docs.docker.com/config/containers/resource_constraints/
However, I don't know how this will work with nested cgroups 🤔
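For reference, a minimal sketch of the docker-level constraints being discussed, using flags from the linked docs. The container name kind-worker is kind's default for a single worker; whether the kubelet inside the node actually respects these limits is the open question:

# Sketch: apply memory/CPU limits to an existing kind node container.
docker update --memory 100m --memory-swap 100m --cpus 1 kind-worker

# Check what docker thinks the limits are.
docker inspect -f '{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}' kind-worker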

@WalkerGriggs
Contributor

I don't know how this will work with nested cgroups

I might be wrong, but I don't think setting resource upper bounds will impact the current cgroup architecture. I do see performance issues with starving the node of resources, though.

I'm thinking about the UX side of things too; Docker resource constraints are pretty granular. Maybe we only expose some subset of the constraints, or maybe abstract them all together?

@BenTheElder
Member

Feel free to try this out but IIRC this doesn't work.

Similarly, if swap is enabled on the host, memory limits won't work on your pods either.

@BenTheElder
Member

I'm working on decoupling us from docker's command line. When that is complete and we experiment again with support for ignite and other backends, some of those can actually limit things, because while they are based around running container images they use VMs :+)

@aojea
Contributor

aojea commented Oct 1, 2019

docker resource constraints are working for me with swap, I'll send a PR implementing it.
I have one node limited to 100M in this example:

[screenshot showing one node limited to 100M]

@aojea
Contributor

aojea commented Oct 1, 2019

/assign

@BenTheElder
Member

docker resource constraints are working for me with swap, I'll send a PR implementing it
I have one node limited to 100M in this example

That of course works but ... does it actually limit everything on the node? Have you deployed a pod trying to use more? What does kubelet report?

@aojea
Contributor

aojea commented Oct 1, 2019

kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
# the control plane node
- role: control-plane
- role: worker
  constraints:
    memory: "100m"
    cpu: "1"

from https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#specify-a-memory-request-and-a-memory-limit

I modified it to use directly and try to use 1.5G of memory:

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "1500M", "--vm-hang", "1"]

The pod takes more than 4 mins to be created. It doesn't seem to be a hard limit, maybe we should tweak something on cgroups, but checking inside the node it really does seem to be limiting the memory:

Tasks:  19 total,   1 running,  18 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  2.5 sy,  0.0 ni, 16.7 id, 80.3 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  32147.3 total,  16816.6 free,   1885.6 used,  13445.2 buff/cache
MiB Swap:   2055.0 total,    901.4 free,   1153.6 used.  29866.1 avail Mem

USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                
root      20   0  140504   4916      0 S   4.3   0.0   1:16.80 kube-proxy
root      20   0  130236   1720      0 D   3.7   0.0   0:30.99 kindnetd
root      20   0 2214724  70912  60684 S   3.3   0.2   0:37.25 kubelet
root      20   0 1587948  37516     24 D   3.0   0.1   0:36.98 stress
root      20   0 2210024  30812  23940 S   2.7   0.1   0:34.11 containerd
root      20   0    9336   4180   4180 S   1.3   0.0   0:01.93 containerd-shim
root      20   0   10744   4180   4180 S   0.7   0.0   0:01.70 containerd-shim
root      19  -1   22656   6684   6508 S   0.3   0.0   0:01.78 systemd-journal
root      20   0    6024   2756   2648 R   0.3   0.0   0:00.11 top                    
root      20   0   17524   7688   7688 S   0.0   0.0   0:00.53 systemd
root      20   0   10744   4180   4180 S   0.0   0.0   0:02.67 containerd-shim
root      20   0    1024      0      0 S   0.0   0.0   0:00.00 pause
root      20   0    9336   4180   4180 S   0.0   0.0   0:02.23 containerd-shim
root      20   0    1024      0      0 S   0.0   0.0   0:00.00 pause
root      20   0   10744   4608   4564 S   0.0   0.0   0:00.81 containerd-shim
root      20   0    1024      0      0 S   0.0   0.0   0:00.00 pause
root      20   0   10744   3980   3980 S   0.0   0.0   0:00.91 containerd-shim
root      20   0     744      0      0 S   0.0   0.0   0:00.06 stress
root      20   0    4052   2936   2936 S   0.0   0.0   0:00.05 bash
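One way to double-check the limit from inside the node (a sketch; the first path assumes cgroup v1, which is what most hosts ran at the time, the second is the cgroup v2 equivalent):

# cgroup v1: memory limit of the node container, in bytes.
docker exec kind-worker cat /sys/fs/cgroup/memory/memory.limit_in_bytes

# cgroup v2 hosts expose the same limit here instead.
docker exec kind-worker cat /sys/fs/cgroup/memory.max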

@aojea
Contributor

aojea commented Oct 1, 2019

Looking at the kernel docs, it seems that this is throttling: https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt. Check the block I/O stats:

CONTAINER ID        NAME                 CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
1698a9d1be92        kind-worker          14.64%              99.42MiB / 100MiB     99.42%              4.34MB / 361kB      1.91GB / 1.04GB     155                         
1a1a6fb0f69a        kind-control-plane   6.75%               1.268GiB / 31.39GiB   4.04%               512kB / 2.03MB      0B / 81.7MB         392       

Do we want this, or is the idea to fail if it overcommits?
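For reference, the table above looks like docker stats output; something along these lines should reproduce it:

# One-shot snapshot of the node containers' resource usage.
docker stats --no-stream kind-worker kind-control-plane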

@BenTheElder BenTheElder added kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. labels Oct 2, 2019
@BenTheElder BenTheElder added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Nov 5, 2019
@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2020
@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 4, 2020
@BenTheElder BenTheElder added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Mar 15, 2020
@kubernetes-sigs kubernetes-sigs deleted a comment from fejta-bot Mar 15, 2020
@kubernetes-sigs kubernetes-sigs deleted a comment from fejta-bot Mar 15, 2020
@aojea
Contributor

aojea commented Sep 9, 2020

I think that there are several options:

  • use a provider that uses VMs for the nodes
  • implement something like lxcfs to "fake" the resources and cheat cadvisor and the kubelet

Otherwise you can set the limit manually as explained here:
#1524

Using container constraints (cgroups) is only valid for limiting the resources, but the kubelet keeps using the whole host memory and CPU resources for its calculations.
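A rough sketch of the manual approach, which may differ in detail from what #1524 describes: reserve most of the host's resources through the kubelet's system-reserved flag so that the node's allocatable capacity looks small (the reserved amounts below are made up):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        # Hide most of the host's resources; only what's left becomes allocatable.
        system-reserved: memory=28Gi,cpu=14

Note this only changes what the scheduler sees as allocatable; it does not stop processes on the node from actually using more.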

@louiznk

louiznk commented Oct 14, 2020

using container constraints (cgroups) is only valid for limiting the resources, but kubelet keeps using the whole host memory and cpu resources for its calculations.

Hello @aojea ,
This PR on cAdvisor addresses this point.
I hope this will help.
Thanks

@aojea
Contributor

aojea commented Oct 14, 2020

using container constraints (cgroups) is only valid for limiting the resources, but kubelet keeps using the whole host memory and cpu resources for its calculations.

Hello @aojea ,
This PR on cAdvisor addresses this point.
I hope this will help.
Thanks

that sounds nice, do you think it has chances to be approved?

@louiznk

louiznk commented Oct 14, 2020

using container constraints (cgroups) is only valid for limiting the resources, but kubelet keeps using the whole host memory and cpu resources for its calculations.

Hello @aojea ,
This PR on cAdvisor addresses this point.
I hope this will help.
Thanks

that sounds nice, do you think it has chances to be approved?

I hope 🤷🏻‍♂️

@BenTheElder
Member

Sadly no re: cAdvisor. This doesn't leave us with spectacular options. Maybe we can trick the kubelet into reading our own "vfs" or something (like lxcfs?) 😬, semi related: #2318's solution.
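For anyone curious about the lxcfs idea: lxcfs serves per-cgroup views of files like /proc/meminfo, and the trick would be to bind-mount those over the node's own /proc files so cadvisor/kubelet read the constrained numbers. A sketch only, untested with kind; the host paths assume lxcfs is already running as lxcfs /var/lib/lxcfs:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  extraMounts:
  # Overlay lxcfs's cgroup-aware proc files onto the node; whether the
  # runtime mount ordering allows this in practice is unverified.
  - hostPath: /var/lib/lxcfs/proc/meminfo
    containerPath: /proc/meminfo
  - hostPath: /var/lib/lxcfs/proc/cpuinfo
    containerPath: /proc/cpuinfo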

@LambertZhaglog

Doing some scheduler work and would like to consider the CPU and memory capacities of each node. I could use labels for this...

@palade Did you mean we can limit a node's CPU and memory capacities provided to the kubernetes cluster by assigning some labels to the node? Which labels do you use? Can you give me an example? Thanks a lot.
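Not necessarily what @palade did, but as an illustration of the labels idea: one can tag nodes with made-up capacity labels after the cluster comes up and have a custom scheduler read them. The kubelet still reports the real capacity, so only something that consumes these labels would honor them (label keys/values here are hypothetical):

# Advertise pretend capacities as labels on a kind worker node.
kubectl label node kind-worker example.io/cpu-capacity=2 example.io/memory-capacity=4Gi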

@hwdef
Member

hwdef commented Oct 20, 2023

Any progress? Will we still be able to do this?

@BenTheElder
Member

kubernetes/kubernetes#120832
