KEP 5554: In place update pod resources alongside static cpu manager policy KEP creation #5555
Conversation
esotsal commented on Sep 21, 2025
- One-line PR description: Create new KEP 5554: In place update pod resources alongside static cpu manager policy
- Issue link: Support In place update pod resources alongside static cpu manager policy #5554
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: esotsal. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 4c5c393 to 1240d58.
@esotsal: GitHub didn't allow me to request PR reviews from the following users: Chunxia202410. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 1240d58 to 8973b16.
Force-pushed from c651aa2 to 24bfb5c.
Force-pushed from a0b72b1 to abc61a0.
* The pod's current NUMA Affinity (single or multiple NUMA nodes) can accommodate the resized CPU request. If yes, the current affinity is maintained.
* If the current affinity is insufficient and expansion is needed, the new NUMA affinity must always include all NUMA nodes from the current affinity.
Hi @esotsal, do you think it is necessary to generate more fine-grained NUMA hints here?
Consider an example for scale-down.
The server has 2 NUMA nodes:
NUMA 0: CPUs {0,1,2,3,4,5,6,7}
NUMA 1: CPUs {8,9,10,11,12,13,14,15}
CPU 0 is reserved.
There is a Pod with 10 CPUs and the CPU set is {4,5,8-15}, where CPUs 10 and 11 are promised CPUs.
The Pod scales down from 10 to 8 CPUs (which actually requires only 1 NUMA node).

- According to the KEP design:
  The CPU NUMA hints would be {0,1} with preferred = false (if I understand correctly).
  Assuming the Memory and Device managers both have two NUMA hints, {1} with preferred = true and {0,1} with preferred = false, the merged best NUMA hint would be {0,1} with preferred = false (according to my understanding).
  If the topology manager policy is restricted, since preferred = false for the best NUMA hint, the Pod resize would be rejected.
- Consider a more fine-grained solution for scale-down: NUMA hints should be the combinations of nodes containing the original CPUs, and must include the promised CPUs.
  The CPU NUMA hints would be {1} with preferred = true and {0,1} with preferred = false.
  Assuming the Memory and Device managers again have {1} with preferred = true and {0,1} with preferred = false, the merged best NUMA hint would be {1} with preferred = true. If the topology manager policy is restricted, the Pod would be resized.

If my understanding is incorrect, please feel free to point it out. Thank you.
I'm having a hard time picturing a downsize which can cause a topology manager policy failure. I can think of examples for upsize, but none for downsize. Case in point: a downsize is a transition which at worst keeps the container shape as restrictive as the original allocation; it can't be worse (e.g. we can't downsize in a way which violates the single-numa-node policy: if the allocation spans across 2 nodes, at worst the downsized one will still span across 2 nodes).
While writing this comment, I realized that we should probably just make explicit that on downsize the end result must be a subset of the original result. This will help.
Yes, option 2 is what I was thinking as well, @Chunxia202410, for the use case you describe. @ffromani has described it very well; I will add this sentence to the KEP for clarity.
I have confirmed this behaviour, in this KEP's PR, against a VM with the same CPU topology as the example you provided, using the restricted topology manager policy: creating a pod with one container with 2 CPUs, then scaling it up to 10, and afterwards scaling down to 2. I will push the added tests (as well as tests against the other topology manager policies, single-numa-node and best-effort) in the PR; I need to do some clean up and will try to push the commit today or tomorrow.
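Editor's note: below is a minimal Go sketch of the fine-grained scale-down hint generation discussed in this thread, applied to the 2-NUMA-node example above. The types and names (`numaHint`, `scaleDownHints`) are illustrative assumptions and do not correspond to kubelet code; the real topology manager works on bitmasks and marks minimal-width hints as preferred.

```go
package main

import "fmt"

// numaHint mimics the shape of a topology manager hint: a set of NUMA node IDs
// plus a "preferred" flag. All names here are illustrative, not kubelet types.
type numaHint struct {
	Nodes     []int
	Preferred bool
}

// scaleDownHints sketches the fine-grained option discussed above: on a downsize,
// emit one hint per combination of NUMA nodes that (a) already holds CPUs of the
// container, (b) contains all promised CPUs and (c) can fit the shrunken request
// from the CPUs it already holds. Single-node combinations are marked preferred
// here for simplicity; the real topology manager prefers minimal-width hints.
func scaleDownHints(perNode map[int][]int, assigned, promised map[int]bool, request int) []numaHint {
	// Nodes currently used by the container.
	var used []int
	for node, cpus := range perNode {
		for _, c := range cpus {
			if assigned[c] {
				used = append(used, node)
				break
			}
		}
	}

	var hints []numaHint
	// Enumerate all non-empty subsets of the used nodes.
	for mask := 1; mask < 1<<len(used); mask++ {
		var subset []int
		capacity := 0
		covered := map[int]bool{}
		for i, node := range used {
			if mask&(1<<i) == 0 {
				continue
			}
			subset = append(subset, node)
			for _, c := range perNode[node] {
				covered[c] = true
				if assigned[c] {
					capacity++
				}
			}
		}
		// Must still contain every promised CPU and fit the downsized request.
		ok := capacity >= request
		for c := range promised {
			if !covered[c] {
				ok = false
			}
		}
		if ok {
			hints = append(hints, numaHint{Nodes: subset, Preferred: len(subset) == 1})
		}
	}
	return hints
}

func main() {
	perNode := map[int][]int{0: {0, 1, 2, 3, 4, 5, 6, 7}, 1: {8, 9, 10, 11, 12, 13, 14, 15}}
	assigned := map[int]bool{4: true, 5: true, 8: true, 9: true, 10: true, 11: true, 12: true, 13: true, 14: true, 15: true}
	promised := map[int]bool{10: true, 11: true}
	// Downsize from 10 to 8 CPUs: expect {1} preferred=true and {0,1} preferred=false.
	fmt.Println(scaleDownHints(perNode, assigned, promised, 8))
}
```

On this topology the function returns the {1}/preferred and {0,1}/non-preferred hints described in the second bullet of the comment above.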
Force-pushed from abc61a0 to 37fa4b5.
Gave an initial pass. Design-wise, mostly questions and some areas to be clarified; I didn't see anything unexpected or concerning.
* prefer-closest-numa-nodes (GA, visible by default) (1.32 or higher) [KEP-3545](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3545-improved-multi-numa-alignment)
* max-allowable-numa-nodes (beta, visible by default) (1.31 or higher) [KEP-4622](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4622-topologymanager-max-allowable-numa-nodes)
these are policy options, not topology manager policies. The options alter the behavior of the policies (where applicable)
* strict-cpu-reservation
* prefer-align-cpus-by-uncorecache

The Kubelet requires the total CPU reservation from `--kube-reserved` and `--system-reserved` to be greater than zero when the CPU Manager static policy is enabled. This KEP should ensure those reserved CPUs are kept during resize.
and/or `--reserved-cpus`
* must keep “promised” CPUs of a running container of a Guaranteed Pod.
  + With the term “promised” we mean the CPUs allocated upon creation of the Guaranteed Pod, checkpointed in the CPU Manager's checkpoint file; please refer to [Promised CPUs checkpoint](#promised-cpus-checkpoint) for more details of the proposed implementation using local storage.
* an attempt to allocate additional CPUs should adhere to the combination of Topology/Memory/CPU/Device/kubelet reservation policies
  + All 40 possible combinations should work
This reads scary, but I'm fairly confident that we can quickly rule out some combinations. For example, strict-cpu-reservation should never be a factor. OTOH ensuring continued correctness (= testing) may turn out to be nontrivial.
When the CPU Manager, under the static policy, generates a NUMA topology hint for a Guaranteed pod undergoing an in place CPU pod resize, it follows these rules to determine the new affinity:

* The pod's current NUMA Affinity (single or multiple NUMA nodes) can accommodate the resized CPU request. If yes, the current affinity is maintained.
* If the current affinity is insufficient and expansion is needed, the new NUMA affinity must always include all NUMA nodes from the current affinity.
I'm worried that extending the hint may violate some topology manager policies. But this should be handled at topology manager level. It makes sense for the cpumanager to generate that hint.
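Editor's note: a small Go sketch of the two rules quoted in this hunk (keep the current affinity if it fits, otherwise expand to a superset of it). The function name, parameters and capacity bookkeeping are assumptions for illustration, not the kubelet's hint-generation code; whether the expanded hint is then accepted remains up to the topology manager policy, as the comment above points out.

```go
package main

import "fmt"

// resizeAffinity sketches the resize rules: keep the current NUMA affinity if it
// can accommodate the resized request, otherwise expand it to a superset that
// always includes every node of the current affinity. currentByNode is the
// container's current allocation per node, freeByNode is additional allocatable
// CPUs per node; all names are illustrative.
func resizeAffinity(current []int, currentByNode, freeByNode map[int]int, request int) []int {
	capacity := 0
	for _, node := range current {
		capacity += currentByNode[node] + freeByNode[node]
	}
	if capacity >= request {
		return current // rule 1: the current affinity accommodates the resize
	}

	// rule 2: expand, but never drop a node from the current affinity
	affinity := append([]int{}, current...)
	inAffinity := map[int]bool{}
	for _, n := range current {
		inAffinity[n] = true
	}
	for node, free := range freeByNode {
		if capacity >= request {
			break
		}
		if inAffinity[node] || free == 0 {
			continue
		}
		affinity = append(affinity, node)
		capacity += free
	}
	return affinity
}

func main() {
	// Container pinned to NUMA 0 with 6 CPUs; 2 more CPUs free on NUMA 0, 8 on NUMA 1.
	fmt.Println(resizeAffinity([]int{0}, map[int]int{0: 6}, map[int]int{0: 2, 1: 8}, 8))  // fits: [0]
	fmt.Println(resizeAffinity([]int{0}, map[int]int{0: 6}, map[int]int{0: 2, 1: 8}, 10)) // expand: [0 1]
}
```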
* Under pkg/kubelet/types
  + constants.go
    + ErrorIncosistentCPUAllocation
    + ErrorProhibitedCPUAllocation
    + ErrorGetPromisedCPUSet
If these errors can bubble up to the user, and I think they can, we should be consistent in naming with TopologyAffinityError, SMTAlignmentError and UnexpectedAdmissionError.
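Editor's note: for illustration, a hedged sketch of how one of the proposed errors could follow the existing SMTAlignmentError shape (a struct with Error() and Type() methods, plus a reason-string constant). The concrete names and fields below are assumptions, not the KEP's final API.

```go
package main

import "fmt"

// ErrorInconsistentCPUAllocation is a hypothetical machine-readable reason string,
// modeled after the existing ErrorSMTAlignment constant used by SMTAlignmentError.
const ErrorInconsistentCPUAllocation = "InconsistentCPUAllocationError"

// InconsistentCPUAllocationError sketches how the new resize errors could mirror
// the SMTAlignmentError pattern: a struct carrying the offending CPU sets, with
// Error() for the human-readable message and Type() for the reason string.
// Field names are illustrative assumptions.
type InconsistentCPUAllocationError struct {
	Promised  string // promised CPU set recorded at pod creation
	Requested string // CPU set computed for the resize
}

func (e InconsistentCPUAllocationError) Error() string {
	return fmt.Sprintf("cannot resize container: promised CPUs %q are not preserved by the new allocation %q", e.Promised, e.Requested)
}

// Type mirrors SMTAlignmentError.Type(), so callers can map the error to a
// resize/admission failure reason surfaced to the user.
func (e InconsistentCPUAllocationError) Type() string {
	return ErrorInconsistentCPUAllocation
}

func main() {
	err := InconsistentCPUAllocationError{Promised: "10-11", Requested: "8-9"}
	fmt.Println(err.Type(), "-", err.Error())
}
```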
#### GA

* User feedback (ideally from at least two distinct users) is green,
* No major bugs reported for three months.
* 2 examples of real-world usage
* 2 installs
* Allowing time for feedback
These may likely need to be moved to beta because of the new graduation rules.
// CPUManagerCheckpoint struct is used to store cpu/pod assignments in a checkpoint in v3 format
type CPUManagerCheckpoint struct {
	PolicyName    string                       `json:"policyName"`
	DefaultCPUSet string                       `json:"defaultCpuSet"`
	Entries       map[string]map[string]string `json:"entries,omitempty"`
	// Promised records, per pod UID and container name, the CPU set allocated at container creation (new in the proposed v3 format).
	Promised map[string]map[string]string `json:"promised,omitempty"`
	Checksum checksum.Checksum `json:"checksum"`
}
the "v2" format is now the default and the "v1" format is largely unsupported, this makes things easier. Note that "checkpoint version" is largely an internal construct, is never mentioned or set explicitly anywhere
Inspect the kubelet configuration of the nodes: check the feature gate and usage of the new option.

Check `/var/lib/kubelet/cpu_manager_state` and look for the `promised` field.
My only true nack so far. We should stop directing users to peek into this file, which is already largely abused. It is not, never was, and won't be part of the API, yet it is consumed and considered a source of truth by quite a few users, and this already makes things hard when we need to change it, as in this very KEP.
So we would need a different mechanism, be it a metric or a new field in the podresources API or in any other official API.
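Editor's note: as one possible shape for the "metric" alternative, a hedged, self-contained sketch using the Prometheus Go client. The metric name, labels and values are purely illustrative and are not proposed anywhere in the KEP; a real kubelet implementation would use its own metrics framework.

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// promisedCPUs is a hypothetical gauge exposing, per container, the number of
// promised CPUs recorded by the CPU Manager.
var promisedCPUs = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubelet_cpu_manager_promised_cpus",
		Help: "Number of CPUs promised to a container at creation time.",
	},
	[]string{"pod", "container"},
)

func main() {
	prometheus.MustRegister(promisedCPUs)
	// In a real kubelet this would be updated whenever the checkpoint changes.
	promisedCPUs.WithLabelValues("demo-pod", "app").Set(2)

	http.Handle("/metrics", promhttp.Handler())
	fmt.Println("serving metrics on :9090/metrics")
	http.ListenAndServe(":9090", nil)
}
```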
In order to verify this feature is working, one should first inspect the kubelet configuration and ensure that all required feature gates are enabled as described in [Summary](#summary).

Then the user should create a Guaranteed QoS Pod with integer CPU requests, inspect `/var/lib/kubelet/cpu_manager_state` and check the promised CPU set for the created Pod. Upon creation, promised should be equal to the assigned CPUs. If it exists, it means the feature is enabled and working.
ditto
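Editor's note: a minimal Go sketch of the manual check described in the hunk above, assuming the v3 checkpoint layout shown earlier in this PR (a `promised` map keyed by pod UID and container name). The review comment above argues this file should not become the user-facing way to verify the feature, so treat this purely as an illustration of the current draft text.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// checkpoint mirrors the subset of the proposed v3 CPUManagerCheckpoint format
// needed for this check; both maps are keyed by pod UID, then container name.
type checkpoint struct {
	PolicyName    string                       `json:"policyName"`
	DefaultCPUSet string                       `json:"defaultCpuSet"`
	Entries       map[string]map[string]string `json:"entries,omitempty"`
	Promised      map[string]map[string]string `json:"promised,omitempty"`
}

func main() {
	data, err := os.ReadFile("/var/lib/kubelet/cpu_manager_state")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read checkpoint:", err)
		os.Exit(1)
	}
	var cp checkpoint
	if err := json.Unmarshal(data, &cp); err != nil {
		fmt.Fprintln(os.Stderr, "cannot parse checkpoint:", err)
		os.Exit(1)
	}
	if len(cp.Promised) == 0 {
		fmt.Println("no promised entries found: feature disabled or no Guaranteed pods created yet")
		return
	}
	for podUID, containers := range cp.Promised {
		for name, cpus := range containers {
			// On a freshly created pod, promised is expected to match the assigned set in entries.
			fmt.Printf("pod %s container %s: promised=%s assigned=%s\n", podUID, name, cpus, cp.Entries[podUID][name])
		}
	}
}
```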
@@ -0,0 +1,3 @@
kep-number: 5554
alpha:
  approver: "TBD"
Not sure you need to put a name and let the PRR team rebalance. Let's check if this is sufficient to make sure the PRR team is aware of this KEP; if so, we can surely keep TBD.