feat: support new topologySpread scheduling constraints #852
Conversation
Force-pushed from aed9f49 to 54af319.
Pull Request Test Coverage Report for Build 13139823506
💛 - Coveralls
Force-pushed from 4329131 to f0127f2.
Force-pushed from b23a605 to 0408162.
Force-pushed from 0408162 to 6673b5b.
Holding to include
Force-pushed from dae9cfd to 1d46f34.
Force-pushed from 1d46f34 to 7b3827f.
Hi @jmdeal, is there anything blocking this, or any way I can help move this along (testing a custom build or something)?
/assign @njtran
@engedaam: GitHub didn't allow me to assign the following users: njtarn. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign @njtran
/unassign njtran |
/assign jonathan-innis |
And I think I figured out why! Turns out that we really haven't been doing our scheduling benchmarking correctly in a while! See #1930
Force-pushed from 5958693 to 01de6ac.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: jmdeal
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
for domain := range domains {
	domainCounts[domain] = 0
}
func NewTopologyGroup(topologyType TopologyType, topologyKey string, pod *v1.Pod, namespaces sets.Set[string], labelSelector *metav1.LabelSelector, maxSkew int32, minDomains *int32, taintPolicy *v1.NodeInclusionPolicy, affinityPolicy *v1.NodeInclusionPolicy, domainGroup TopologyDomainGroup) *TopologyGroup {
func NewTopologyGroup(topologyType TopologyType, topologyKey string, pod *v1.Pod, namespaces sets.Set[string], labelSelector *metav1.LabelSelector, maxSkew int32, minDomains *int32, taintPolicy *v1.NodeInclusionPolicy, affinityPolicy *v1.NodeInclusionPolicy, domainGroup TopologyDomainGroup) *TopologyGroup {
func NewTopologyGroup(topologyType TopologyType, topologyKey string, pod *v1.Pod, namespaces sets.Set[string], labelSelector *metav1.LabelSelector, maxSkew int32, minDomains *int32, taintPolicy, affinityPolicy *v1.NodeInclusionPolicy, domainGroup TopologyDomainGroup) *TopologyGroup {
the tiniest nit known to man
I've reformatted to put each argument on its own line, since this is beginning to wrap off of my screen. Given that, I think leaving both types in is more readable, but let me know what you think. Personally, I prefer always including the types since I find it a little more readable, but I don't feel too strongly.
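For reference, the one-argument-per-line formatting being described would look roughly like this, keeping the explicit type on both policy parameters as in the diff above; this is a formatting illustration only, not necessarily the final signature in the PR:

```go
func NewTopologyGroup(
	topologyType TopologyType,
	topologyKey string,
	pod *v1.Pod,
	namespaces sets.Set[string],
	labelSelector *metav1.LabelSelector,
	maxSkew int32,
	minDomains *int32,
	taintPolicy *v1.NodeInclusionPolicy,
	affinityPolicy *v1.NodeInclusionPolicy,
	domainGroup TopologyDomainGroup,
) *TopologyGroup {
	// ... construct and return the TopologyGroup ...
}
```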
First pass -- I didn't look at everything and have some starting questions
/remove-label blocked
@rschalo: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 09e5867 to a00643d.
}

// otherwise, we need to match the combination of label selector and any term of the required node affinities since
// those terms are OR'd together
var filter TopologyNodeFilter
filter := TopologyNodeFilter{
We don't attach tolerations in this codepath? Seems wrong that we wouldn't attach tolerations here -- also, since there isn't a test failing for this scenario, it seems like we need to add one.
@@ -180,9 +180,9 @@ func HasDoNotDisrupt(pod *corev1.Pod) bool {
	return pod.Annotations[v1.DoNotDisruptAnnotationKey] == "true"
}

// ToleratesDisruptedNoScheduleTaint returns true if the pod tolerates karpenter.sh/disrupted:NoSchedule taint
// ToleratesDisruptionNoScheduleTaint returns true if the pod tolerates karpenter.sh/disruption:NoSchedule=Disrupting taint
This doesn't look right -- I think we are still using karpenter.sh/disrupted:NoSchedule
"sigs.k8s.io/karpenter/pkg/scheduling" | ||
) | ||
|
||
// TopologyDomainGroup tracks the domains for a single topology. Additionally, it tracks the taints associated with |
It just tracks the eligible domains, right?
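For readers following along, here is a minimal sketch of the shape the type appears to have, inferred from the Insert and ForEachToleratedDomain signatures shown below; the package name and the actual definition in the PR may differ:

```go
package scheduling

import v1 "k8s.io/api/core/v1"

// Sketch only: TopologyDomainGroup maps each eligible topology domain (e.g. a
// zone value) to the groups of node taints under which that domain was
// discovered. A domain recorded with an empty taint group is usable by any
// pod; otherwise a pod must tolerate at least one recorded group for the
// domain to count toward the spread.
type TopologyDomainGroup map[string][][]v1.Taint
```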
func (t TopologyDomainGroup) Insert(domain string, taints ...v1.Taint) {
	// Note: This could potentially be improved by removing any set of which the new set of taints is a proper subset.
	// Currently this is only handled when the incoming set is the empty set.
	if _, ok := t[domain]; !ok || len(taints) == 0 {
It still took me a second to parse this -- it might be worth moving that comment from down below up here, or maybe adding a second one, just because I didn't get it until I saw the comment down below.
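A rough sketch of what Insert could look like with that comment hoisted above the condition, under the same assumed map layout as the sketch above; this is illustrative, not the PR's exact code:

```go
// Insert records the taints under which a domain was discovered.
func (t TopologyDomainGroup) Insert(domain string, taints ...v1.Taint) {
	// An empty incoming taint set means the domain is usable without any
	// tolerations at all, so it can replace everything recorded so far.
	// (A fuller optimization would also drop any stored set that the incoming
	// set is a proper subset of; only the empty-set case is handled here.)
	if _, ok := t[domain]; !ok || len(taints) == 0 {
		t[domain] = [][]v1.Taint{taints}
		return
	}
	t[domain] = append(t[domain], taints)
}
```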

// ForEachToleratedDomain calls f on each domain tracked by the TopologyDomainGroup which are also tolerated by the provided pod.
func (t TopologyDomainGroup) ForEachToleratedDomain(pod *v1.Pod, f func(domain string)) {
	for domain, taintGroups := range t {
Consider marking down that we could potentially improve this code by just storing the domains that map to certain cached tolerations -- small improvement but maybe worth doing eventually
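Building on the same assumed map layout, this is roughly what the tolerated-domain iteration could look like; toleratesAll is a hypothetical helper (the PR may use an existing scheduling utility instead), and the caching idea above would amount to memoizing this check per distinct taint group:

```go
// ForEachToleratedDomain calls f on each tracked domain whose recorded taints
// are tolerated by the provided pod.
func (t TopologyDomainGroup) ForEachToleratedDomain(pod *v1.Pod, f func(domain string)) {
	for domain, taintGroups := range t {
		for _, taints := range taintGroups {
			// The pod only needs to tolerate one of the recorded taint groups.
			if toleratesAll(pod.Spec.Tolerations, taints) {
				f(domain)
				break
			}
		}
	}
}

// toleratesAll (hypothetical helper) reports whether every taint is tolerated
// by at least one of the given tolerations.
func toleratesAll(tolerations []v1.Toleration, taints []v1.Taint) bool {
	for i := range taints {
		tolerated := false
		for j := range tolerations {
			if tolerations[j].ToleratesTaint(&taints[i]) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}
```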
@@ -278,6 +278,28 @@ func (t *Topology) countDomains(ctx context.Context, tg *TopologyGroup) error {
		pods = append(pods, podList.Items...)
	}

	// capture new domain values from existing nodes that may not have any pods selected by the topology group
	// scheduled to them already
	t.cluster.ForEachNode(func(n *state.StateNode) bool {
Noting down that we have plans to move this over into domainGroups since domainGroups should really capture all of the domains that are eligible across the cluster (including NodePools) rather than just the NodePools (which is what they are doing right now)
@@ -278,6 +278,28 @@ func (t *Topology) countDomains(ctx context.Context, tg *TopologyGroup) error {
		pods = append(pods, podList.Items...)
	}

	// capture new domain values from existing nodes that may not have any pods selected by the topology group
After some discussion with @jmdeal, we should consider passing state nodes directly into this function rather than using the t.cluster.ForEach and holding a read lock throughout the time
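A minimal sketch of that suggestion: the caller snapshots the state nodes once and passes them in, so the domain capture no longer runs under the cluster's read lock. The stateNodes parameter, n.Labels(), tg.Key, and tg.domains are assumptions about the surrounding code, not the PR's exact API:

```go
func (t *Topology) countDomains(ctx context.Context, tg *TopologyGroup, stateNodes []*state.StateNode) error {
	for _, n := range stateNodes {
		domain, ok := n.Labels()[tg.Key]
		if !ok {
			continue
		}
		// ensure we at least have a count of zero for this potentially new topology domain
		if _, ok := tg.domains[domain]; !ok {
			tg.domains[domain] = 0
		}
	}
	// ... existing pod-list counting logic would continue here ...
	return nil
}
```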
			return true
		}
		// ensure we at least have a count of zero for this potentially new topology domain
		if _, countExists := tg.domains[domain]; !countExists {
nit: Since this is more standard
if _, countExists := tg.domains[domain]; !countExists {
if _, ok := tg.domains[domain]; !ok {
@@ -323,7 +345,17 @@ func (t *Topology) countDomains(ctx context.Context, tg *TopologyGroup) error {
func (t *Topology) newForTopologies(p *corev1.Pod) []*TopologyGroup {
	var topologyGroups []*TopologyGroup
	for _, cs := range p.Spec.TopologySpreadConstraints {
		topologyGroups = append(topologyGroups, NewTopologyGroup(TopologyTypeSpread, cs.TopologyKey, p, sets.New(p.Namespace), cs.LabelSelector, cs.MaxSkew, cs.MinDomains, t.domains[cs.TopologyKey]))
		for _, key := range cs.MatchLabelKeys {
			if value, ok := p.ObjectMeta.Labels[key]; ok {
if value, ok := p.ObjectMeta.Labels[key]; ok {
if value, ok := p.Labels[key]; ok {
super super minor nit
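For context, a sketch of how matchLabelKeys is commonly folded into the constraint's label selector, mirroring kube-scheduler's documented behavior; this fragment would sit inside the loop shown above, and the PR's exact code may differ:

```go
// Copy the constraint's selector and AND in the pod's current value for each
// listed label key; pods without that label simply skip the key.
selector := cs.LabelSelector.DeepCopy()
if selector == nil {
	selector = &metav1.LabelSelector{}
}
for _, key := range cs.MatchLabelKeys {
	if value, ok := p.Labels[key]; ok {
		selector.MatchExpressions = append(selector.MatchExpressions, metav1.LabelSelectorRequirement{
			Key:      key,
			Operator: metav1.LabelSelectorOpIn,
			Values:   []string{value},
		})
	}
}
// selector (rather than cs.LabelSelector) would then be passed to NewTopologyGroup.
```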
@@ -323,7 +345,17 @@ func (t *Topology) countDomains(ctx context.Context, tg *TopologyGroup) error {
func (t *Topology) newForTopologies(p *corev1.Pod) []*TopologyGroup {
	var topologyGroups []*TopologyGroup
	for _, cs := range p.Spec.TopologySpreadConstraints {
for _, cs := range p.Spec.TopologySpreadConstraints {
for _, tsc := range p.Spec.TopologySpreadConstraints {
this is also minor but why not?
It("should balance pods across a label when discovered from the provisioner (NodeTaintsPolicy=ignore)", func() { | ||
const spreadLabel = "fake-label" | ||
const taintKey = "taint-key" | ||
nodePool.Spec.Template.Spec.Requirements = append(nodePool.Spec.Template.Spec.Requirements, v1.NodeSelectorRequirementWithMinValues{ |
For the tests that are validating schedulability of a pod against a nodepool or node, it might be worth it to add taints and tolerations for the nodepool/node that we are able to schedule to as well so that we ensure that we are exercising the tolerates checks that we have in the topology code
I could also see an argument that this is tested below but it probably doesn't hurt -- I could go either way
	// should fail to schedule both pods, one pod is scheduled to domain "foo" but the other can't be scheduled to domain "bar"
	ExpectSkew(ctx, env.Client, "default", &topology[0]).To(ConsistOf(1))
})
It("should balance pods across a label when discovered from the provisioner (NodeTaintsPolicy=honor)", func() {
It("should balance pods across a label when discovered from the provisioner (NodeTaintsPolicy=honor)", func() { | |
It("should balance pods across a label when discovered from the NodePool (NodeTaintsPolicy=honor)", func() { |
consider replacing the use of the word provisioner throughout this file with NodePool :P We've got some legacy wording that got kept around due to how long this has been open.
ExpectApplied(ctx, env.Client, nodePools[0], nodePools[1])
ExpectProvisioned(ctx, env.Client, cluster, cloudProvider, prov, pods...)

// Expect 3 total nodes provisioned, 2 pods schedule to foo, 1 to bar, and 1 to baz
nit: consider adding a bit more to this comment (1 pod that tolerated the nodepool-1 taint on foo, 1 that tolerated the nodepool-1 taint on bar, etc.)
ExpectProvisioned(ctx, env.Client, cluster, cloudProvider, prov, pods...)

// should schedule all pods to domain "foo", ignoring bar since pods don't tolerate
ExpectSkew(ctx, env.Client, "default", &topology[0]).To(ConsistOf(2))
What about a test for a set of pods that have two different tolerations and two different NodePools with mutually exclusive taints?
ExpectReconcileSucceeded(ctx, nodeStateController, client.ObjectKeyFromObject(node2))

ExpectProvisioned(ctx, env.Client, cluster, cloudProvider, prov)
ignore := corev1.NodeInclusionPolicyHonor
nit: This naming is wrong from the copy
This is awesome work! This is definitely getting close! I think it's mostly a few small things -- we should write down the things that we want to refactor here, since there were some ideas thrown out about how we could move existing nodes into
Fixes #430
Description
This PR adds support for the following topology spread constraint fields (a usage sketch follows the list):
matchLabelKeys
nodeAffinityPolicy
nodeTaintsPolicy
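For anyone trying these out, a hedged sketch of a constraint that exercises all three fields; the field names come from k8s.io/api/core/v1, while the pod variable and label values are illustrative only:

```go
honor := corev1.NodeInclusionPolicyHonor
ignore := corev1.NodeInclusionPolicyIgnore
pod.Spec.TopologySpreadConstraints = []corev1.TopologySpreadConstraint{{
	MaxSkew:           1,
	TopologyKey:       corev1.LabelTopologyZone,
	WhenUnsatisfiable: corev1.DoNotSchedule,
	LabelSelector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "inflate"}},
	// fold the pod's own value for this label into the selector at scheduling time
	MatchLabelKeys: []string{"pod-template-hash"},
	// only count domains on nodes the pod's required node affinity allows
	NodeAffinityPolicy: &honor,
	// count domains regardless of node taints
	NodeTaintsPolicy: &ignore,
}}
```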
How was this change tested?
make test
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.