
UPSTREAM: <carry>: filter daemonset nodes by namespace node selectors #18989

Merged

merged 1 commit into openshift:master from controller-09-ds on Apr 12, 2018

Conversation

@deads2k (Contributor) commented Mar 15, 2018

Something to talk about. This places a shim in the upstream controller to check nodes against node selectors to avoid creating extra pods. We can talk about the expense of making the check compared to alternatives, but something concrete to show relative complexity may be useful.

Changing the default policy for all openshift installations seems like a comparatively big deal. The cost of this patch is only borne by those who enable the node limiting features.
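For the sake of discussion, here is a rough, self-contained sketch of what the check amounts to (illustrative only; the real hook and names are in the diff below, and the carry also has to account for the cluster-wide default selector):

```go
package daemon

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// nodeAllowedByNamespaceSelector is a simplified illustration of the shim:
// before the DS controller creates a daemon pod for a node, also require the
// node to satisfy the namespace's project node selector, so pods that the
// admission plugin / kubelet would reject are never created in the first place.
func nodeAllowedByNamespaceSelector(node *v1.Node, ns *v1.Namespace) (bool, error) {
	selectorString, ok := ns.Annotations["openshift.io/node-selector"]
	if !ok || len(selectorString) == 0 {
		// no project selector on the namespace: nothing extra to enforce
		return true, nil
	}
	selector, err := labels.Parse(selectorString)
	if err != nil {
		return false, err
	}
	return selector.Matches(labels.Set(node.Labels)), nil
}
```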

@simo5 @mfojtik @tnozicka @liggitt

@openshift-merge-robot openshift-merge-robot added the vendor-update Touching vendor dir or related files label Mar 15, 2018
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 15, 2018
@tnozicka (Contributor) left a comment

comments + this needs integration test

@@ -1311,6 +1318,14 @@ func (dsc *DaemonSetsController) nodeShouldRunDaemonPod(node *v1.Node, ds *exten
}
}
}

if matches, matchErr := dsc.namespaceNodeSelectorMatches(node, ds); matchErr != nil {
err = matchErr
Contributor

return the error immediately; someone can add more code upstream after this and the error could get silently ignored

}
}
}

Contributor

you need to check the value from master config as well

@tnozicka (Contributor)

Given that DSs are restricted to admins and we clearly document that they should disable the project default node selector in the namespace where they create the DS, I am wondering whether it's worth carrying this patch for several releases.

I was hoping we could find a way to let the DS controller see how the pod would look post-admission and take that into account when matching the selector in general, instead of hard-coding it into the carry. Or something else that could be taken upstream rather than coupled to OpenShift.

I hope we run the upstream e2e suite in our CI.

@tnozicka (Contributor) commented Mar 15, 2018

If someone disables the nodeSelector admission plugin, will this refuse to schedule pods to nodes where it should have scheduled them?

@deads2k (Contributor Author) commented Mar 15, 2018

If someone disables the nodeSelector admission plugin, will this refuse to schedule pods to nodes where it should have scheduled them?

Yes. That's the only time I can think of this optimizing "wrong". It is possible to plumb through from the config we have, but I think the likelihood of that is extremely low.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 16, 2018
@deads2k (Contributor Author) commented Mar 16, 2018

comments + this needs integration test

Because of the node requirements in the controller, this is really sticky to write an integration test for. How about unit tests for the matching logic?

@sjenning (Contributor)

What problem does this fix? DS not respecting default project node selectors?

@deads2k (Contributor Author) commented Mar 16, 2018

What problem does this fix? DS not respecting default project node selectors?

Yeah. It's causing a lot of pod and event traffic. This doesn't remove any safety, it is purely an optimization in the DS controller. Upstream they are moving daemonset scheduling rules to the scheduler and when they do, the scheduler node selector that @aveshagarwal has been working on should be enforced.

As part of moving that node selector upstream, I'm hoping for @aveshagarwal to have a plan for us to migrate. This issue makes it slightly more important, but we should remove our pre-existing node selector code.

@smarterclayton (Contributor)

What happens when upstream moves to using the scheduler?

@deads2k (Contributor Author) commented Mar 17, 2018

What happens when upstream moves to using the scheduler?

Our carry conflicts and we do this in a similar spot. At that point it will be very obvious to them that they have the same need to respect node selectors.

@aveshagarwal (Contributor)

Would it still not create an issue where a namespace does not specify any node selectors, but cluster-level (global) default node selectors are set?

return false
}
}
case !ok && len(dsc.defaultNodeSelectorString) > 0:
Contributor Author

@aveshagarwal I think the case is handled right here.
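For what it's worth, a hedged sketch of a unit test for exactly this branch (assuming it lives in the daemon package so it can construct a DaemonSetsController with only defaultNodeSelectorString set, and that nodeSelectorMatches consults nothing else; imports for testing, core v1, and metav1 are elided):

```go
func TestClusterDefaultSelectorAppliesWithoutNamespaceAnnotation(t *testing.T) {
	node := &v1.Node{
		ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"region": "infra"}},
	}
	// namespace without openshift.io/node-selector or the scheduler annotation
	ns := &v1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: "test"}}

	dsc := &DaemonSetsController{defaultNodeSelectorString: "region=primary"}
	if dsc.nodeSelectorMatches(node, ns) {
		t.Errorf("expected node with region=infra to be filtered out by cluster default region=primary")
	}
}
```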

Contributor

@deads so it seems that it might work with OpenShift's project node selector (nodeenv), but it would not work with the upstream podnodeselector plugin, which specifies the cluster-level default selector in its config file, right?

Contributor Author

@deads so it seems that it might work with OpenShift's project node selector (nodeenv), but it would not work with the upstream podnodeselector plugin, which specifies the cluster-level default selector in its config file, right?

Give me a link and I can plumb that through. In concept this is still ok.


Contributor Author

https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/plugin/pkg/admission/podnodeselector/admission.go#L257

So it's a semantically different key in a single shared map separated only by namespace validation that prevents capital letters in namespace names? That's really weird. Plumb-able, but really weird.

Contributor

Sorry for being dense, but I am not sure I understand the weirdness, so if you could explain it more, I would be happy to fix it :-).

Contributor Author

So it's a semantically different key in a single shared map separated only by namespace validation that prevents capital letters in namespace names? That's really weird.

Using one map with two semantically different sets of key/value pairs is weird. It would be better to have a separate field on the struct that indicates that something is a default instead of an override.
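Roughly, something like this hypothetical shape (purely illustrative, not the actual upstream type):

```go
// purely illustrative: keep the cluster-wide default in its own field instead
// of hiding it in the same map as the per-namespace overrides
type podNodeSelectorPluginConfig struct {
	// applied to namespaces that have no entry of their own
	ClusterDefaultNodeSelector string
	// per-namespace overrides, keyed by namespace name
	NamespaceSelectors map[string]string
}
```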

@deads2k (Contributor Author) commented Mar 22, 2018

Would it still not create an issue where a ns does not specify any node selectors, but cluster level (global) default node selectors are set?

I think that case is handled here: https://github.com/openshift/origin/pull/18989/files#r176407073. There's some plumbing code to take it from the config.

@smarterclayton another example of "config as status"

@tnozicka (Contributor)

/hold
(temporarily, so someone doesn't accidentally merge this)

I'd like us to talk this through on the architecture call before we commit to it, and talk it through upstream as well, because even with the move to the scheduler I think they will still have the same issue, just with a better consequence: the pod doesn't get to the kubelet that way, but it will still be created for a node it shouldn't be created for; it just won't be restarted in a loop, only stuck in a pending state.

(At this moment, to me, this seems like our only reasonable option to deal with it.)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 22, 2018
@deads2k (Contributor Author) commented Mar 23, 2018

(At this moment, to me, this seems like our only reasonable option to deal with it.)

I don't think that we should let the RBAC mutation live past 3.9.0. Behavioral drift from the upstream as observed by the end user seems far worse than this optimization on which pods are created and which aren't.

@tnozicka (Contributor) commented Mar 23, 2018

I don't think that we should let the RBAC mutation live past 3.9.0. Behavioral drift from the upstream as observed by the end user seems far worse than this optimization on which pods are created and which aren't.

SIG Apps is on Monday and the arch call is on Tuesday; I don't think this will delay us for 3.9.1.

I'd prefer to keep the RBAC discussion separate, as solving this is needed anyway, and I think we might reach agreement more easily.

@deads2k (Contributor Author) commented Mar 28, 2018

I'd like us to talk this through on the architecture call before we commit to it, and talk it through upstream as well, because even with the move to the scheduler I think they will still have the same issue, just with a better consequence: the pod doesn't get to the kubelet that way, but it will still be created for a node it shouldn't be created for; it just won't be restarted in a loop, only stuck in a pending state.

(At this moment, to me, this seems like our only reasonable option to deal with it.)

This pull is an optimization of the normal creation with a clear path to pushing the changes upstream and eventual re-unification with upstream. The previous "fix" provides neither of those things.

@tnozicka are you still waiting for something here? Do you see a practical alternative that is available today, matches the upstream need for moving to core kube, and allows reunification with upstream code and experience? Barring a strong argument against, I expect to remove the hold tomorrow.

clientset "k8s.io/client-go/kubernetes"
)

func NewNodeSelectorAwareDaemonSetsController(defaultNodeSelector string, namepaceInformer coreinformers.NamespaceInformer, daemonSetInformer extensionsinformers.DaemonSetInformer, historyInformer appsinformers.ControllerRevisionInformer, podInformer coreinformers.PodInformer, nodeInformer coreinformers.NodeInformer, kubeClient clientset.Interface) (*DaemonSetsController, error) {
Contributor

s/defaultNodeSelector/defaultNodeSelectorString/ so it doesn't get confusing with naming when you start assigning that on L21-L23

ns, err := dsc.namespaceLister.Get(ds.Namespace)
if apierrors.IsNotFound(err) {
return false, err
}
Contributor

please handle any other errors returned
(I have a feeling that informers produce few or no other errors, but handling them explicitly is safer.)
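For example, something along these lines (a sketch only, using the names from the diff above; whether to fail open or closed on the error is a separate question):

```go
ns, err := dsc.namespaceLister.Get(ds.Namespace)
if apierrors.IsNotFound(err) {
	// the namespace is gone: there is nothing to match against
	return false, err
}
if err != nil {
	// surface unexpected lister errors explicitly instead of swallowing them
	return false, err
}
```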

}
}

schedulerNodeSelector, ok := ns.Annotations["scheduler.alpha.kubernetes.io/node-selector"]
Contributor

using

var NamespaceNodeSelectors = []string{"scheduler.alpha.kubernetes.io/node-selector"}

might help us avoid forgetting to adjust this when it moves from alpha to beta
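i.e. something like this (names are illustrative, not an existing variable in the carry):

```go
// keep the annotation keys in one place so the alpha -> beta rename is a
// one-line change
var namespaceNodeSelectorKeys = []string{"scheduler.alpha.kubernetes.io/node-selector"}

func namespaceNodeSelectorAnnotation(ns *v1.Namespace) (string, bool) {
	for _, key := range namespaceNodeSelectorKeys {
		if value, ok := ns.Annotations[key]; ok {
			return value, true
		}
	}
	return "", false
}
```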

}

func (dsc *DaemonSetsController) nodeSelectorMatches(node *v1.Node, ns *v1.Namespace) bool {
projectNodeSelector, ok := ns.Annotations["openshift.io/node-selector"]
Contributor

any reason not to use the constant?

ProjectNodeSelector = "openshift.io/node-selector"

return false
}
}
}
Contributor

we need to handle this as well I guess (as discussed above)

podNodeSelectorPluginConfig:
 clusterDefaultNodeSelector: <node-selectors-labels>
 namespace1: <node-selectors-labels>
 namespace2: <node-selectors-labels>
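For reference, a rough sketch of what pulling the cluster default out of that plugin config could look like (the field names and file handling here are assumptions based on the snippet above, not the actual origin wiring):

```go
package daemon

import (
	"io/ioutil"

	yaml "gopkg.in/yaml.v2"
)

// clusterDefaultFromPluginConfig reads the podnodeselector admission plugin
// config and returns its clusterDefaultNodeSelector entry, if any.
func clusterDefaultFromPluginConfig(path string) (string, error) {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return "", err
	}
	var cfg struct {
		PodNodeSelectorPluginConfig map[string]string `yaml:"podNodeSelectorPluginConfig"`
	}
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return "", err
	}
	return cfg.PodNodeSelectorPluginConfig["clusterDefaultNodeSelector"], nil
}
```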

@tnozicka (Contributor)

/hold cancel
@deads2k given the situation, I think this is the best and likely also the only fix we can do now to help the situation in OpenShift until this is sorted out upstream (kubernetes/kubernetes#61886)

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 29, 2018
@@ -33,7 +33,9 @@ func startDaemonSetController(ctx ControllerContext) (bool, error) {
if !ctx.AvailableResources[schema.GroupVersionResource{Group: "extensions", Version: "v1beta1", Resource: "daemonsets"}] {
return false, nil
}
dsc, err := daemon.NewDaemonSetsController(
dsc, err := daemon.NewNodeSelectorAwareDaemonSetsController(
Contributor

@deads2k could we have a test that this is actually plugged-in and stays that way? considering we don't have an integration one

Contributor Author

@deads2k could we have a test that this is actually plugged-in and stays that way? considering we don't have an integration one

I commented back here: #18989 (comment). The node dependency actually makes it impractical to add integration tests for it.

@deads2k deads2k force-pushed the controller-09-ds branch from 43b0f5e to 7b30fd9 on April 2, 2018 13:12
@deads2k (Contributor Author) commented Apr 2, 2018

@tnozicka comments addressed. I've stubbed in where the kube plugin config can be detected and did all the later plumbing, but the config itself looks like it's been special cased in multiple locations. I think we can open an issue and @aveshagarwal probably has an example config he can run through and add wiring for.

@deads2k (Contributor Author) commented Apr 4, 2018

@tnozicka ptal. I want to make it for 3.9.1

@tnozicka (Contributor) commented Apr 4, 2018

@deads2k as it happens I am just reviewing it. I was just about to ask you about the TODO in applyOpenShiftConfigKubeDefaultProjectSelector. Do we ship it for 3.9.1 and fix it later?

}

// this is an optimization. It can be filled in later. Looks like there are several special cases for this plugin upstream
// TODO find this
Contributor

please make an issue to track it

Contributor Author

please make an issue to track it

#19250

if apierrors.IsNotFound(err) {
return false, err
}
// if we had any error, default to the safe option of creating a pod for the node.
Contributor

I think the safe option is not to create the Pod rather than creating it somewhere it shouldn't be. You can always create it in the next sync but undoing seems worse.

Also you are ignoring that error so we won't know where it's broken.

Contributor Author

I think the safe option is not to create the Pod rather than creating it somewhere it shouldn't be. You can always create it in the next sync but undoing seems worse.

Think of this as an optimization. If the method just returned true every time, we'd function correctly and be really slow. What we want is to say false as often as we can without ever saying it falsely.

Contributor

I tend to think about it as a correctness and security issue (although the kubelet, being the last line of defense here, stops it). The project default node selector is a security feature preventing pods from being scheduled to some nodes. If we schedule them there just because we have encountered an error, that seems wrong.

What we want is to say false as often as we can without ever saying it falsely.

That would result in creating pods for restricted nodes and deleting them when we stop getting errors.

Contributor Author

I tend to think about it as a correctness and security issue (although the kubelet, being the last line of defense here, stops it).

I don't think that we should think of our controllers as agents of security. The security feature is that kubelets reject pods that aren't allowed to run on them. The optimization in the controller is to avoid creating pods that will never succeed.

The controller can never be that agent of security, since the point of action is the kubelet.

Contributor

  1. According to Clayton we restrict /bind to nodeName, so we shouldn't let the DS controller bind to nodes it shouldn't target.

  2. Correctness: you shall not create pods for nodes that are not targeted by the DS (after the implications of node selector admission), and this PR is actually about preventing that.

Take the extreme case. You have a 1000-node cluster and a default project nodeSelector targeting 5 of them. Because of an error you just created 1000 Pods instead of 5, and you are going to delete those 995 when the error goes away. (Multiply that by the number of DSs in the namespace/cluster.) That doesn't seem right. It puts unwanted load on etcd, the scheduler, ...

I see this PR as fixing correctness, not an optimization; that's likely why our opinions differ in this particular case.

Contributor Author

Correctness: you shall not create pods for nodes that are not targeted by the DS (after the implications of node selector admission), and this PR is actually about preventing that.

That is the wrong bar of correctness in this case. If you were authoring the upstream DS controller, maybe. We're shimming in an optimization, and the DS controller's correctness today is "create all the pods I might need". Changing that here would cause incompatible carry behavior.

We have to fail to true.

Contributor

We have tests that ensure the kubelet rejects pods that do not match its selector. That is the security boundary. This filtering is an optimization.

@tnozicka (Contributor) commented Apr 6, 2018

Since I believe the upstream behaviour is broken here, I don't agree that we should fall back to upstream behavior when we encounter an error with the namespace lister; we should rather wait for the next sync. With this really being a corner case, I am OK to be outvoted here, and I'll ship it once @deads2k adds logging of the error.

}
// if we had any error, default to the safe option of creating a pod for the node.
if err != nil {
return true, nil
Contributor

@deads2k please log the error
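For example, one possible shape (a sketch, not the actual commit; utilruntime here would be k8s.io/apimachinery/pkg/util/runtime):

```go
// keep the fail-open behavior, but record why we fell back to it
if err != nil {
	utilruntime.HandleError(fmt.Errorf("defaulting to creating a daemon pod for node %q: %v", node.Name, err))
	return true, nil
}
```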

@deads2k (Contributor Author) commented Apr 6, 2018

comments addressed.

@deads2k (Contributor Author) commented Apr 6, 2018

/retest

@tnozicka (Contributor) left a comment

thanks @deads2k
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2018
@deads2k deads2k force-pushed the controller-09-ds branch from 5f7e001 to f74ad81 on April 9, 2018 12:43
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2018
@deads2k (Contributor Author) commented Apr 9, 2018

/retest

@tnozicka (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 10, 2018
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@deads2k (Contributor Author) commented Apr 10, 2018

/retest

1 similar comment
@deads2k (Contributor Author) commented Apr 11, 2018

/retest

@deads2k (Contributor Author) commented Apr 11, 2018

/test all

@deads2k (Contributor Author) commented Apr 11, 2018

/retest

@openshift-ci-robot commented Apr 11, 2018

@deads2k: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/extended_networking_minimal 3a9adbe link /test extended_networking_minimal

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@deads2k (Contributor Author) commented Apr 12, 2018

/retest

@openshift-merge-robot openshift-merge-robot merged commit 516f31f into openshift:master Apr 12, 2018
@deads2k deads2k deleted the controller-09-ds branch July 3, 2018 17:47
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. vendor-update Touching vendor dir or related files

8 participants