
UPSTREAM: <carry>: filter daemonset nodes by namespace node selectors #18989

Merged

merged 1 commit into openshift:master from controller-09-ds on Apr 12, 2018

Conversation

@deads2k (Contributor) commented Mar 15, 2018

Something to talk about. This places a shim in the upstream controller to check nodes against node selectors to avoid creating extra pods. We can talk about the expense of making the check compared to alternatives, but something concrete to show relative complexity may be useful.

Changing the default policy for all openshift installations seems like a comparatively big deal. The cost of this patch is only borne by those who enable the node limiting features.
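For the sake of discussion, here is a rough, self-contained sketch of what the check amounts to (illustrative only; the real hook and names are in the diff below, and the carry also has to account for the cluster-wide default selector):

```go
package daemon

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// nodeAllowedByNamespaceSelector is a simplified illustration of the shim:
// before the DS controller creates a daemon pod for a node, also require the
// node to satisfy the namespace's project node selector, so pods that the
// admission plugin / kubelet would reject are never created in the first place.
func nodeAllowedByNamespaceSelector(node *v1.Node, ns *v1.Namespace) (bool, error) {
	selectorString, ok := ns.Annotations["openshift.io/node-selector"]
	if !ok || len(selectorString) == 0 {
		// no project selector on the namespace: nothing extra to enforce
		return true, nil
	}
	selector, err := labels.Parse(selectorString)
	if err != nil {
		return false, err
	}
	return selector.Matches(labels.Set(node.Labels)), nil
}
```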

@simo5 @mfojtik @tnozicka @liggitt

@openshift-merge-robot openshift-merge-robot added the vendor-update Touching vendor dir or related files label Mar 15, 2018
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 15, 2018
@tnozicka (Contributor) left a comment

comments + this needs integration test

@@ -1311,6 +1318,14 @@ func (dsc *DaemonSetsController) nodeShouldRunDaemonPod(node *v1.Node, ds *exten
}
}
}

if matches, matchErr := dsc.namespaceNodeSelectorMatches(node, ds); matchErr != nil {
err = matchErr
Contributor

return the error immediately; someone can add more code upstream after this and the error could get silently ignored

}
}
}

Contributor

you need to check the value from master config as well

@tnozicka (Contributor)

Given that DSs are restricted to admins and we clearly document that they should disable the project default node selector in the namespace where they create the DS, I am wondering whether it's worth carrying this patch for several releases.

I was hoping we could find a way to let the DS controller see how the pod would look post-admission and take that into account when matching the selector in general, instead of hard-coding it into the carry. Or something else that could be taken upstream rather than coupled to OpenShift.

I hope we run the upstream e2e suite in our CI.

@tnozicka (Contributor) commented Mar 15, 2018

If someone disables the nodeSelector admission plugin, will this refuse to schedule pods to nodes where it should have scheduled them?

@deads2k (Contributor Author) commented Mar 15, 2018

If someone disables the nodeSelector admission plugin, will this refuse to schedule pods to nodes where it should have scheduled them?

Yes. That's the only time I can think of this optimizing "wrong". It is possible to plumb through from the config we have, but I think the likelihood of that is extremely low.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 16, 2018
@deads2k (Contributor Author) commented Mar 16, 2018

comments + this needs integration test

Because of the node requirements in the controller, this is really sticky to write an integration test for. How about unit tests for the matching logic?

@sjenning (Contributor)

What problem does this fix? DS not respecting default project node selectors?

@deads2k (Contributor Author) commented Mar 16, 2018

What problem does this fix? DS not respecting default project node selectors?

Yeah. It's causing a lot of pod and event traffic. This doesn't remove any safety, it is purely an optimization in the DS controller. Upstream they are moving daemonset scheduling rules to the scheduler and when they do, the scheduler node selector that @aveshagarwal has been working on should be enforced.

As part of moving that node selector upstream, I'm hoping for @aveshagarwal to have a plan for us to migrate. This issue makes it slightly more important, but we should remove our pre-existing node selector code.

@smarterclayton (Contributor)

What happens when upstream moves to using the scheduler?

@deads2k (Contributor Author) commented Mar 17, 2018

What happens when upstream moves to using the scheduler?

Our carry conflicts and we do this in a similar spot. At that point it will be very obvious to them that they have the same need to respect node selectors.

@aveshagarwal (Contributor)

Would it still not create an issue where a namespace does not specify any node selectors, but cluster-level (global) default node selectors are set?

return false
}
}
case !ok && len(dsc.defaultNodeSelectorString) > 0:
Contributor Author

@aveshagarwal I think the case is handled right here.
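For what it's worth, a hedged sketch of a unit test for exactly this branch (assuming it lives in the daemon package so it can construct a DaemonSetsController with only defaultNodeSelectorString set, and that nodeSelectorMatches consults nothing else; imports for testing, core v1, and metav1 are elided):

```go
func TestClusterDefaultSelectorAppliesWithoutNamespaceAnnotation(t *testing.T) {
	node := &v1.Node{
		ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"region": "infra"}},
	}
	// namespace without openshift.io/node-selector or the scheduler annotation
	ns := &v1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: "test"}}

	dsc := &DaemonSetsController{defaultNodeSelectorString: "region=primary"}
	if dsc.nodeSelectorMatches(node, ns) {
		t.Errorf("expected node with region=infra to be filtered out by cluster default region=primary")
	}
}
```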

Contributor

@deads so it seems that it might work with OpenShift's project node selector (nodeenv), but it would not work with the upstream podnodeselector plugin, which specifies the cluster-level default selector in its config file, right?

Contributor Author

@deads so it seems that it might work with OpenShift's project node selector (nodeenv), but it would not work with the upstream podnodeselector plugin, which specifies the cluster-level default selector in its config file, right?

Give me a link and I can plumb that through. In concept this is still ok.


Contributor Author

https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/plugin/pkg/admission/podnodeselector/admission.go#L257

So it's a semantically different key in a single shared map separated only by namespace validation that prevents capital letters in namespace names? That's really weird. Plumb-able, but really weird.

Contributor

Sorry for being dense, but I am not sure I understand the weirdness, so if you could explain it more, I would be happy to fix it :-).

Contributor Author

So it's a semantically different key in a single shared map separated only by namespace validation that prevents capital letters in namespace names? That's really weird.

Using one map with two semantically different sets of key/value pairs is weird. It would be better to have a separate field on the struct that indicates that something is a default instead of an override.
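Roughly, something like this hypothetical shape (purely illustrative, not the actual upstream type):

```go
// purely illustrative: keep the cluster-wide default in its own field instead
// of hiding it in the same map as the per-namespace overrides
type podNodeSelectorPluginConfig struct {
	// applied to namespaces that have no entry of their own
	ClusterDefaultNodeSelector string
	// per-namespace overrides, keyed by namespace name
	NamespaceSelectors map[string]string
}
```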

@deads2k (Contributor Author) commented Mar 22, 2018

Would it still not create an issue where a ns does not specify any node selectors, but cluster level (global) default node selectors are set?

I think that case is handled here: https://github.com/openshift/origin/pull/18989/files#r176407073. There's some plumbing code to take it from the config.

@smarterclayton another example of "config as status"

@tnozicka (Contributor)

/hold
(temporarily, so someone doesn't accidentally merge this)

I'd like us to talk this through on the architecture call before we commit to it, and talk it through upstream as well, because even with the move to the scheduler I think they will still have the same issue, just with a better consequence: the pod doesn't get to the kubelet that way, but it will still be created for a node it shouldn't be created for; it just won't be restarted in a loop, only stuck in a pending state.

(At this moment, to me, this seems like our only reasonable option to deal with it.)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 22, 2018
@deads2k (Contributor Author) commented Mar 23, 2018

(At this moment, to me, this seems like our only reasonable option to deal with it.)

I don't think that we should let the RBAC mutation live past 3.9.0. Behavioral drift from the upstream as observed by the end user seems far worse than this optimization on which pods are created and which aren't.

@tnozicka (Contributor) commented Mar 23, 2018

I don't think that we should let the RBAC mutation live past 3.9.0. Behavioral drift from the upstream as observed by the end user seems far worse than this optimization on which pods are created and which aren't.

SIG Apps is on Monday and the arch call is on Tuesday; I don't think this will delay us for 3.9.1.

I'd prefer to keep the RBAC discussion separate, as solving this is needed anyway, and I think we might reach agreement more easily.

@deads2k (Contributor Author) commented Mar 28, 2018

I'd like us to talk this through on the architecture call before we commit to it, and talk it through upstream as well, because even with the move to the scheduler I think they will still have the same issue, just with a better consequence: the pod doesn't get to the kubelet that way, but it will still be created for a node it shouldn't be created for; it just won't be restarted in a loop, only stuck in a pending state.

(At this moment, to me, this seems like our only reasonable option to deal with it.)

This pull is an optimization of the normal creation with a clear path to pushing the changes upstream and eventual re-unification with upstream. The previous "fix" provides neither of those things.

@tnozicka are you still waiting for something here? Do you see a practical alternative that is available today, matches the upstream need for moving to core kube, and allows reunification with upstream code and experience? Barring a strong argument against, I expect to remove the hold tomorrow.

clientset "k8s.io/client-go/kubernetes"
)

func NewNodeSelectorAwareDaemonSetsController(defaultNodeSelector string, namepaceInformer coreinformers.NamespaceInformer, daemonSetInformer extensionsinformers.DaemonSetInformer, historyInformer appsinformers.ControllerRevisionInformer, podInformer coreinformers.PodInformer, nodeInformer coreinformers.NodeInformer, kubeClient clientset.Interface) (*DaemonSetsController, error) {
Contributor

s/defaultNodeSelector/defaultNodeSelectorString/ so it doesn't get confusing with naming when you start assigning that on L21-L23

ns, err := dsc.namespaceLister.Get(ds.Namespace)
if apierrors.IsNotFound(err) {
return false, err
}
Contributor

please handle any other errors returned
(I have a feeling that informers produce few or no other errors, but handling them explicitly is safer.)
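For example, something along these lines (a sketch only, using the names from the diff above; whether to fail open or closed on the error is a separate question):

```go
ns, err := dsc.namespaceLister.Get(ds.Namespace)
if apierrors.IsNotFound(err) {
	// the namespace is gone: there is nothing to match against
	return false, err
}
if err != nil {
	// surface unexpected lister errors explicitly instead of swallowing them
	return false, err
}
```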

}
}

schedulerNodeSelector, ok := ns.Annotations["scheduler.alpha.kubernetes.io/node-selector"]
Contributor

using

var NamespaceNodeSelectors = []string{"scheduler.alpha.kubernetes.io/node-selector"}

might help us avoid forgetting to adjust this when it moves from alpha to beta
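i.e. something like this (names are illustrative, not an existing variable in the carry):

```go
// keep the annotation keys in one place so the alpha -> beta rename is a
// one-line change
var namespaceNodeSelectorKeys = []string{"scheduler.alpha.kubernetes.io/node-selector"}

func namespaceNodeSelectorAnnotation(ns *v1.Namespace) (string, bool) {
	for _, key := range namespaceNodeSelectorKeys {
		if value, ok := ns.Annotations[key]; ok {
			return value, true
		}
	}
	return "", false
}
```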

}

func (dsc *DaemonSetsController) nodeSelectorMatches(node *v1.Node, ns *v1.Namespace) bool {
projectNodeSelector, ok := ns.Annotations["openshift.io/node-selector"]
Contributor

any reason not to use the constant?

ProjectNodeSelector = "openshift.io/node-selector"

return false
}
}
}
Contributor

we need to handle this as well I guess (as discussed above)

podNodeSelectorPluginConfig:
 clusterDefaultNodeSelector: <node-selectors-labels>
 namespace1: <node-selectors-labels>
 namespace2: <node-selectors-labels>
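For reference, a rough sketch of what pulling the cluster default out of that plugin config could look like (the field names and file handling here are assumptions based on the snippet above, not the actual origin wiring):

```go
package daemon

import (
	"io/ioutil"

	yaml "gopkg.in/yaml.v2"
)

// clusterDefaultFromPluginConfig reads the podnodeselector admission plugin
// config and returns its clusterDefaultNodeSelector entry, if any.
func clusterDefaultFromPluginConfig(path string) (string, error) {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return "", err
	}
	var cfg struct {
		PodNodeSelectorPluginConfig map[string]string `yaml:"podNodeSelectorPluginConfig"`
	}
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return "", err
	}
	return cfg.PodNodeSelectorPluginConfig["clusterDefaultNodeSelector"], nil
}
```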

@tnozicka (Contributor)

/hold cancel
@deads2k given the situation, I think this is the best and likely also the only fix we can do now to help the situation in OpenShift until this is sorted out upstream (kubernetes/kubernetes#61886)

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 29, 2018
@@ -33,7 +33,9 @@ func startDaemonSetController(ctx ControllerContext) (bool, error) {
if !ctx.AvailableResources[schema.GroupVersionResource{Group: "extensions", Version: "v1beta1", Resource: "daemonsets"}] {
return false, nil
}
dsc, err := daemon.NewDaemonSetsController(
dsc, err := daemon.NewNodeSelectorAwareDaemonSetsController(
Contributor

@deads2k could we have a test that this is actually plugged-in and stays that way? considering we don't have an integration one

Contributor Author

@deads2k could we have a test that this is actually plugged-in and stays that way? considering we don't have an integration one

I commented back here: #18989 (comment). The node dependency actually makes it impractical to add integration tests for it.

@deads2k deads2k force-pushed the controller-09-ds branch from 43b0f5e to 7b30fd9 on April 2, 2018 13:12
@deads2k (Contributor Author) commented Apr 2, 2018

@tnozicka comments addressed. I've stubbed in where the kube plugin config can be detected and did all the later plumbing, but the config itself looks like it's been special cased in multiple locations. I think we can open an issue and @aveshagarwal probably has an example config he can run through and add wiring for.

@deads2k (Contributor Author) commented Apr 4, 2018

@tnozicka ptal. I want to make it for 3.9.1

@tnozicka (Contributor) commented Apr 4, 2018

@deads2k as it happens I am just reviewing it. I was just about to ask you about the TODO in applyOpenShiftConfigKubeDefaultProjectSelector. Do we ship it for 3.9.1 and fix it later?

}

// this is an optimization. It can be filled in later. Looks like there are several special cases for this plugin upstream
// TODO find this
Contributor

please make an issue to track it

Contributor Author

please make an issue to track it

#19250

if apierrors.IsNotFound(err) {
return false, err
}
// if we had any error, default to the safe option of creating a pod for the node.
Contributor

I think the safe option is not to create the Pod rather than creating it somewhere it shouldn't be. You can always create it in the next sync but undoing seems worse.

Also you are ignoring that error so we won't know where it's broken.

Contributor Author

I think the safe option is not to create the Pod rather than creating it somewhere it shouldn't be. You can always create it in the next sync but undoing seems worse.

Think of this as an optimization. If the method just returned true every time, we'd function correctly and be really slow. What we want is to say false as often as we can without ever saying it falsely.

Contributor

I tend to think about it as a correctness and security issue (although the kubelet, being the last line of defense here, stops it). The project default node selector is a security feature preventing pods from being scheduled to some nodes. If we schedule them there just because we have encountered an error, that seems wrong.

What we want is to say false as often as we can without ever saying it falsely.

That would result in creating pods for restricted nodes and deleting them when we stop getting errors.

Contributor Author

I tend to think about it as a correctness and security issue (although the kubelet, being the last line of defense here, stops it).

I don't think that we should think of our controllers as agents of security. The security feature is that kubelets reject pods that aren't allowed to run on them. The optimization in the controller is to avoid creating pods that will never succeed.

The controller can never be that agent of security, since the point of action is the kubelet.

Contributor

  1. According to Clayton we restrict /bind to nodeName, so we shouldn't let the DS controller bind to nodes it shouldn't target.

  2. Correctness: you shall not create pods for nodes that are not targeted by the DS (after the implications of node selector admission), and this PR is actually about preventing that.

Take the extreme case. You have a 1000-node cluster and a default project nodeSelector targeting 5 of them. Because of an error you just created 1000 Pods instead of 5, and you are going to delete those 995 when the error goes away. (Multiply that by the number of DSs in the namespace/cluster.) That doesn't seem right. It puts unwanted load on etcd, the scheduler, ...

I see this PR as fixing correctness, not an optimization; that's likely why our opinions differ in this particular case.

Contributor Author

Correctness: you shall not create pods for nodes that are not targeted by the DS (after the implications of node selector admission), and this PR is actually about preventing that.

That is the wrong bar of correctness in this case. If you were authoring the upstream DS controller, maybe. We're shimming in an optimization, and the DS controller's correctness today is "create all the pods I might need". Changing that here would cause incompatible carry behavior.

We have to fail to true.

Contributor

We have tests that ensure the kubelet rejects pods that do not match its selector. That is the security boundary. This filtering is an optimization.

@tnozicka (Contributor) commented Apr 6, 2018

Since I believe the upstream behaviour is broken here, I don't agree that we should fall back to upstream behavior when we encounter an error with the namespace lister; we should rather wait for the next sync. With this really being a corner case, I am OK to be outvoted here, and I'll ship it once @deads2k adds logging of the error.

}
// if we had any error, default to the safe option of creating a pod for the node.
if err != nil {
return true, nil
Contributor

@deads2k please log the error
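For example, one possible shape (a sketch, not the actual commit; utilruntime here would be k8s.io/apimachinery/pkg/util/runtime):

```go
// keep the fail-open behavior, but record why we fell back to it
if err != nil {
	utilruntime.HandleError(fmt.Errorf("defaulting to creating a daemon pod for node %q: %v", node.Name, err))
	return true, nil
}
```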

@deads2k (Contributor Author) commented Apr 6, 2018

comments addressed.

@deads2k (Contributor Author) commented Apr 6, 2018

/retest

@tnozicka (Contributor) left a comment

thanks @deads2k
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2018
@deads2k deads2k force-pushed the controller-09-ds branch from 5f7e001 to f74ad81 on April 9, 2018 12:43
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2018
@deads2k (Contributor Author) commented Apr 9, 2018

/retest

@tnozicka (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 10, 2018
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@deads2k (Contributor Author) commented Apr 10, 2018

/retest

1 similar comment
@deads2k (Contributor Author) commented Apr 11, 2018

/retest

@deads2k (Contributor Author) commented Apr 11, 2018

/test all

@deads2k (Contributor Author) commented Apr 11, 2018

/retest

@openshift-ci-robot commented Apr 11, 2018

@deads2k: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/extended_networking_minimal 3a9adbe link /test extended_networking_minimal

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@deads2k (Contributor Author) commented Apr 12, 2018

/retest

@openshift-merge-robot openshift-merge-robot merged commit 516f31f into openshift:master Apr 12, 2018
@deads2k deads2k deleted the controller-09-ds branch July 3, 2018 17:47
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. vendor-update Touching vendor dir or related files

8 participants