UPSTREAM: <carry>: filter daemonset nodes by namespace node selectors #18989
Conversation
comments + this needs integration test
@@ -1311,6 +1318,14 @@ func (dsc *DaemonSetsController) nodeShouldRunDaemonPod(node *v1.Node, ds *exten
			}
		}
	}

	if matches, matchErr := dsc.namespaceNodeSelectorMatches(node, ds); matchErr != nil {
		err = matchErr
return the error immediately; someone can add more code in upstream after that and this error could get silently ignored
		}
	}
}

you need to check the value from master config as well
Given that DSs are restricted to admins and we clearly document that they should disable the project default node selector in the namespace where they create the DS, I wonder whether it's worth carrying this patch for several releases. I was hoping we could find a way to let the DS controller see how the pod would look post-admission and take that into account when matching the selector in general, instead of hard-coding it into the carry. Or something else that could be taken upstream rather than coupling it to OpenShift. I hope we run the upstream e2e suite in our CI.
If someone disables the nodeSelector admission plugin, will this refuse to schedule pods to nodes where it should have scheduled them?
Yes. That's the only time I can think of this optimizing "wrong". It is possible to plumb through from the config we have, but I think the likelihood of that is extremely low.
Force-pushed from 3a9adbe to 64dcf2f
Because of the node requirements in the controller, this is really sticky to write an integration test for. How about unit tests for the matching logic?
Force-pushed from 64dcf2f to 43b0f5e
What problem does this fix? DS not respecting default project node selectors?
Yeah. It's causing a lot of pod and event traffic. This doesn't remove any safety; it is purely an optimization in the DS controller. Upstream they are moving daemonset scheduling rules to the scheduler, and when they do, the scheduler node selector that @aveshagarwal has been working on should be enforced. As part of moving that node selector upstream, I'm hoping for @aveshagarwal to have a plan for us to migrate. This issue makes it slightly more important, but we should remove our pre-existing node selector code.
What happens when upstream moves to using the scheduler?
Our carry conflicts and we do this in a similar spot. At that point it will be very obvious to them that they have the same need to respect node selectors.
Would it still not create an issue where a namespace does not specify any node selectors, but cluster-level (global) default node selectors are set?
				return false
			}
		}
	case !ok && len(dsc.defaultNodeSelectorString) > 0:
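For readers following the diff, here is a minimal self-contained sketch of the matching logic being discussed. The helper name, the openshift.io/node-selector annotation key, and the defaultNodeSelectorString fallback are taken from the snippets in this conversation; the body itself is an illustrative reconstruction under those assumptions, not the PR's actual code.

package daemon

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// namespaceNodeSelectorMatches is an illustrative reconstruction, not the carry's
// actual implementation. It reports whether a node satisfies the namespace's
// project node selector or, if the namespace has none, the cluster default.
func namespaceNodeSelectorMatches(node *v1.Node, ns *v1.Namespace, defaultNodeSelectorString string) (bool, error) {
	selectorString, ok := ns.Annotations["openshift.io/node-selector"]
	switch {
	case ok:
		// the namespace sets a project node selector explicitly (an empty value means "all nodes")
	case len(defaultNodeSelectorString) > 0:
		// no per-namespace value: fall back to the cluster default selector
		selectorString = defaultNodeSelectorString
	default:
		// nothing to enforce, so every node matches
		return true, nil
	}

	selector, err := labels.Parse(selectorString)
	if err != nil {
		return false, err
	}
	return selector.Matches(labels.Set(node.Labels)), nil
}

With that shape, the earlier question about a namespace with no selector of its own but a cluster-level default set is handled by the fallback case.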
@aveshagarwal I think the case is handled right here.
@deads so it seems that it might work with OpenShift's project node selector (which is nodeenv), but it would not work with the upstream podnodeselector, which specifies a cluster-level default selector in its config file, right?
@deads so it seems that it might work with OpenShift's project node selector (which is nodeenv), but it would not work with the upstream podnodeselector, which specifies a cluster-level default selector in its config file, right?
Give me a link and I can plumb that through. In concept this is still ok.
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/plugin/pkg/admission/podnodeselector/admission.go#L257
So it's a semantically different key in a single shared map separated only by namespace validation that prevents capital letters in namespace names? That's really weird. Plumb-able, but really weird.
Sorry for being dense, but I am not sure I understand the weirdness, so if you could explain it more, I would be happy to fix it :-).
So it's a semantically different key in a single shared map separated only by namespace validation that prevents capital letters in namespace names? That's really weird.
Using one map with two semantically different sets of key/value-pairs is weird. It would be better to have a separate field on the struct that indicates that something is a default instead of an override.
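To illustrate that suggestion concretely, here is a hypothetical config shape (not the upstream plugin's actual type) where the default lives in its own field instead of sharing a map with the namespace overrides:

// Hypothetical shape for illustration only; the upstream podnodeselector plugin
// does not define this type.
type nodeSelectorConfig struct {
	// clusterDefault applies only when a namespace has no entry of its own.
	clusterDefault string
	// namespaceOverrides maps namespace name -> selector and holds nothing but overrides.
	namespaceOverrides map[string]string
}

// selectorFor returns the effective selector for a namespace.
func (c nodeSelectorConfig) selectorFor(namespace string) string {
	if s, ok := c.namespaceOverrides[namespace]; ok {
		return s
	}
	return c.clusterDefault
}

Keeping the default out of the shared map removes the dependence on namespace-name validation to keep the two kinds of keys apart.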
I think that case is handled here: https://github.com/openshift/origin/pull/18989/files#r176407073. There's some plumbing code to take it from the config. @smarterclayton another example of "config as status".
/hold I'd like us to talk this through on the architecture call before we commit to it, and talk it through upstream as well, because even with the move to the scheduler I think they will still have the same issue, just with a better consequence: the pod doesn't get to the kubelet that way, but it will still be created for a node that it shouldn't be created for; it just won't be restarted in a loop, only stuck in a pending state. (At this moment, to me, this seems like our only reasonable option to deal with it.)
I don't think that we should let the RBAC mutation live past 3.9.0. Behavioral drift from upstream as observed by the end user seems far worse than this optimization on which pods are created and which aren't.
Sig apps is on Monday and the arch call is on Tuesday, so I don't think this will delay us for 3.9.1. I'd prefer to keep RBAC discussions separate, as solving this is needed anyway and I think we might reach agreement more easily.
This pull is an optimization on the normal creation with a clear path to pushing the changes upstream and eventual re-unification with upstream. The previous "fix" provides neither of those things. @tnozicka are you still waiting for something here? Do you see a practical alternative that is available today, matches the upstream need for moving to core kube, and allows reunification with upstream code and experience? Barring a strong argument against, I expect to remove the hold tomorrow.
	clientset "k8s.io/client-go/kubernetes"
)

func NewNodeSelectorAwareDaemonSetsController(defaultNodeSelector string, namepaceInformer coreinformers.NamespaceInformer, daemonSetInformer extensionsinformers.DaemonSetInformer, historyInformer appsinformers.ControllerRevisionInformer, podInformer coreinformers.PodInformer, nodeInformer coreinformers.NodeInformer, kubeClient clientset.Interface) (*DaemonSetsController, error) {
s/defaultNodeSelector/defaultNodeSelectorString/ so it doesn't get confusing with naming when you start assigning that on L21-L23
	ns, err := dsc.namespaceLister.Get(ds.Namespace)
	if apierrors.IsNotFound(err) {
		return false, err
	}
please handle any other errors returned
(I have a feeling that informers don't produce many other errors, if any, but handling them explicitly is safer.)
		}
	}

	schedulerNodeSelector, ok := ns.Annotations["scheduler.alpha.kubernetes.io/node-selector"]
using
origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/podnodeselector/admission.go, line 41 in 710998e:
var NamespaceNodeSelectors = []string{"scheduler.alpha.kubernetes.io/node-selector"}
}

func (dsc *DaemonSetsController) nodeSelectorMatches(node *v1.Node, ns *v1.Namespace) bool {
	projectNodeSelector, ok := ns.Annotations["openshift.io/node-selector"]
any reason not to use the constant?
origin/pkg/project/apis/project/types.go, line 63 in 4bc612e:
ProjectNodeSelector = "openshift.io/node-selector"
				return false
			}
		}
	}
we need to handle this as well I guess (as discussed above):
podNodeSelectorPluginConfig:
  clusterDefaultNodeSelector: <node-selectors-labels>
  namespace1: <node-selectors-labels>
  namespace2: <node-selectors-labels>
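For context, the pasted config is flat: clusterDefaultNodeSelector sits in the same map as the per-namespace entries, distinguished only because namespace-name validation (no capital letters) keeps a namespace from ever colliding with that key, which is the "single shared map" weirdness mentioned above. A rough sketch of the lookup semantics, illustrative only and not the plugin's actual code:

// effectiveSelector sketches how the flattened podNodeSelectorPluginConfig map
// above resolves a namespace: a namespace-specific entry wins, otherwise the
// clusterDefaultNodeSelector key applies.
func effectiveSelector(pluginConfig map[string]string, namespace string) string {
	if selector, ok := pluginConfig[namespace]; ok {
		return selector
	}
	return pluginConfig["clusterDefaultNodeSelector"]
}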
/hold cancel
@@ -33,7 +33,9 @@ func startDaemonSetController(ctx ControllerContext) (bool, error) {
	if !ctx.AvailableResources[schema.GroupVersionResource{Group: "extensions", Version: "v1beta1", Resource: "daemonsets"}] {
		return false, nil
	}
-	dsc, err := daemon.NewDaemonSetsController(
+	dsc, err := daemon.NewNodeSelectorAwareDaemonSetsController(
@deads2k could we have a test that this is actually plugged-in and stays that way? considering we don't have an integration one
@deads2k could we have a test that this is actually plugged-in and stays that way? considering we don't have an integration one
I commented back here: #18989 (comment). The node dependency actually makes it impractical to add integration tests for it.
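Since an integration test is impractical, here is a sketch of the kind of unit test for the matching logic that was asked for earlier. It targets the illustrative namespaceNodeSelectorMatches helper sketched above in this conversation, not the PR's actual code, so the names and signature are assumptions.

package daemon

import (
	"testing"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TestNamespaceNodeSelectorMatches exercises the illustrative helper above.
func TestNamespaceNodeSelectorMatches(t *testing.T) {
	node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"region": "infra"}}}

	tests := []struct {
		name            string
		nsAnnotations   map[string]string
		defaultSelector string
		want            bool
	}{
		{name: "namespace selector matches", nsAnnotations: map[string]string{"openshift.io/node-selector": "region=infra"}, want: true},
		{name: "namespace selector does not match", nsAnnotations: map[string]string{"openshift.io/node-selector": "region=primary"}, want: false},
		{name: "empty namespace selector matches everything", nsAnnotations: map[string]string{"openshift.io/node-selector": ""}, want: true},
		{name: "no annotation, default matches", defaultSelector: "region=infra", want: true},
		{name: "no annotation, default does not match", defaultSelector: "region=primary", want: false},
		{name: "no annotation, no default", want: true},
	}

	for _, tc := range tests {
		ns := &v1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: "test", Annotations: tc.nsAnnotations}}
		got, err := namespaceNodeSelectorMatches(node, ns, tc.defaultSelector)
		if err != nil {
			t.Errorf("%s: unexpected error: %v", tc.name, err)
			continue
		}
		if got != tc.want {
			t.Errorf("%s: got %v, want %v", tc.name, got, tc.want)
		}
	}
}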
Force-pushed from 43b0f5e to 7b30fd9
@tnozicka comments addressed. I've stubbed in where the kube plugin config can be detected and did all the later plumbing, but the config itself looks like it's been special-cased in multiple locations. I think we can open an issue, and @aveshagarwal probably has an example config he can run through and add wiring for.
@tnozicka ptal. I want to make it for 3.9.1.
@deads2k as it happens I am just reviewing it. I was just about to ask you about the TODO in applyOpenShiftConfigKubeDefaultProjectSelector. Do we ship it for 3.9.1 and fix it later?
}

// this is an optimization. It can be filled in later. Looks like there are several special cases for this plugin upstream
// TODO find this
please make an issue to track it
	if apierrors.IsNotFound(err) {
		return false, err
	}
	// if we had any error, default to the safe option of creating a pod for the node.
I think the safe option is not to create the Pod rather than creating it somewhere it shouldn't be. You can always create it in the next sync but undoing seems worse.
Also you are ignoring that error so we won't know where it's broken.
I think the safe option is not to create the Pod rather than creating it somewhere it shouldn't be. You can always create it in the next sync but undoing seems worse.
Think of this as an optimization. If the method just returned true every time, we'd function correctly and be really slow. What we want is to say false as often as we can without ever doing it falsely.
I tend to think about it as a correctness and security issue (although the kubelet, being the last line of defense here, stops it). The project default node selector is a security feature preventing pods from being scheduled to some nodes. If we schedule them there just because we have encountered an error, that seems wrong.
What we want is to say false as often as we can without ever doing it falsely.
That would result in creating pods on restricted nodes and deleting them when we stop getting errors.
I tend to think about it as a correctness and security issue (although the kubelet, being the last line of defense here, stops it).
I don't think that we should think of our controllers as agents of security. The security feature is that kubelets reject pods that aren't allowed to run on them. The optimization in the controller is to avoid creating pods that will never succeed.
The controller can never be that agent of security, since the point of action is the kubelet.
- According to Clayton we restrict /bind to nodeName, so we shouldn't let the DS controller bind to nodes where it shouldn't.
- Correctness: you shall not create pods for nodes which are not targeted by the DS (after the implications of node selector admission). And this PR is actually about preventing that.
Take the extreme case. You have a 1000-node cluster and a default project nodeSelector targeting 5 of them. Because of an error you just created 1000 pods instead of 5, and you are going to delete those 995 when the error goes away. (Multiply that by the number of DSs in the namespace/cluster.) That doesn't seem right. It puts unwanted load on etcd, the scheduler, ...
I see this PR as fixing correctness, not as an optimization; that's likely why our opinions differ in this particular case.
Correctness: you shall not create pods for nodes which are not targeted by the DS (after the implications of node selector admission). And this PR is actually about preventing that.
That is the wrong bar of correctness in this case. If you were authoring the upstream DS controller, maybe. We're shimming in an optimization, and the DS controller's correctness today is "create all the pods I might need". Changing that here would cause incompatible carry behavior.
We have to fail to true.
We have tests that ensure the kubelet rejects pods that do not match its selector. That is the security boundary. This filtering is an optimization.
Since I believe the upstream behaviour is broken here, I don't agree that we should fall back to the upstream behaviour when we encounter an error with the namespace lister; we should rather wait for the next sync. With this being really a corner case, I am ok with being outvoted here, and I'll ship it once @deads2k adds logging of the error.
	}
	// if we had any error, default to the safe option of creating a pod for the node.
	if err != nil {
		return true, nil
@deads2k please log the error
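What that request might amount to, sketched against the snippet shown above; this is an assumption about the final diff, which is not reproduced here, with apierrors standing for k8s.io/apimachinery/pkg/api/errors and glog for the github.com/golang/glog logger the in-tree controllers used at the time.

	ns, err := dsc.namespaceLister.Get(ds.Namespace)
	if apierrors.IsNotFound(err) {
		return false, err
	}
	if err != nil {
		// fail open, but log so the breakage is visible: create the pod as the
		// upstream controller would, rather than silently skipping the node
		glog.Warningf("unable to look up namespace for daemonset %s/%s, assuming node matches: %v", ds.Namespace, ds.Name, err)
		return true, nil
	}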
Force-pushed from 7b30fd9 to 5f7e001
Comments addressed.
/retest
thanks @deads2k
/lgtm
Force-pushed from 5f7e001 to f74ad81
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k, tnozicka. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
/retest
/test all
/retest
@deads2k: The following test failed, say /retest to rerun it:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
Something to talk about. This places a shim in the upstream controller to check nodes against node selectors to avoid creating extra pods. We can talk about the expense of making the check compared to alternatives, but something concrete to show relative complexity may be useful.
Changing the default policy for all OpenShift installations seems like a comparatively big deal. The cost of this patch is only borne by those who enable the node-limiting features.
@simo5 @mfojtik @tnozicka @liggitt