WIP: Rebase 1.35 #2523
base: master
…-infra-container-image Remove deprecated pod-infra-container-image flag
…ner-status kubelet: add nil check for ContainerStatus in GetContainerStatus
This raises the number of allowed taints per device to 16 by lowering the number of allowed devices to 64 per ResourceSlice if (and only if!) taints are used. "effect: None" and DeviceTaintRule status with conditions get added to support giving feedback to admins. Instead of merely adding the new effect value, this also changes validation of the enum so that unknown values are valid if they were already stored. This will simplify adding new effects in the future because validation won't fail for them after a downgrade. Consumers must treat them like this new None effect, i.e. ignore them.
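The validation change described above can be sketched roughly as follows. This is a minimal illustration, not the actual Kubernetes validation code: the type, the constant names, and the helpers `validateEffect` and `effectiveEffect` are hypothetical names chosen for this example.

```go
package main

import "fmt"

// DeviceTaintEffect is an illustrative stand-in for the API's effect enum.
type DeviceTaintEffect string

const (
	EffectNone       DeviceTaintEffect = "None"
	EffectNoSchedule DeviceTaintEffect = "NoSchedule"
	EffectNoExecute  DeviceTaintEffect = "NoExecute"
)

// knownEffects lists the values this (hypothetical) version understands.
var knownEffects = map[DeviceTaintEffect]bool{
	EffectNone:       true,
	EffectNoSchedule: true,
	EffectNoExecute:  true,
}

// validateEffect accepts every known value. An unknown value is accepted
// only when it is identical to the value that is already stored, which is
// what keeps updates working after a downgrade.
func validateEffect(newEffect DeviceTaintEffect, stored *DeviceTaintEffect) error {
	if knownEffects[newEffect] {
		return nil
	}
	if stored != nil && *stored == newEffect {
		return nil
	}
	return fmt.Errorf("unsupported effect %q", newEffect)
}

// effectiveEffect is what a consumer would act on: unknown values are
// treated like None, i.e. ignored.
func effectiveEffect(e DeviceTaintEffect) DeviceTaintEffect {
	if knownEffects[e] {
		return e
	}
	return EffectNone
}

func main() {
	future := DeviceTaintEffect("SomeFutureEffect")
	fmt.Println("create with unknown effect:", validateEffect(future, nil))
	fmt.Println("update keeping stored unknown effect:", validateEffect(future, &future))
	fmt.Println("consumer sees:", effectiveEffect(future))
}
```

The design choice is that only writes which introduce an unknown value are rejected; objects that already carry one stay updatable.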
When the ResourceSlice no longer exists, the ResourceSlice tracker didn't and couldn't report the tainted devices even if they are allocated and in use. The controller must keep track of DeviceTaintRules itself and handle this scenario.

In this scenario it is impossible to evaluate CEL expressions because the necessary device attributes aren't available. We could:
- Copy them in the allocation result: too large, big change.
- Limit usage of CEL expressions to rules with no eviction: inconsistent.
- Remove the fields which cannot be supported well.

The last option is chosen.

The tracker is now no longer needed by the eviction controller. Reading directly from the informer means that we cannot assume that pointers are consistent. We have to track ResourceSlices by their name, not their pointer.
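A minimal sketch of tracking by name instead of by pointer, assuming a trimmed-down `slice` type and a hypothetical `sliceTracker` helper (neither is the controller's real data structure):

```go
package main

import "fmt"

// slice is a hypothetical, trimmed-down stand-in for a ResourceSlice.
type slice struct {
	Name    string
	Devices []string
}

// sliceTracker indexes slices by name. Keying by pointer would treat every
// new object delivered by an informer as a different slice, even when it
// describes the same one.
type sliceTracker struct {
	byName map[string]*slice
}

func newSliceTracker() *sliceTracker {
	return &sliceTracker{byName: map[string]*slice{}}
}

// upsert stores the latest object for a name and reports whether that name
// was already tracked.
func (t *sliceTracker) upsert(s *slice) (existed bool) {
	_, existed = t.byName[s.Name]
	t.byName[s.Name] = s
	return existed
}

// remove forgets a slice by name.
func (t *sliceTracker) remove(name string) {
	delete(t.byName, name)
}

func main() {
	tracker := newSliceTracker()
	first := &slice{Name: "node-a-gpus", Devices: []string{"gpu-0"}}
	second := &slice{Name: "node-a-gpus", Devices: []string{"gpu-0", "gpu-1"}}

	tracker.upsert(first)
	existed := tracker.upsert(second) // different pointer, same slice
	fmt.Println("same pointer:", first == second)
	fmt.Println("recognized as the same slice:", existed)
	fmt.Println("tracked slices:", len(tracker.byName))
}
```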
While at it, ensure that future unknown effects are treated like the None effect.
We know how often the controller should get a pod, let's check it. Must run before we do our own GET call.
The approach copied from node taint eviction was to fire off one goroutine per pod at the intended time. This leads to the "thundering herd" problem: when a single taint causes eviction of several pods and those all have no or the same toleration grace period, then they all get deleted at the same time. For node taint eviction that is limited by the number of pods per node, which is typically ~100.

In an integration test, that already led to problems with watchers:

    cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
    cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false

It also causes spikes in memory consumption (mostly the 2KB stack per goroutine plus closure) with no upper limit.

Using a workqueue makes concurrency more deterministic because there is an upper limit. In the integration test, 10 workers kept the watch active.

Another advantage is that failures to evict the pod get retried forever, with exponential backoff per affected pod. Previously, evicting was tried a few times at a fixed rate and then the controller gave up. If the apiserver was down long enough, pods didn't get evicted.
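The bounded-concurrency idea can be illustrated with a standard-library sketch. It is a stand-in for a real workqueue, not the controller's implementation: `evictAll`, `backoff`, `simulate`, the pod names, and the delay values are all assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// backoff returns base doubled once per earlier failure, capped at limit.
func backoff(failures int, base, limit time.Duration) time.Duration {
	d := base
	for i := 1; i < failures && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		return limit
	}
	return d
}

// evictAll calls evict for every pod with at most `workers` calls in flight.
// A failed pod is re-queued after its own backoff delay and retried until
// it succeeds.
func evictAll(pods []string, workers int, base, limit time.Duration, evict func(pod string) error) {
	type item struct {
		pod      string
		failures int
	}
	// Each pod occupies at most one slot, so re-queueing never blocks.
	queue := make(chan item, len(pods))
	var pending sync.WaitGroup
	pending.Add(len(pods))
	for _, p := range pods {
		queue <- item{pod: p}
	}
	for w := 0; w < workers; w++ {
		go func() {
			for it := range queue {
				if err := evict(it.pod); err != nil {
					retry := item{pod: it.pod, failures: it.failures + 1}
					time.AfterFunc(backoff(retry.failures, base, limit), func() { queue <- retry })
					continue
				}
				pending.Done()
			}
		}()
	}
	pending.Wait() // every pod is evicted, so no retry timer is outstanding
	close(queue)
}

// simulate evicts n fake pods (n >= 1). "pod-0" fails twice before it
// succeeds. It returns the highest observed concurrency and the number of
// attempts made for "pod-0".
func simulate(n, workers int) (maxActive, flakyAttempts int) {
	var mu sync.Mutex
	active := 0
	pods := make([]string, n)
	for i := range pods {
		pods[i] = fmt.Sprintf("pod-%d", i)
	}
	evictAll(pods, workers, time.Millisecond, 4*time.Millisecond, func(pod string) error {
		mu.Lock()
		active++
		if active > maxActive {
			maxActive = active
		}
		attempt := 0
		if pod == "pod-0" {
			flakyAttempts++
			attempt = flakyAttempts
		}
		mu.Unlock()

		time.Sleep(time.Millisecond) // pretend to talk to the apiserver

		mu.Lock()
		active--
		mu.Unlock()
		if pod == "pod-0" && attempt < 3 {
			return errors.New("apiserver unavailable")
		}
		return nil
	})
	return maxActive, flakyAttempts
}

func main() {
	maxActive, attempts := simulate(50, 10)
	fmt.Println("max concurrent evictions:", maxActive)
	fmt.Println("attempts for the flaky pod:", attempts)
}
```

However many pods become due at once, the number of in-flight evictions never exceeds the worker count, and a failing pod only delays itself.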
Signed-off-by: Swati Sehgal <[email protected]>
This feature gate was meant to be ephemeral. It was only used to guarantee that a cluster admin didn't accidentally relax PSA policies before the kubelet would deny creating a pod if it didn't support user namespaces. As of kube 1.33, the supported apiserver version skew of n-3 guarantees that all supported kubelets are 1.30 or later, meaning they do this. Now we can unconditionally relax PSA policy if a pod is in a user namespace. This PR preserves older policies' default behavior by never relaxing
Signed-off-by: Peter Hunt <[email protected]>
As well as fix procMount restricted test
Signed-off-by: Peter Hunt <[email protected]>
kubelet/userns: Print podUID on errors
[InPlacePodVerticalScaling] Reformat a couple of e2e tests
…gconditions-flake test: fix flake in DRA DeviceBindingCondition
…or kubeadm v1beta4 config
Signed-off-by: Anish Ramasekar <[email protected]>
…wks_metric_comment oidc: fix jwks metric name in comment
Signed-off-by: Sunyanan Choochotkaew <[email protected]>
Signed-off-by: Omer Aplatony <[email protected]>
Specifically the new AddTreeConstructionNodeArgsTransformer and SpecPriority in Ginkgo will be useful. Gomega gets updated to keep up-to-date.
This avoids the risk of having a slow test started towards the end of a run, which then would cause the run to take longer. When started early they can run in parallel to other tests. In serial runs it doesn't matter. The implementation maps the Slow label to the new ginkgo.SpecPriority. The default is 0. Tests with priority 1 run first.
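The ordering rule can be sketched without Ginkgo itself. The `spec` type and the helpers `priorityOf` and `orderSpecs` below are hypothetical and only model the label-to-priority mapping; they are not Ginkgo's API.

```go
package main

import (
	"fmt"
	"sort"
)

// spec is a hypothetical, minimal model of a test with labels.
type spec struct {
	Name   string
	Labels []string
}

// priorityOf maps the Slow label to priority 1. Everything else keeps the
// default priority 0.
func priorityOf(s spec) int {
	for _, l := range s.Labels {
		if l == "Slow" {
			return 1
		}
	}
	return 0
}

// orderSpecs returns a copy in which higher-priority specs come first.
// The sort is stable, so specs with equal priority keep their order.
func orderSpecs(specs []spec) []spec {
	out := append([]spec(nil), specs...)
	sort.SliceStable(out, func(i, j int) bool {
		return priorityOf(out[i]) > priorityOf(out[j])
	})
	return out
}

func main() {
	specs := []spec{
		{Name: "fast-a"},
		{Name: "slow-b", Labels: []string{"Slow"}},
		{Name: "fast-c"},
	}
	for _, s := range orderSpecs(specs) {
		fmt.Println(s.Name, "priority", priorityOf(s))
	}
}
```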
E2E: run slow tests first, using new Ginkgo
…raint [DRA] Fix DistinctAttributeConstraint match comparison with value
As described in https://go.dev/issue/74633 a misbehaving proxy server can cause memory exhaustion in the client. The fix in https://go.dev/cl/698915 only covers http.Transport users. Apply the same limit here. Limit the size of the response headers the proxy server can send us.
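One way to apply such a limit when reading a proxy's response by hand is sketched below. The helper names (`readProxyResponse`, `parse`, `oversizedResponse`) and the 4096-byte limit are assumptions for illustration; this is not necessarily how the actual change is implemented.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// readProxyResponse parses an HTTP response from conn but refuses to read
// more than maxHeaderBytes in total while doing so. A proxy that keeps
// sending header data runs into the limit instead of exhausting memory.
func readProxyResponse(conn io.Reader, maxHeaderBytes int64) (*http.Response, error) {
	limited := &io.LimitedReader{R: conn, N: maxHeaderBytes}
	resp, err := http.ReadResponse(bufio.NewReader(limited), nil)
	if err != nil {
		if limited.N <= 0 {
			return nil, fmt.Errorf("proxy response headers exceed %d bytes: %w", maxHeaderBytes, err)
		}
		return nil, err
	}
	return resp, nil
}

// oversizedResponse builds a response with one header value of n bytes.
func oversizedResponse(n int) string {
	return "HTTP/1.1 200 Connection established\r\nX-Padding: " +
		strings.Repeat("a", n) + "\r\n\r\n"
}

// parse feeds raw response text through readProxyResponse.
func parse(raw string, limit int64) error {
	_, err := readProxyResponse(strings.NewReader(raw), limit)
	return err
}

func main() {
	fmt.Println("small response:", parse("HTTP/1.1 200 Connection established\r\n\r\n", 4096))
	fmt.Println("huge response:", parse(oversizedResponse(1<<20), 4096))
}
```

The limit counts every byte read while parsing, and the buffered reader may read past the end of the headers, so a real client would keep using the same `bufio.Reader` for whatever data follows.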
This reverts commit cff07e7.

The commit caused several kubeadm jobs to fail while executing all conformance tests (including slow ones) in parallel. Sometimes execution took longer and ran into the overall timeout, sometimes there was:

    [FAILED] Expected
        <int>: 440
    to be ==
        <int>: 400
    In [It] at: k8s.io/kubernetes/test/e2e/apimachinery/chunking.go:202

It looks like the tests are flaky and/or reveal a real bug when all slow tests run in parallel. This should work, but doesn't right now, so let's revert until that problem is fixed.
Temporarily disable volume group snapshot (VGS) tests. We need to rebase the external-snapshotter, which updates the VGS API in an incompatible way (v1beta1 -> v1beta2 without automatic conversion). The API is TechPreviewNoUpgrade, so its breakage won't affect any customer. We need to update all components that provide or use the API while the tests are off; otherwise the tests would fail.
… is available in the already-imported fscommon package Squash into `UPSTREAM: <carry>: disable load balancing on created cgroups when managed is enabled` during review phases of rebase.
This appears to have been missed in a previous `UPSTREAM: <carry>: remove annotation framework in favor of environment selectors` commit (PR# 2393). It should be squashed during the next rebase.
…EnableResourceSizeEstimation Squash into `UPSTREAM: <carry>: retry etcd Unavailable errors` when appropriate during the rebase cycle.
Squash into `UPSTREAM: <carry>: provide events, messages, and bodies for probe failures of important pods` when appropriate.
…roller descriptor to new style There is a new style upstream for setting up controllers and controller descriptors in this file. Squash with `UPSTREAM: <carry>: Ensure service ca is mounted for projected tokens` when appropriate.
78413b2 to b6fdf6f
b6fdf6f to 58f2fd2
58f2fd2 to 83496b8
83496b8 to 916de9e
…if there are none. Squash into `UPSTREAM: <carry>: Add plugin for storage performant security policy` when appropriate.
Squash into tooling (later)
… from kubeadm" This reverts commit 400f8ec. Drop when all OCP infra has removed the usage of the flag.
… from cluster/gce" This reverts commit 784b842. Drop when all OCP infra has removed the usage of the flag.
916de9e to 5fad7c5
5fad7c5 to a14e4ae
No description provided.