
Conversation

@jacobsee
Member

No description provided.

k8s-ci-robot and others added 30 commits October 31, 2025 10:10
…-infra-container-image

Remove deprecated pod-infra-container-image flag
…ner-status

kubelet: add nil check for ContainerStatus in GetContainerStatus
This raises the number of allowed taints per device to 16 by lowering
the number of allowed devices to 64 per ResourceSlice if (and only if!)
taints are used.

"effect: None" and DeviceTaintRule status with conditions get added
to support giving feedback to admins.

Instead of merely adding the new effect value, this also changes validation of
the enum so that unknown values are valid if they were already stored. This
will simplify adding new effects in the future because validation won't fail
for them after a downgrade. Consumers must treat them like this new None
effect, i.e. ignore them.
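
A minimal sketch of that ratcheting check, in plain Go rather than the actual
Kubernetes validation code; the function name and the exact set of known effects
are illustrative:

    package main

    import "fmt"

    // validateEffect illustrates the ratcheting enum check: unknown effect
    // values are rejected on create, but accepted on update when the stored
    // object already carries them, so a downgrade keeps validating objects
    // written by a newer release.
    func validateEffect(newEffect, oldEffect string) error {
        known := map[string]bool{
            "NoSchedule": true,
            "NoExecute":  true,
            "None":       true,
        }
        if known[newEffect] {
            return nil
        }
        if newEffect != "" && newEffect == oldEffect {
            // Already stored: tolerate the unknown value instead of failing.
            return nil
        }
        return fmt.Errorf("unsupported effect %q", newEffect)
    }

    func main() {
        fmt.Println(validateEffect("None", ""))         // <nil>: known value
        fmt.Println(validateEffect("Future", "Future")) // <nil>: already stored
        fmt.Println(validateEffect("Future", ""))       // error: unknown on create
    }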
When a ResourceSlice no longer exists, the ResourceSlice tracker cannot report
the tainted devices even if they are still allocated and in use. The
controller must keep track of DeviceTaintRules itself and handle this scenario.

In this scenario it is impossible to evaluate CEL expressions because the
necessary device attributes aren't available. We could:
- Copy them in the allocation result: too large, big change.
- Limit usage of CEL expressions to rules with no eviction: inconsistent.
- Remove the fields which cannot be supported well.

The last option is chosen.

The tracker is no longer needed by the eviction controller. Reading
directly from the informer means that we cannot assume that pointers are
consistent. We have to track ResourceSlices by their name, not their pointer.
While at it, ensure that future unknown effects are treated like
the None effect.
We know how often the controller should get a pod, so let's check it.
Must run before we do our own GET call.
The approach copied from node taint eviction was to fire off one goroutine per
pod at the intended time. This leads to the "thundering herd" problem: when a
single taint causes eviction of several pods and they all have the same (or no)
toleration grace period, they all get deleted concurrently at the same time.

For node taint eviction that is limited by the number of pods per node, which
is typically ~100. In an integration test, that already led to problems with
watchers:

   cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
   cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false

It also causes spikes in memory consumption (mostly the 2KB stack per goroutine
plus closure) with no upper limit.

Using a workqueue makes concurrency more deterministic because there is an
upper limit. In the integration test, 10 workers kept the watch active.

Another advantage is that failures to evict the pod get retried with
exponential backoff per affected pod forever. Previously, eviction was tried a
few times at a fixed rate and then the controller gave up. If the apiserver
was down long enough, pods didn't get evicted.
Signed-off-by: Swati Sehgal <[email protected]>
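
A minimal sketch of the new pattern, assuming client-go's workqueue package;
evictPod, the worker count, and the backoff values are illustrative rather than
the controller's actual code:

    package main

    import (
        "fmt"
        "time"

        "k8s.io/client-go/util/workqueue"
    )

    // evictPod is a hypothetical stand-in for the API call that evicts one pod.
    func evictPod(key string) error {
        fmt.Println("evicting", key)
        return nil
    }

    func main() {
        // Per-item exponential backoff: first retry after 5ms, capped at 5 minutes.
        queue := workqueue.NewRateLimitingQueue(
            workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 5*time.Minute))

        // A fixed number of workers bounds concurrency, instead of one goroutine
        // per pod all firing at the same moment.
        const workers = 10
        for i := 0; i < workers; i++ {
            go func() {
                for {
                    item, shutdown := queue.Get()
                    if shutdown {
                        return
                    }
                    key := item.(string)
                    if err := evictPod(key); err != nil {
                        // Retry this pod forever with exponential backoff.
                        queue.AddRateLimited(key)
                    } else {
                        queue.Forget(key)
                    }
                    queue.Done(item)
                }
            }()
        }

        // Deletions that should happen later can be enqueued with AddAfter.
        queue.Add("default/some-pod")
        time.Sleep(time.Second)
        queue.ShutDown()
    }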
This feature gate was meant to be ephemeral. It was only used to guarantee that a
cluster admin didn't accidentally relax PSA policies before kubelets would deny
creating a pod with user namespaces when they don't support them. As of kube 1.33,
the supported apiserver version skew of n-3 guarantees that all supported kubelets
are 1.30 or later, meaning they do this.

Now, we can unconditionally relax PSA policy if a pod is in a user namespace.

This PR preserves the default behavior of older policies by never relaxing them.

Signed-off-by: Peter Hunt <[email protected]>
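
A minimal sketch of the relaxation condition, assuming the admission side looks
at pod.Spec.HostUsers; podUsesUserNamespace is a hypothetical helper, not the
actual pod-security code:

    package podsecurity

    import corev1 "k8s.io/api/core/v1"

    // podUsesUserNamespace reports whether the pod opts into a user namespace
    // (hostUsers: false). For such pods the restricted profile's UID and
    // root-related checks can be relaxed, because root inside the pod is not
    // root on the host.
    func podUsesUserNamespace(pod *corev1.Pod) bool {
        return pod.Spec.HostUsers != nil && !*pod.Spec.HostUsers
    }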
As well as fix procMount restricted test
[InPlacePodVerticalScaling] Reformat a couple of e2e tests
…gconditions-flake

 test: fix flake in DRA DeviceBindingCondition
…wks_metric_comment

oidc: fix jwks metric name in comment
Signed-off-by: Omer Aplatony <[email protected]>
Specifically, the new AddTreeConstructionNodeArgsTransformer and SpecPriority in
Ginkgo will be useful.

Gomega gets updated to keep up-to-date.
This avoids the risk of a slow test being started towards the end of a run,
which would then cause the run to take longer. When started early, slow tests can
run in parallel with other tests. In serial runs it doesn't matter.

The implementation maps the Slow label to the new ginkgo.SpecPriority. The
default is 0. Tests with priority 1 run first.
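
A rough sketch of that mapping; it assumes ginkgo.SpecPriority is an int-valued
spec decorator (the exact form in the new Ginkgo release may differ), and
withPriority is a hypothetical helper, not the e2e framework's actual code:

    package framework

    import "github.com/onsi/ginkgo/v2"

    // withPriority turns framework labels into Ginkgo decorators and bumps the
    // priority of Slow specs so Ginkgo schedules them before the default
    // priority-0 specs.
    func withPriority(labels ...string) []interface{} {
        args := []interface{}{ginkgo.Label(labels...)}
        for _, l := range labels {
            if l == "Slow" {
                args = append(args, ginkgo.SpecPriority(1))
                break
            }
        }
        return args
    }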
E2E: run slow tests first, using new Ginkgo
…raint

[DRA] Fix DistinctAttributeConstraint match comparison with value
As described in https://go.dev/issue/74633, a misbehaving proxy server
can cause memory exhaustion in the client. The fix in https://go.dev/cl/698915
only covers http.Transport users. Apply the same limit here.

Limit the size of the response headers the proxy server can send us.
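
A minimal sketch of that kind of limit, assuming a dialer that sends the CONNECT
request itself and reads the proxy's reply outside http.Transport; the 1 MiB cap
and the function name are illustrative:

    package proxydial

    import (
        "bufio"
        "fmt"
        "io"
        "net"
        "net/http"
    )

    const maxProxyResponseBytes = 1 << 20 // 1 MiB, illustrative

    // readProxyConnectResponse reads the proxy's reply to a CONNECT request
    // without buffering more than maxProxyResponseBytes, so a misbehaving proxy
    // cannot exhaust client memory with huge response headers. Handling of any
    // bytes buffered beyond the headers is elided in this sketch.
    func readProxyConnectResponse(conn net.Conn, connectReq *http.Request) (*http.Response, error) {
        limited := io.LimitReader(conn, maxProxyResponseBytes)
        resp, err := http.ReadResponse(bufio.NewReader(limited), connectReq)
        if err != nil {
            return nil, fmt.Errorf("reading proxy response: %w", err)
        }
        return resp, nil
    }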
This reverts commit cff07e7.

The commit caused several kubeadm jobs to fail while executing all conformance
tests (including slow ones) in parallel. Sometimes execution took longer and
ran into the overall timeout, sometimes there was:

    [FAILED] Expected
        <int>: 440
    to be ==
        <int>: 400
    In [It] at: k8s.io/kubernetes/test/e2e/apimachinery/chunking.go:202

It looks like the tests are flaky and/or reveal a real bug when slow tests run
all in parallel at the same time.

This should work, but doesn't right now, so let's revert until that problem is fixed.
ardaguclu and others added 8 commits November 19, 2025 20:58
Temporarily disable volume group snapshot (VGS) tests. We need to rebase the
external-snapshotter that updates VGS API in an incompatible way (v1beta1
-> v1beta2 without automatic conversion). The API is TechPreviewNoUpgrade,
so its breakage won't affect any customer.

We need to update all components that provide or use the API while the
tests are off; otherwise the tests would fail spectacularly.
… is available in the already-imported fscommon package

Squash into `UPSTREAM: <carry>: disable load balancing on created cgroups when managed is enabled` during review phases of rebase.
This appears to have been missed in a previous `UPSTREAM: <carry>: remove annotation framework in favor of environment selectors` commit (PR# 2393). It should be squashed during the next rebase.
…EnableResourceSizeEstimation

Squash into `UPSTREAM: <carry>: retry etcd Unavailable errors` when appropriate during the rebase cycle.
Squash into `UPSTREAM: <carry>: provide events, messages, and bodies for probe failures of important pods` when appropriate.
…roller descriptor to new style

There is a new style upstream for setting up controllers and controller descriptors in this file. Squash with `UPSTREAM: <carry>: Ensure service ca is mounted for projected tokens` when appropriate.
@openshift-merge-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Nov 20, 2025
@openshift-ci-robot

@jacobsee: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.


…if there are none.

Squash into `UPSTREAM: <carry>: Add plugin for storage performant security policy` when appropriate.
Squash into tooling (later)
… from kubeadm"

This reverts commit 400f8ec.
Drop when all OCP infra has removed the usage of the flag.
… from cluster/gce"

This reverts commit 784b842.
Drop when all OCP infra has removed the usage of the flag.
@openshift-ci-robot

@jacobsee: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.


@openshift-ci

openshift-ci bot commented Nov 22, 2025

@jacobsee: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-serial-1of2 a14e4ae link true /test e2e-aws-ovn-serial-1of2
ci/prow/e2e-aws-ovn-crun a14e4ae link true /test e2e-aws-ovn-crun
ci/prow/e2e-aws-ovn-techpreview-serial-2of2 a14e4ae link false /test e2e-aws-ovn-techpreview-serial-2of2
ci/prow/e2e-aws-crun-wasm a14e4ae link true /test e2e-aws-crun-wasm
ci/prow/e2e-aws-ovn-fips a14e4ae link true /test e2e-aws-ovn-fips
ci/prow/e2e-aws-ovn-hypershift a14e4ae link true /test e2e-aws-ovn-hypershift
ci/prow/e2e-gcp a14e4ae link true /test e2e-gcp
ci/prow/e2e-aws-ovn-techpreview-serial-1of2 a14e4ae link false /test e2e-aws-ovn-techpreview-serial-1of2
ci/prow/e2e-aws-ovn-techpreview a14e4ae link false /test e2e-aws-ovn-techpreview
ci/prow/k8s-e2e-conformance-aws a14e4ae link true /test k8s-e2e-conformance-aws
ci/prow/k8s-e2e-gcp-ovn a14e4ae link true /test k8s-e2e-gcp-ovn
ci/prow/e2e-metal-ipi-ovn-ipv6 a14e4ae link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-aws-ovn-upgrade a14e4ae link true /test e2e-aws-ovn-upgrade
ci/prow/e2e-aws-ovn-runc a14e4ae link true /test e2e-aws-ovn-runc
ci/prow/e2e-azure-ovn-upgrade a14e4ae link true /test e2e-azure-ovn-upgrade
ci/prow/unit a14e4ae link true /test unit
ci/prow/e2e-aws-ovn-cgroupsv2 a14e4ae link true /test e2e-aws-ovn-cgroupsv2
ci/prow/e2e-aws-ovn-serial-2of2 a14e4ae link true /test e2e-aws-ovn-serial-2of2
ci/prow/e2e-aws-ovn-downgrade a14e4ae link true /test e2e-aws-ovn-downgrade
ci/prow/k8s-e2e-gcp-serial a14e4ae link true /test k8s-e2e-gcp-serial

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

backports/unvalidated-commits: Indicates that not all commits come from merged upstream PRs.
do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
vendor-update: Touching vendor dir or related files
