
Conversation

@yuanwang04 (Contributor) commented Sep 22, 2025

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Sep 22, 2025
@k8s-ci-robot added the kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory), sig/node (relevant to SIG Node), and size/XL (changes 500-999 lines, ignoring generated files) labels on Sep 22, 2025

## Motivation

While KEP-5307 introduces container-level restart policies, there are scenarios where restarting the entire pod is more desirable than restarting a single container.
Member

Another request we had was from Agones: when a game session is small enough to run in seconds, restarting the game in-place is much more efficient than rescheduling.

@zmerlynn

@yuanwang04 force-pushed the restart-pod branch 6 times, most recently from b20cb15 to 4c346ba, on October 2, 2025 at 22:42

## Summary

This KEP proposes an extension to the container restart rules introduced in [KEP-5307](https://github.com/kubernetes/enhancements/issues/5307) to allow a container's exit to trigger a restart of the entire pod. This is part of the Pod / Container Restart roadmap planned earlier (see [discussion](https://docs.google.com/document/d/1UmJHJzdmMA1hWwkoP1f3rG9nS0oZ2cRcOx8rO8MsExA/edit?resourcekey=0-OuKspBji_1KJlj2JbnZkgQ&tab=t.0#heading=h.e2gx4umk9f64)). This "in-place" pod restart will terminate and then restart all of the pod's containers (including init and sidecar containers) while preserving the pod's sandbox, UID, network namespace, attached devices, and IP address. This provides a more efficient way to reset a pod's state compared to deleting and recreating the pod, which is particularly beneficial for workloads like AI/ML training, where rescheduling is costly.
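For illustration, a minimal sketch of how such a rule might look on a container spec. The `restartPolicyRules` and `exitCodes` field names are assumed from KEP-5307 and the `RestartAllContainers` action from this KEP; the exact shape here is a guess, not the final API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: registry.example.com/trainer:v1  # illustrative image
    restartPolicyRules:                      # field name assumed from KEP-5307
    - action: RestartAllContainers           # proposed action: restart every container in-place
      exitCodes:
        operator: In
        values: [42]                         # illustrative "restart the whole pod" exit code
```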
Member

How about locally-managed volumes? Will this wipe and recreate those?

In particular: If I wrote a bunch of stuff to a disk-backed emptyDir volume, will it be preserved? What about a memory-backed (tmpfs) emptyDir ?

I can see arguments in both directions, but I think "restart the pod" also means "restart all the volumes", which may require extending the volume drivers with a "Reset" method or something.

Member

Also what about other durable things like IPC namespace, which might have shared memory allocated - that should probably be torn down and reset?

How about userns mappings which might affect volume mounts?

Member

How about image pull - if I use :latest or "Always", does it do a re-pull?

Member

@msau42 can you advise what else can be broken without remounting volumes?

@rata @giuseppe can you help with user namespaces question?

I think there are two questions about mounts: whether we break something without a remount, and whether we want cleanup semantically.

I see a lot of benefit in leaving all the volumes as-is. Semantically, app owners will just know to clean those up when needed. It also opens the door for state exchange between restarts, saves time on restart, and is aligned with how the Pod's other resources are handled. We can opt volumes in to cleanup based on some flag.

But it is interesting to know what can break. My original assumption was that nothing should break, since the flow is almost the same as "all containers suddenly crashed and we are restarting them". If we know of cases that break, we may need to introduce a flag on volumes to require a remount from the beginning (or force a remount based on other conditions).
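For concreteness, one purely hypothetical shape such an opt-in could take; the `restartBehavior` field below is invented for illustration and is not part of this KEP or any existing API:

```yaml
volumes:
- name: scratch
  emptyDir: {}
  # Hypothetical, invented field: ask the kubelet to wipe and remount this volume
  # on an in-place restart instead of preserving its contents.
  restartBehavior: Recreate
```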

Member

> How about image pull - if I use :latest or "Always", does it do a re-pull?

The promise the kubelet gives is that each restart will double-check the image, so it may result in a re-pull.
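As a reminder, the cases where that check can turn into a re-pull are an explicit `imagePullPolicy: Always` or a `:latest` tag (which defaults the policy to `Always`); a small illustrative fragment:

```yaml
containers:
- name: app
  image: registry.example.com/app:latest  # :latest defaults imagePullPolicy to Always
  imagePullPolicy: Always                  # the kubelet re-checks the registry on every container start
```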

Member

@SergeyKanzhelev @thockin The userns is created with the pod sandbox. So, this should be completely fine regarding userns and mappings.

Regarding the other points, IMHO I'd start without cleaning up volumes like emptyDir and so on. That can be more involved; I'd only do it in the future if the use case arises. Those are my 2c :)

Member

Mostly I want it to be clearly spelled out, without any "what about..." questions left. As in my other comment about a diagram: that would answer most of the questions, if it is complete enough.


Many applications rely on init containers to prepare the environment, such as mounting volumes with gcsfuse or performing other setup tasks. When a container fails, a full pod restart ensures that these init containers are re-executed, guaranteeing a clean and correctly configured environment for the new set of application containers.

Another scenario is when the init container "takes the next item from the queue" and the main container exits with an indication that it wants a new item to process. See also
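For illustration only, a rough sketch of this work-queue pattern under the proposed action; the `restartPolicyRules` field name is assumed from KEP-5307, and the images, paths, and exit code are made up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: queue-worker
spec:
  restartPolicy: Never
  initContainers:
  - name: fetch-next-item                 # re-run on every in-place restart
    image: registry.example.com/queue-fetcher:v1
    volumeMounts:
    - name: work
      mountPath: /work
  containers:
  - name: worker
    image: registry.example.com/worker:v1
    volumeMounts:
    - name: work
      mountPath: /work
    restartPolicyRules:                   # field name assumed from KEP-5307
    - action: RestartAllContainers        # re-runs the init container so a new item is fetched
      exitCodes:
        operator: In
        values: [42]                      # illustrative "give me a new item" exit code
  volumes:
  - name: work
    emptyDir: {}                          # shared between the init and main containers
```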
Member

Is this a common pattern? There are certainly better ways to do this...


### Goals

- Introduce a `RestartPod` action to the `ContainerRestartRule` API.
Member

I would REALLY like to see a state-diagram or flowchart showing how restartPolicy and rules impact the lifecycle.

I whipped up https://docs.google.com/presentation/d/17SD8kGB8DrZut4CFg8zPjhD9mRb5HF3H4jS9k5uqaj8/edit?slide=id.p#slide=id.p as an example of what I am looking for. I am not sure about some of the transitions, which is exactly WHY such a diagram is needed. I am sure I am missing details that would be useful (e.g. probes).

If you made a sequential list of EVERY step that happens during pod start, I want to know how much gets rewound by this action and how much is retained.

Can I make it a condition of this KEP approval to produce such a diagram and check it in to ... SOMEWHERE useful?

Contributor Author

Definitely reasonable; I made https://docs.google.com/presentation/d/1im23lGh_TgsMKzPMPi_Epm1_TJcU9jRRnZWL-UfEp8U/edit?slide=id.p#slide=id.p to show how the new RestartAllContainers action would change the existing logic. I focused on the SyncPod action because it does not affect other parts of the pod lifecycle (e.g. creation / termination). Also, this does not include remounting the volumes.

I'm not sure what's the best way to persist this; I'll link it to the KEP afterwards. Let me know if you have any questions or concerns. Thanks!

Member

I reviewed your state diagram. Maybe I read it wrong, but something related to the init container doesn't look correct to me. The rest of the state diagram looks fine to me, and thank you for creating it. I am ok with linking this to the KEP afterward.

Contributor Author

Added some comments to the diagram. I see the confusion about init containers. If an init container fails (and is not restartable), the pod is considered Failed and does not enter the runtime.SyncPod cycle; it is handled by the termination loop instead.

@yuanwang04 changed the title from KEP-5532: Proposal for RestartPod action to KEP-5532: Proposal for RestartAllContainers action on Oct 9, 2025
@k8s-ci-robot added the size/XS label (changes 0-9 lines, ignoring generated files) and removed the size/XL label (changes 500-999 lines, ignoring generated files) on Oct 9, 2025
@thockin (Member) commented Oct 10, 2025

There are no files in this PR? I am assuming you renamed them and forgot to add?

@wojtek-t (Member)

> There are no files in this PR? I am assuming you renamed them and forgot to add?

+1 - I'm not able to review it :)

@k8s-ci-robot added the size/XL label (changes 500-999 lines, ignoring generated files) and removed the size/XS label (changes 0-9 lines, ignoring generated files) on Oct 13, 2025
@wojtek-t (Member)

/lgtm
/approve PRR

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Oct 14, 2025
@yuanwang04 (Contributor Author)

/cc @dchen1107


**Re-run with Init Containers**

Many applications rely on init containers to prepare the environment, such as mounting volumes with gcsfuse or performing other setup tasks. When a container fails, a full pod restart ensures that these init containers are re-executed, guaranteeing a clean and correctly configured environment for the new set of application containers.


We can 'guarantee' a clean state only on the assumption that initContainers are idempotent. This may well be true in many cases, but if there are side effects in the sandbox, this may result in initContainer failures causing stuck pods. Maybe an exitAction API on the onExit rule could help?

Member

Yes, the terminate actions are in the future plans.

In general, init containers have to be re-entrant because the kubelet never guarantees that it will not rerun them. Furthermore, this is an opt-in feature, and there is a way for the end user to verify that a non-re-entrant init container is not used with a pod restart rule.
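A minimal sketch of what re-entrant means in practice for an init container, assuming volumes are preserved across the in-place restart; the image, paths, and marker file are illustrative only:

```yaml
initContainers:
- name: setup
  image: busybox:1.36
  command:
  - sh
  - -c
  - |
    # Re-entrant: skip the setup if a previous run already completed it.
    if [ ! -f /work/.initialized ]; then
      echo "preparing environment" > /work/config   # stand-in for the real setup work
      touch /work/.initialized
    fi
  volumeMounts:
  - name: work
    mountPath: /work
```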

@dchen1107 (Member)

We discussed the missing remounting capability for this feature at today's SIG Node meeting. I strongly recommend deferring volume cleanup/remount and clearly documenting the new container contract. Keeping volumes as-is saves significant restart time and aligns with the concept of a single Pod sandbox. It enables use cases like shared memory for fast state exchange between application container restarts.

There might be some caveats without the remount capability, and one concern was raised about init containers. My counter-argument is that, by design, the Kubernetes documentation has always required init containers to be idempotent, since they are subject to retries on failure and re-execution on a full Pod restart.

Let's figure out the clear contract in beta: 1) idempotency, or 2) cleanup at startup time.

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, wojtek-t, yuanwang04

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Oct 14, 2025
@k8s-ci-robot merged commit 4dba1ef into kubernetes:master on Oct 14, 2025
4 checks passed
@k8s-ci-robot added this to the v1.35 milestone on Oct 14, 2025
