
Conversation

@yuanwang04 (Contributor) commented Sep 22, 2025

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Sep 22, 2025
@k8s-ci-robot added the kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory), sig/node (relevant to SIG Node), and size/XL (changes 500-999 lines, ignoring generated files) labels on Sep 22, 2025

## Motivation

While KEP-5307 introduces container-level restart policies, there are scenarios where restarting the entire pod is more desirable than restarting a single container.
Member

Another request we had was from Agones: when a game session is small enough to run in seconds, restarting the game in-place is much more efficient than rescheduling.

@zmerlynn

@yuanwang04 force-pushed the restart-pod branch 6 times, most recently from b20cb15 to 4c346ba, on October 2, 2025 at 22:42

## Summary

This KEP proposes an extension to the container restart rules introduced in [KEP-5307](https://github.com/kubernetes/enhancements/issues/5307) to allow a container's exit to trigger a restart of the entire pod. This is part of the Pod / Container Restart roadmap planned earlier (see [discussion](https://docs.google.com/document/d/1UmJHJzdmMA1hWwkoP1f3rG9nS0oZ2cRcOx8rO8MsExA/edit?resourcekey=0-OuKspBji_1KJlj2JbnZkgQ&tab=t.0#heading=h.e2gx4umk9f64)). This "in-place" pod restart will terminate and then restart all of the pod's containers (including init and sidecar containers) while preserving the pod's sandbox, UID, network namespace, attached devices, and IP address. This provides a more efficient way to reset a pod's state compared to deleting and recreating the pod, which is particularly beneficial for workloads like AI/ML training, where rescheduling is costly.
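For illustration, a minimal sketch of how such a rule might look on a container spec. The `restartPolicyRules` and `exitCodes` field names are assumed from KEP-5307 and the `RestartAllContainers` action from this KEP; the exact shape here is a guess, not the final API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: registry.example.com/trainer:v1  # illustrative image
    restartPolicyRules:                      # field name assumed from KEP-5307
    - action: RestartAllContainers           # proposed action: restart every container in-place
      exitCodes:
        operator: In
        values: [42]                         # illustrative "restart the whole pod" exit code
```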
Member

How about locally-managed volumes? Will this wipe and recreate those?

In particular: If I wrote a bunch of stuff to a disk-backed emptyDir volume, will it be preserved? What about a memory-backed (tmpfs) emptyDir ?

I can see arguments in both directions, but I think "restart the pod" also means "restart all the volumes", which may require extending the volume drivers with a "Reset" method or something.

Member

Also what about other durable things like IPC namespace, which might have shared memory allocated - that should probably be torn down and reset?

How about userns mappings which might affect volume mounts?

Member

How about image pull - if I use :latest or "Always", does it do a re-pull?

Member

@msau42 can you advise what else can be broken without remounting volumes?

@rata @giuseppe can you help with user namespaces question?

I think there are two questions about mounts: whether we break something without a remount, and whether we want cleanup semantically.

I see a lot of benefit in leaving all the volumes as-is. Semantically, app owners will just know to clean those up when needed. It also opens the door for state exchange between restarts, saves time on restart, and is aligned with how the Pod's other resources are handled. We can opt volumes in to cleanup based on some flag.

But it is interesting to know what can break. My original assumption was that nothing should break, since the flow is almost the same as "all containers suddenly crashed and we are restarting them". If we know of cases that break, we may need to introduce a flag on volumes to require a remount from the beginning (or force a remount based on other conditions).
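For concreteness, one purely hypothetical shape such an opt-in could take; the `restartBehavior` field below is invented for illustration and is not part of this KEP or any existing API:

```yaml
volumes:
- name: scratch
  emptyDir: {}
  # Hypothetical, invented field: ask the kubelet to wipe and remount this volume
  # on an in-place restart instead of preserving its contents.
  restartBehavior: Recreate
```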

Member

> How about image pull - if I use :latest or "Always", does it do a re-pull?

The promise the kubelet gives is that each restart will double-check the image, so it may result in a re-pull.
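As a reminder, the cases where that check can turn into a re-pull are an explicit `imagePullPolicy: Always` or a `:latest` tag (which defaults the policy to `Always`); a small illustrative fragment:

```yaml
containers:
- name: app
  image: registry.example.com/app:latest  # :latest defaults imagePullPolicy to Always
  imagePullPolicy: Always                  # the kubelet re-checks the registry on every container start
```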

Member

@SergeyKanzhelev @thockin The userns is created with the pod sandbox. So, this should be completely fine regarding userns and mappings.

Regarding the other points, IMHO I'd start without cleaning up volumes like emptyDir and so on. That can be more involved; I'd only do it in the future if the use case arises. Those are my 2c :)

Member

Mostly I want it to be clearly spelled out, without any "what about..." questions left. As in my other comment about a diagram: that would answer most of the questions, if it is complete enough.


Many applications rely on init containers to prepare the environment, such as mounting volumes with gcsfuse or performing other setup tasks. When a container fails, a full pod restart ensures that these init containers are re-executed, guaranteeing a clean and correctly configured environment for the new set of application containers.

Another scenario is when the init container "takes the next item from the queue" and the main container exits with an indication that it wants a new item to process. See also
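For illustration only, a rough sketch of this work-queue pattern under the proposed action; the `restartPolicyRules` field name is assumed from KEP-5307, and the images, paths, and exit code are made up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: queue-worker
spec:
  restartPolicy: Never
  initContainers:
  - name: fetch-next-item                 # re-run on every in-place restart
    image: registry.example.com/queue-fetcher:v1
    volumeMounts:
    - name: work
      mountPath: /work
  containers:
  - name: worker
    image: registry.example.com/worker:v1
    volumeMounts:
    - name: work
      mountPath: /work
    restartPolicyRules:                   # field name assumed from KEP-5307
    - action: RestartAllContainers        # re-runs the init container so a new item is fetched
      exitCodes:
        operator: In
        values: [42]                      # illustrative "give me a new item" exit code
  volumes:
  - name: work
    emptyDir: {}                          # shared between the init and main containers
```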
Member

Is this a common pattern? There are certainly better ways to do this...


### Goals

- Introduce a `RestartPod` action to the `ContainerRestartRule` API.
Member

I would REALLY like to see a state-diagram or flowchart showing how restartPolicy and rules impact the lifecycle.

I whipped up https://docs.google.com/presentation/d/17SD8kGB8DrZut4CFg8zPjhD9mRb5HF3H4jS9k5uqaj8/edit?slide=id.p#slide=id.p as an example of what I am looking for. I am not sure about some of the transitions, which is exactly WHY such a diagram is needed. I am sure I am missing details that would be useful (e.g. probes).

If you made a sequential list of EVERY step that happens during pod start, I want to know how much gets rewound by this action and how much is retained.

Can I make it a condition of this KEP approval to produce such a diagram and check it in to ... SOMEWHERE useful?

Contributor Author

Definitely reasonable; I made https://docs.google.com/presentation/d/1im23lGh_TgsMKzPMPi_Epm1_TJcU9jRRnZWL-UfEp8U/edit?slide=id.p#slide=id.p to show how the new RestartAllContainers action would change the existing logic. I focused on the SyncPod action because it does not affect other parts of the pod lifecycle (e.g. creation / termination). Also, this does not include remounting the volumes.

I'm not sure what's the best way to persist this; I'll link it to the KEP afterwards. Let me know if you have any questions or concerns. Thanks!

Member

I reviewed your state diagram. Maybe I read it wrong, but something related to the init container doesn't look correct to me. The rest of the state diagram looks fine to me, and thank you for creating it. I am ok with linking this to the KEP afterward.

Contributor Author

Added some comments to the diagram. I see the confusion about init containers. If an init container fails (and is not restartable), the pod is considered Failed and does not enter the runtime.SyncPod cycle; it is handled by the termination loop instead.

@yuanwang04 changed the title from KEP-5532: Proposal for RestartPod action to KEP-5532: Proposal for RestartAllContainers action on Oct 9, 2025
@k8s-ci-robot added the size/XS label (changes 0-9 lines, ignoring generated files) and removed the size/XL label (changes 500-999 lines, ignoring generated files) on Oct 9, 2025
@thockin (Member) commented Oct 10, 2025

There are no files in this PR? I am assuming you renamed them and forgot to add?

@wojtek-t (Member)

> There are no files in this PR? I am assuming you renamed them and forgot to add?

+1 - I'm not able to review it :)

@k8s-ci-robot added the size/XL label (changes 500-999 lines, ignoring generated files) and removed the size/XS label (changes 0-9 lines, ignoring generated files) on Oct 13, 2025
@wojtek-t (Member)

/lgtm
/approve PRR

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Oct 14, 2025
@yuanwang04 (Contributor Author)

/cc @dchen1107


**Re-run with Init Containers**

Many applications rely on init containers to prepare the environment, such as mounting volumes with gcsfuse or performing other setup tasks. When a container fails, a full pod restart ensures that these init containers are re-executed, guaranteeing a clean and correctly configured environment for the new set of application containers.


We can 'guarantee' a clean state only on the assumption that initContainers are idempotent. This may well be true in many cases, but if there are side effects in the sandbox, this may result in initContainer failures causing stuck pods. Maybe an exitAction API on the onExit rule could help?

Member

Yes, the terminate actions are in the future plans.

In general, init containers have to be re-entrant because the kubelet never guarantees that it will not rerun them. Furthermore, this is an opt-in feature, and there is a way for the end user to verify that a non-re-entrant init container is not used with a pod restart rule.
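A minimal sketch of what re-entrant means in practice for an init container, assuming volumes are preserved across the in-place restart; the image, paths, and marker file are illustrative only:

```yaml
initContainers:
- name: setup
  image: busybox:1.36
  command:
  - sh
  - -c
  - |
    # Re-entrant: skip the setup if a previous run already completed it.
    if [ ! -f /work/.initialized ]; then
      echo "preparing environment" > /work/config   # stand-in for the real setup work
      touch /work/.initialized
    fi
  volumeMounts:
  - name: work
    mountPath: /work
```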

@dchen1107 (Member)

We discussed the missing remounting capability for this feature at today's SIG Node meeting. I strongly recommend deferring volume cleanup/remount and clearly documenting the new container contract. Keeping volumes as-is saves significant restart time and aligns with the concept of a single Pod sandbox. It enables use cases like shared memory for fast state exchange between application container restarts.

There might be some caveats without the remount capability, and one concern was raised about init containers. My counter-argument is that, by design, the Kubernetes documentation has always required init containers to be idempotent, since they are subject to retries on failure and re-execution on a full Pod restart.

Let's figure out the clear contract in beta: 1) idempotency, or 2) cleanup at startup time.

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, wojtek-t, yuanwang04

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Oct 14, 2025
@k8s-ci-robot merged commit 4dba1ef into kubernetes:master on Oct 14, 2025
4 checks passed
@k8s-ci-robot added this to the v1.35 milestone on Oct 14, 2025
