KEP-5532: Proposal for RestartAllContainers action #5556
Conversation
> ## Motivation
>
> While KEP-5307 introduces container-level restart policies, there are scenarios where restarting the entire pod is more desirable than restarting a single container.
Another request we had was from Agones. When the game is small enough to run in seconds, restarting it in place is much more efficient than rescheduling.
> ## Summary
>
> This KEP proposes an extension to the container restart rules introduced in [KEP-5307](https://github.com/kubernetes/enhancements/issues/5307) to allow a container's exit to trigger a restart of the entire pod. This is part of the Pod / Container Restart roadmap planned earlier (see [discussion](https://docs.google.com/document/d/1UmJHJzdmMA1hWwkoP1f3rG9nS0oZ2cRcOx8rO8MsExA/edit?resourcekey=0-OuKspBji_1KJlj2JbnZkgQ&tab=t.0#heading=h.e2gx4umk9f64)). This "in-place" pod restart will terminate and then restart all of the pod's containers (including init and sidecar containers) while preserving the pod's sandbox, UID, network namespace, attached devices, and IP address. This provides a more efficient way to reset a pod's state compared to deleting and recreating the pod, which is particularly beneficial for workloads like AI/ML training, where rescheduling is costly.
How about locally-managed volumes? Will this wipe and recreate those?
In particular: if I wrote a bunch of stuff to a disk-backed emptyDir volume, will it be preserved? What about a memory-backed (tmpfs) emptyDir?
I can see arguments in both directions, but I think "restart the pod" also means "restart all the volumes", which may require extending the volume drivers with a "Reset" method or something.
Also what about other durable things like IPC namespace, which might have shared memory allocated - that should probably be torn down and reset?
How about userns mappings which might affect volume mounts?
How about image pull - if I use :latest or "Always", does it do a re-pull?
@msau42 can you advise what else can break without remounting volumes?
@rata @giuseppe can you help with the user namespaces question?
I think there are two questions about mounts: whether we break something without a remount, and whether we semantically want a cleanup.
I see a lot of benefits in leaving all the volumes as-is. Semantically, app owners will just know to clean those up when needed. It also opens the door for state exchange between restarts, saves time on restart, and is aligned with how the Pod's other resources are treated. We could opt volumes in to cleanup based on some flag.
But it is interesting what can break. My original assumption was that nothing should break, since the flow is almost the same as "all containers suddenly crashed and we are restarting them". If we know of cases that break, we may need to introduce a flag on volumes to require a remount from the beginning (or force a remount based on other conditions).
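To make the "leave volumes as-is" option concrete: under that semantics, a memory-backed emptyDir such as the hypothetical one below would keep its contents across an in-place restart, which is both the state-exchange benefit and the cleanup concern being discussed (a sketch, not settled behavior):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-state-demo                  # hypothetical example
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0    # illustrative image
    volumeMounts:
    - name: scratch
      mountPath: /var/scratch
  volumes:
  - name: scratch
    emptyDir:
      medium: Memory                       # tmpfs-backed; if volumes are not remounted on an
                                           # in-place restart, data written here would survive it
```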
> How about image pull - if I use `:latest` or "Always", does it do a re-pull?

The promise kubelet gives is that each restart will double-check the image, so it may result in a re-pull.
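In other words, the existing imagePullPolicy semantics would still apply on each (re)start; a hypothetical pod like this one would have its image re-resolved, and possibly re-pulled, when all containers are restarted in place (illustrative names only):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: latest-tag-demo                       # hypothetical example
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest    # floating tag; illustrative
    imagePullPolicy: Always                   # kubelet checks the registry every time the container
                                              # starts, so an in-place restart may trigger a re-pull
```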
@SergeyKanzhelev @thockin The userns is created with the pod sandbox, so this should be completely fine regarding userns and mappings.
Regarding the other points, IMHO I'd start without cleaning up volumes like emptyDir and so on. That can be more involved; I'd only do it in the future if the use case arises. Those are my 2c :)
Mostly I want it to be clearly spelled out, without any "what about..." questions left. As in my other comment about a diagram: a complete enough diagram would answer most of these questions.
> Many applications rely on init containers to prepare the environment, such as mounting volumes with gcsfuse or performing other setup tasks. When a container fails, a full pod restart ensures that these init containers are re-executed, guaranteeing a clean and correctly configured environment for the new set of application containers.
>
> Another scenario is when an init container "takes the next item from the queue" and the main container exits with an indication that it wants a "new item" to process. See also
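For illustration, a rough sketch of that queue pattern using the rule shape from KEP-5307; the image names, the exit code, and the `RestartPod` action are placeholders rather than settled API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: queue-worker                           # hypothetical example
spec:
  restartPolicy: Never
  initContainers:
  - name: fetch-item
    image: registry.example.com/fetcher:1.0    # illustrative: writes the next work item to /work
    volumeMounts:
    - name: work
      mountPath: /work
  containers:
  - name: worker
    image: registry.example.com/worker:1.0     # illustrative: exits with code 42 to ask for a new item
    restartPolicy: Never                       # container-level policy from KEP-5307
    restartPolicyRules:                        # container-level rules from KEP-5307
    - action: RestartPod                       # proposed action; re-runs init containers, fetching the next item
      exitCodes:
        operator: In
        values: [42]
    volumeMounts:
    - name: work
      mountPath: /work
  volumes:
  - name: work
    emptyDir: {}
```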
Is this a common pattern? There are certainly better ways to do this...
> ### Goals
>
> - Introduce a `RestartPod` action to the `ContainerRestartRule` API.
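For context, a rough sketch of how a workload could opt in, reusing the `restartPolicyRules` field introduced by KEP-5307; the field layout and the action name (`RestartPod` here, `RestartAllContainers` elsewhere in this PR) are illustrative, not final:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker                       # hypothetical workload
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:1.0   # illustrative image
    restartPolicy: Never                      # container-level policy from KEP-5307
    restartPolicyRules:                       # container-level rules from KEP-5307
    - action: RestartPod                      # proposed action: restart all containers in place
      exitCodes:
        operator: In
        values: [42]                          # workload-chosen exit code meaning "reset the whole pod"
```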
I would REALLY like to see a state-diagram or flowchart showing how restartPolicy and rules impact the lifecycle.
I whipped up https://docs.google.com/presentation/d/17SD8kGB8DrZut4CFg8zPjhD9mRb5HF3H4jS9k5uqaj8/edit?slide=id.p#slide=id.p as an example of what I am looking for. I am not sure about some of the transitions, which is exactly WHY such a diagram is needed. I am sure I am missing details that would be useful (e.g. probes).
If you made a sequential list of EVERY step that happens during pod start, I want to know how much gets rewound by this action and how much is retained.
Can I make it a condition of this KEP approval to produce such a diagram and check it in to ... SOMEWHERE useful?
Definitely reasonable; I made https://docs.google.com/presentation/d/1im23lGh_TgsMKzPMPi_Epm1_TJcU9jRRnZWL-UfEp8U/edit?slide=id.p#slide=id.p to show how the new RestartAllContainers action would change the existing logic. I focused on the SyncPod flow because this action does not affect other parts of the pod lifecycle (e.g. creation / termination). Also, this does not include remounting the volumes.
I'm not sure what the best way to persist this is; I'll link it from the KEP afterwards. Let me know if you have any questions or concerns. Thanks!
I reviewed your state diagram. Maybe I read it wrongly, but something related to the init containers doesn't look correct to me. The rest of the state diagram looks fine to me, and thank you for creating it. I am OK with linking this to the KEP afterward.
Added some comments to the diagram. I see the confusion about init containers: if init containers fail (and are not restartable), the pod is considered Failed and does not enter the runtime.SyncPod cycle; it will be handled by the termination loop.
There are no files in this PR? I am assuming you renamed them and forgot to add?

+1 - I'm not able to review it :)
/lgtm

/cc @dchen1107
> **Re-run with Init Containers**
>
> Many applications rely on init containers to prepare the environment, such as mounting volumes with gcsfuse or performing other setup tasks. When a container fails, a full pod restart ensures that these init containers are re-executed, guaranteeing a clean and correctly configured environment for the new set of application containers.
We can 'guarantee' a clean state only under the assumption that init containers are idempotent. That may well be true in many cases, but if there are side effects in the sandbox, this may result in init container failures causing stuck pods. Maybe an exitAction API on the onExit rule could help?
Yes, the terminate actions are in the future plans.
In general, init containers have to be reentrant because the kubelet never guarantees that it will not rerun them. Furthermore, this is an opt-in feature, and there is a way for the end user to verify that a non-reentrant init container is not used with restartPod.
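As a hedged illustration of what "reentrant" means here, an init container written along these lines (commands and paths are made up) can safely run again after an in-place restart:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: reentrant-init-demo                # hypothetical example
spec:
  initContainers:
  - name: prepare
    image: busybox:1.36
    command:
    - sh
    - -c
    - |
      # Idempotent setup: every step tolerates having run before.
      mkdir -p /work/output                # no error if the directory already exists
      rm -f /work/output/*.tmp             # clear leftovers from a previous run instead of assuming a clean slate
    volumeMounts:
    - name: work
      mountPath: /work
  containers:
  - name: app
    image: registry.example.com/app:1.0    # illustrative image
    volumeMounts:
    - name: work
      mountPath: /work
  volumes:
  - name: work
    emptyDir: {}
```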
We discussed the missing remounting capability for this feature at today's SIG Node meeting. I strongly recommend deferring volume cleanup/re-mount and clearly documenting the new container contract. Keeping volumes as-is saves significant restart time and aligns with the concept of a single Pod sandbox. It enables use cases like shared memory for fast state exchange between application container restarts. There might be some caveats without remount capability, and one concern was raised about init containers. My counter-argument is that, by design, the Kubernetes documentation has always required init containers to be idempotent, since they are subject to retries on failure and re-execution on a full Pod restart. Let's figure out the clear contract in beta: 1) idempotency, or 2) cleanup at startup time.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dchen1107, wojtek-t, yuanwang04