Timeout on probe using container command doesn't work. Will CRI-O fix it? #17340

Open
GrahamDumpleton opened this issue Nov 16, 2017 · 3 comments
Labels: component/apps, kind/question, lifecycle/frozen, lifecycle/stale, priority/P2

Comments

GrahamDumpleton commented Nov 16, 2017

I was told a long time ago that this cannot be fixed as long as Docker is used, so an issue may never have been created for it. I am creating one now because I want to know whether CRI-O will fix it.

The problem is that when you use a container command as a readiness or liveness probe, the probe's timeout value has no effect. That is, if the probe hangs or takes a long time to run, it is not marked as failed when the timeout expires, nor is the probe's command killed.

If the probe merely takes a long time to run, then once it returns, the normal period between probes elapses and the probe is run again. If, however, the probe hangs and never returns, it is never marked as failed and no subsequent probe ever runs. So although the probe is effectively failing, the pod is never marked as failed and is never restarted.

I have been told this problem can't be fixed because docker exec doesn't support a timeout value, so there is no way to interrupt the command.

Is this going to be fixed by CRI-O, or are timeouts on probes that use a container command never going to be supported?

Right now I have been advising people to implement their own timeout on probe execution in their probe script, or simply to avoid container commands for probes.
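As a rough sketch of the "implement your own timeout" workaround, a probe script can bound its own runtime with the `timeout` utility. This assumes GNU coreutils `timeout` is available in the container image; the names `run_probe` and `PROBE_TIMEOUT` are made up for illustration and do not come from OpenShift or Kubernetes.

```shell
#!/bin/sh
# Hypothetical probe wrapper (illustrative names, not an OpenShift API).
# Assumes GNU coreutils `timeout` exists inside the container image.

# Upper bound, in seconds, on how long the real check may run.
PROBE_TIMEOUT="${PROBE_TIMEOUT:-1}"

run_probe() {
    # `timeout` sends SIGTERM to the command once the limit elapses and
    # then exits with status 124; if the command finishes in time, its
    # own exit status is passed through unchanged.
    timeout "$PROBE_TIMEOUT" "$@"
}
```

Under the same assumption, the simplest form is to make `timeout 1 <real check>` the probe command itself, so that a hung check ends with a non-zero exit status rather than blocking indefinitely.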

Version

oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
features: Basic-Auth

Server https://api.pro-us-east-1.openshift.com:443
openshift v3.6.173.0.21
kubernetes v1.6.1+5115d708d7

Steps To Reproduce

Create a liveness probe which uses a container command and have the command run sleep for a long period.

For example:

oc set probe dc/blog --liveness -- sleep 300

Current Result

Nothing happens after the probe's default one-second timeout expires, and no event is recorded indicating that the probe failed. You can get into the container and see that the first probe is still running:

$ ps aux | grep sleep
1004820+     49  0.0  0.0   5888   612 ?        Ss   01:18   0:00 sleep 300
1004820+     99  0.0  0.0  10648   968 ?        S+   01:21   0:00 grep sleep

The process ID of the sleep never changes, so it is the same process: it is not being killed, and no subsequent probe is run, at least not until the sleep finishes.

Expected Result

The probe should register a failure after one second. After two subsequent failures, the pod should be restarted.
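To make the expected behaviour concrete, here is a small local simulation of it. This is only an illustration, not kubelet code: `check_liveness`, `TIMEOUT_SECONDS` and `FAILURE_THRESHOLD` are hypothetical names, the wait between probes is left out, and GNU coreutils `timeout` is assumed to be installed. One failure followed by two more gives the threshold of three used here.

```shell
#!/bin/sh
# Illustrative simulation only; this is not how the kubelet is implemented.
# check_liveness, TIMEOUT_SECONDS and FAILURE_THRESHOLD are hypothetical names.
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-1}"
FAILURE_THRESHOLD="${FAILURE_THRESHOLD:-3}"

check_liveness() {
    failures=0
    while [ "$failures" -lt "$FAILURE_THRESHOLD" ]; do
        # Each attempt is cut off after TIMEOUT_SECONDS; a timeout counts
        # as a failure just like a non-zero exit status would.
        if timeout "$TIMEOUT_SECONDS" "$@"; then
            echo "probe succeeded"
            return 0
        fi
        failures=$((failures + 1))
        echo "probe failed ($failures/$FAILURE_THRESHOLD)"
    done
    echo "failure threshold reached: container would be restarted"
    return 1
}
```

Running `check_liveness sleep 300` with these defaults reports three failed attempts, about one second apart, and then returns 1, which is the outcome the reproduction case above should produce but does not.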

Additional Information

None.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Feb 25, 2018

@GrahamDumpleton
Author

/lifecycle frozen

@openshift-ci-robot added the lifecycle/frozen label on Feb 26, 2018

@sjenning
Contributor

@mrunalp @runcom any information here?

@sjenning assigned mrunalp and runcom and unassigned sjenning on Feb 27, 2018