-
Notifications
You must be signed in to change notification settings - Fork 68
feat (postStart) : Allow debugging poststart failures with sleep by trapping errors #1522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat (postStart) : Allow debugging poststart failures with sleep by trapping errors #1522
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rohanKanojia The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
5e1a317 to
bfa0ab7
Compare
|
/ok-to-test |
bfa0ab7 to
ea21eb5
Compare
|
/ok-to-test |
ea21eb5 to
9559169
Compare
|
/ok-to-test |
fd1df4a to
542c5ff
Compare
|
/ok-to-test |
542c5ff to
9853a13
Compare
|
/ok-to-test |
9853a13 to
605efe4
Compare
|
/ok-to-test |
605efe4 to
ff5a0d9
Compare
|
/ok-to-test |
|
For some reasons my workspace is running (tested on OpenShift) oc get dw -A
NAMESPACE NAME DEVWORKSPACE ID PHASE INFO
test failing-poststart-debug-dw workspaced0882b8ed1fc4c69 Running Workspace is running |
|
|
||
| postStartDebugTrapSleepDuration := "" | ||
| if workspace.Annotations[constants.DevWorkspaceDebugStartAnnotation] == "true" { | ||
| postStartDebugTrapSleepDuration = workspace.Config.Workspace.ProgressTimeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I don't like using ProgressTimeout for this purpose.
But on the other hand I don't have another solution but some constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I use ProgressTimeout to be consistent with the behavior of the Debug annotation when it fails for the main component.
We do not scale down the failing workspace until the failing timeout is satisfied:
devworkspace-operator/controllers/workspace/devworkspace_controller.go
Lines 170 to 173 in b61eaed
| // If debug annotation is present, leave the deployment in place to let users | |
| // view logs. | |
| if workspace.Annotations[constants.DevWorkspaceDebugStartAnnotation] == "true" { | |
| if isTimeout, err := checkForFailingTimeout(workspace); err != nil { |
Inside the checkForFailingTimeout, we're parsing ProgressTimeout:
| timeout, err := time.ParseDuration(workspace.Config.Workspace.ProgressTimeout) |
pkg/library/lifecycle/poststart.go
Outdated
| #!/bin/sh | ||
| %s | ||
| EOF | ||
| chmod +x /tmp/poststart.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohanKanojia
Were you able to test this snippet?
I am not sure if chmod +x will work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really apologize for my mistake 🙏 . This seems to be a leftover from previous attempts.
I'll remove it.
@tolusha : Could you please share which OCP version you were using? I have tested it on CRC 2.53.0 with OpenShift 4.19.3. |
ff5a0d9 to
67661ea
Compare
|
@tolusha : I've created these videos based on OpenShift 4.20 via clusterbot Scenario 1 : No Poststart Timeout Configureddwo-debug-poststart-normal-scenario.mp4Scenario 2: PostStart Timeout Configureddwo-debug-poststart-poststart-timeout-configured.mp4 |
67661ea to
99ad59e
Compare
|
There is a corner case. |
99ad59e to
58c9221
Compare
|
|
||
| d, err := time.ParseDuration(durationStr) | ||
| if err != nil { | ||
| return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also log the error in this case?
log.Log.Error(err, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I’ve added the log statement as you advised.
@rohanKanojia it would be great if the INFO can state something similar to |
|
@dkwon17 : You're right about this. It would definitely improve user experience if DevWorkspace However, as you mentioned, detecting this specific state from DevWorkspace Controller may be tricky since the DevWorkspace Controller doesn't have direct visibility into the postStart execution inside the container. During the PostStart hook execution, we inject a When the postStart hook fails, the container enters a sleep state before transitioning to Failed. During this period, the DevWorkspace does not re-enter the reconciliation loop until the sleep duration completes. I haven't dig deep into it but here are some ways we might be able to handle this: Option 1 : Patch DevWorkspace from within the pod via curl:It might also be technically possible for the workspace pod to curl the Kubernetes API directly and patch its own DevWorkspace resource to signal this failure state. However, it depends on how the workspace's ServiceAccount is configured. Option 2 : Surface failure via inspecting container stateIt might be possible to add logic in controller to periodically check DevWorkspaces with
|
58c9221 to
96775f0
Compare
pkg/library/lifecycle/poststart.go
Outdated
| // cd <workingDir> | ||
| // <commandline> | ||
| func processCommandsWithoutTimeoutFallback(commands []dw.Command) (*corev1.LifecycleHandler, error) { | ||
| func processCommandsWithoutTimeoutFallback(postStartDebugTrapSleepDuration string, commands []dw.Command) (*corev1.LifecycleHandler, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func processCommandsWithoutTimeoutFallback(postStartDebugTrapSleepDuration string, commands []dw.Command) (*corev1.LifecycleHandler, error) { | |
| func processCommandsWithoutTimeoutFallback(commands []dw.Command, postStartDebugTrapSleepDuration string) (*corev1.LifecycleHandler, error) { |
Nitpick, but could we have the new parameter at the end?
pkg/library/lifecycle/poststart.go
Outdated
| // The killAfterDurationSeconds is hardcoded to 5s within this generated script. | ||
| // It conditionally prefixes the user script with the timeout command if available. | ||
| func generateScriptWithTimeout(escapedUserScript string, postStartTimeout string) string { | ||
| func generateScriptWithTimeout(postStartDebugTrapSleepDuration string, escapedUserScript string, postStartTimeout string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func generateScriptWithTimeout(postStartDebugTrapSleepDuration string, escapedUserScript string, postStartTimeout string) string { | |
| func generateScriptWithTimeout(escapedUserScript string, postStartTimeout string, postStartDebugTrapSleepDuration string) string { |
| debugTrapLine := strings.ReplaceAll(strings.TrimSpace(debugTrap), "\n", " ") | ||
|
|
||
| dwCommands = append([]string{ | ||
| "set -e", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of the set -e here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set -e makes the script exit immediately if any command fails, so the poststart hook stops on errors instead of continuing silently. This ensures failures are caught and trigger the debug trap properly. I had added it as a safeguard to make sure postStart script fails fast.
I checked if we can remove set -e and tested with this DevWorkspace:
apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
name: dig-fail-debug
annotations:
controller.devfile.io/debug-start: "true"
spec:
started: true
template:
components:
- name: tools
container:
image: quay.io/devfile/universal-developer-image:ubi9-latest
mountSources: false
command: ["tail"]
args: ["-f", "/dev/null"]
commands:
- id: poststart-wrapper
exec:
component: tools
commandLine: |
echo "Start"
wget 'https://wrongexample.com' # should fail
echo "After failure"
events:
postStart:
- poststart-wrapperWhen I tested with a version that had set -e removed, I observed that the echo "After failure" command ran after trap sleep, and Kubernetes treated the hook as successful even though wget failed.
$ oc get pods
NAME READY STATUS RESTARTS AGE
workspacef6c14b383c1240e6-7f98c9ffb7-t6m6g 1/1 Running 0 7m54s
$ kubectl exec -it pod/workspacef6c14b383c1240e6-7f98c9ffb7-t6m6g -- /bin/bash
projects $ cat /tmp/poststart-stderr.txt
--2025-11-05 09:53:47-- https://wrongexample.com/
Resolving wrongexample.com (wrongexample.com)... 172.67.177.31, 104.21.83.138, 2606:4700:3033::6815:538a, ...
Connecting to wrongexample.com (wrongexample.com)|172.67.177.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
index.html: Permission denied
Cannot write to 'index.html' (No such file or directory).
projects $ cat /tmp/poststart-stdout.txt
Start
[postStart] failure encountered, sleep for debugging
After failure
projects $ exit
exitWhen I tested with set -e , the pod transitioned to PostStartHookError error after debug trap sleep:
oc get pods
NAME READY STATUS RESTARTS AGE
workspace50b3197366ca4c18-84cfdf596c-8rfz5 0/1 PostStartHookError 0 7m43s…rapping errors Add an optional debug mechanism for postStart lifecycle hooks. When enabled via the `controller.devfile.io/debug-start: "true"` annotation, any failure in a postStart command results in the container sleeping for some seconds as per configured progressTimeout, allowing developers time to inspect the container state. - Added `enableDebugStart` parameter to poststart methods. - Injects `trap ... sleep` into postStart scripts when debug mode is enabled. - Includes support for both timeout-wrapped (`postStartTimeout`) and non-timeout lifecycle scripts. This feature improves debuggability of DevWorkspaces where postStart hooks fail and would otherwise cause container crash/restarts. Signed-off-by: Rohan Kumar <[email protected]>
96775f0 to
a1abac6
Compare
…for debug start When DevWorkspace contains 'controller.devfile.io/debug-start' annotation, set a different message for DevWorkspace Starting phase to give user indication that debug start mode is activated and they need to monitor DevWorkspace pod's logs or exec into it for debugging. Signed-off-by: Rohan Kumar <[email protected]>
|
@dkwon17 : I got some suggestions from Anatolii that instead of using tricky/brittle approach, we can slightly change the message when With new changes, a devworkspace created wth |
|
@rohanKanojia: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What does this PR do?
In #425
controller.devfile.io/debug-startannotation was added to aid in debugging failed devworkspaces: Debugging a failing workspaceWe extend the use case of this annotation so that any failure in a postStart command results in the container sleeping for a specified number of seconds, as per the configured
progressTimeout, allowing developers time to inspect the container state.enableDebugStartparameter to poststart methods.trap ... sleepinto postStart scripts when debug mode is enabled.postStartTimeout) and non-timeout lifecycle scripts.This feature improves the debuggability of DevWorkspaces where postStart hooks fail and would otherwise cause container crashes/restarts.
What issues does this PR fix or reference?
eclipse-che/che#23404
Is it tested? How?
With Changes
make docker && make installContainerCreatingphaseoc get dw NAME DEVWORKSPACE ID PHASE INFO failing-poststart-debug-dw workspace55bf350cfb754260 Starting Waiting for workspace deployment oc get pods NAME READY STATUS RESTARTS AGE workspace55bf350cfb754260-54749bf7c5-288vt 0/1 ContainerCreating 0 10s/tmp/poststart-stderr.txtto see root cause of failure:With PostStartTimeout Enabled
PostStart commands are processed slightly differently when
postStartTimeoutfield is enabled in DevWorkspaceOperatorConfig. You can verify the above flow after enabling it:Without Changes
With the current changes in the main, when we create a DevWorkspace with a failing post-start event. The pod immediately goes into
PostStartHookFailederror and thenCrashLoopbackOfferror. It doesn't allow execution into it to view failure:PR Checklist
/test v8-devworkspace-operator-e2e, v8-che-happy-pathto trigger)v8-devworkspace-operator-e2e: DevWorkspace e2e testv8-che-happy-path: Happy path for verification integration with Che