@rohanKanojia
Member

@rohanKanojia rohanKanojia commented Oct 8, 2025

What does this PR do?

In #425, the controller.devfile.io/debug-start annotation was added to aid in debugging failed DevWorkspaces; see "Debugging a failing workspace".

We extend the use case of this annotation so that any failure in a postStart command results in the container sleeping for a specified number of seconds, as per the configured progressTimeout, allowing developers time to inspect the container state.

  • Added enableDebugStart parameter to poststart methods.
  • Injects trap ... sleep into postStart scripts when debug mode is enabled.
  • Includes support for both timeout-wrapped (postStartTimeout) and non-timeout lifecycle scripts.

This feature improves the debuggability of DevWorkspaces where postStart hooks fail and would otherwise cause container crashes/restarts.
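
As a rough illustration of the injection described above, the following Go sketch builds a postStart script and prepends `set -e` plus an ERR trap when a debug sleep is requested. The helper name `buildPostStartScript` and its signature are assumptions made for this example, not the operator's actual code; the trap text mirrors the one visible in the `ps` output in the test steps of this description.

```go
package main

import (
	"fmt"
	"strings"
)

// buildPostStartScript joins the postStart command lines into one shell
// script. When debugSleepSeconds > 0 it prepends `set -e` and an ERR trap so
// that a failing command keeps the container alive for inspection.
// Hypothetical helper for illustration; not the operator's actual function.
func buildPostStartScript(commandLines []string, debugSleepSeconds int) string {
	lines := []string{"#!/bin/sh"}
	if debugSleepSeconds > 0 {
		trap := fmt.Sprintf(
			`trap 'echo "[postStart] failure encountered, sleep for debugging"; sleep %d' ERR`,
			debugSleepSeconds,
		)
		lines = append(lines, "set -e", trap)
	}
	lines = append(lines, commandLines...)
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	// Print the script that would be generated for the failing command
	// used in the test steps, with a one-hour debug sleep.
	fmt.Print(buildPostStartScript([]string{"ls idontexist"}, 3600))
}
```

Note that `trap ... ERR` is a bash/ksh extension rather than POSIX `sh`, so this sketch assumes the container's `/bin/sh` supports it.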

What issues does this PR fix or reference?

eclipse-che/che#23404

Is it tested? How?

With Changes

  1. Check out the code changes added in this PR
  2. Deploy the DevWorkspace Operator to a Kubernetes/OpenShift cluster: make docker && make install
  3. Create a DevWorkspace that has a failing postStart command
oc apply -f - <<EOF
apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: failing-poststart-debug-dw
  annotations:
    controller.devfile.io/debug-start: "true"
spec:
  started: true
  template:
    components:
      - name: tools
        container:
          image: quay.io/wto/web-terminal-tooling:next
          sourceMapping: /projects
          command: [ "tail" ]
          args: [ "-f", "/dev/null" ]
    commands:
      - id: failing-command
        exec:
          commandLine: ls idontexist
          component: tools
    events:
      postStart:
        - failing-command
EOF
  4. After creating the DevWorkspace, observe its pod status. It should stay in the ContainerCreating phase:
oc get dw                                                                           
NAME                         DEVWORKSPACE ID             PHASE      INFO
failing-poststart-debug-dw   workspace55bf350cfb754260   Starting   Waiting for workspace deployment
oc get pods                                                                          
NAME                                         READY   STATUS              RESTARTS   AGE
workspace55bf350cfb754260-54749bf7c5-288vt   0/1     ContainerCreating   0          10s
  5. You should be able to exec into the pod and read /tmp/poststart-stderr.txt to see the root cause of the failure:
oc get pods                                                                         
NAME                                         READY   STATUS              RESTARTS   AGE
workspace55bf350cfb754260-54749bf7c5-288vt   0/1     ContainerCreating   0          14s
kubectl exec -it workspace55bf350cfb754260-54749bf7c5-288vt -- /bin/bash            
bash-4.4$ cat /tmp/poststart-stderr.txt 
ls: cannot access 'idontexist': No such file or directory
  6. Verify that the sleep process is active in the container:
ps -ax | grep sleep
      2 ?        Ss     0:00 /bin/sh -c { cat << 'EOF' > /tmp/poststart.sh #!/bin/sh set -e trap 'echo "[postStart] failure encountered, sleep for debugging"; sleep 3600' ERR ls idontexist EOF chmod +x /tmp/poststart.sh /tmp/poststart.sh  } 1>/tmp/poststart-stdout.txt 2>/tmp/poststart-stderr.txt 
      7 ?        S      0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 3600
     19 pts/0    S+     0:00 grep sleep
With PostStartTimeout Enabled

PostStart commands are processed slightly differently when the postStartTimeout field is set in the DevWorkspaceOperatorConfig. You can verify the above flow after enabling it:

oc patch devworkspaceoperatorconfig devworkspace-operator-config -n openshift-operators --type=merge -p '{"config": {"workspace": {"postStartTimeout": "5m"}}}'

## Repeat steps 3-6
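
For orientation, here is a minimal Go sketch of what a timeout-wrapped postStart script could look like. It assumes the wrapper uses the `timeout` command only when it is available and kills the script 5 seconds after the timeout fires (the 5s value is taken from a code comment quoted later in this review). The helper names `shellQuote` and `wrapWithTimeout` are hypothetical, and the generated shell is an illustration rather than the operator's real output.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes, escaping embedded single quotes so the
// result is passed to the shell as one literal word.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// wrapWithTimeout returns a shell snippet that runs script under `timeout`
// when that command exists and runs it directly otherwise.
// Hypothetical helper; the 5-second kill-after value is an assumption based
// on the review discussion.
func wrapWithTimeout(script string, timeoutSeconds int) string {
	quoted := shellQuote(script)
	return fmt.Sprintf(
		"if command -v timeout >/dev/null 2>&1; then\n"+
			"  timeout -k 5 %d /bin/sh -c %s\n"+
			"else\n"+
			"  /bin/sh -c %s\n"+
			"fi\n",
		timeoutSeconds, quoted, quoted,
	)
}

func main() {
	// 300 seconds corresponds to the "5m" postStartTimeout patched above.
	fmt.Print(wrapWithTimeout("ls idontexist", 300))
}
```

In debug mode the wrapped script would also carry the injected trap, so the same steps 3-6 apply.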

Without Changes

On the current main branch, when we create a DevWorkspace with a failing postStart event, the pod immediately goes into the PostStartHookFailed error and then CrashLoopBackOff. It is not possible to exec into the container to view the failure:

# Create DevWorkspace with failing poststart
oc apply -f - <<EOF
apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: failing-poststart-debug-dw
  annotations:
    controller.devfile.io/debug-start: "true"
spec:
  started: true
  template:
    components:
      - name: tools
        container:
          image: quay.io/wto/web-terminal-tooling:next
          sourceMapping: /projects
          command: [ "tail" ]
          args: [ "-f", "/dev/null" ]
    commands:
      - id: failing-command
        exec:
          commandLine: ls idontexist
          component: tools
    events:
      postStart:
        - failing-command
EOF

oc get dw                                                                            
NAME                         DEVWORKSPACE ID             PHASE     INFO
failing-poststart-debug-dw   workspace7ad9a94285b94f7c   Failing   Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands failed (Kubelet reported exit code 2)

oc get pods                                                                          
NAME                                         READY   STATUS             RESTARTS      AGE
workspace7ad9a94285b94f7c-579896cc48-wmtrj   0/1     CrashLoopBackOff   1 (20s ago)   50s
kubectl exec -it workspace7ad9a94285b94f7c-579896cc48-wmtrj -- /bin/bash             
error: unable to upgrade connection: container not found ("tools")

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

@openshift-ci

openshift-ci bot commented Oct 8, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci bot commented Oct 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rohanKanojia
Once this PR has been reviewed and has the lgtm label, please assign dkwon17 for approval. For more information see the Code Review Process.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch 3 times, most recently from 5e1a317 to bfa0ab7 Compare October 8, 2025 18:44
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from bfa0ab7 to ea21eb5 Compare October 9, 2025 03:50
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from ea21eb5 to 9559169 Compare October 9, 2025 09:22
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 542c5ff to 9853a13 Compare October 16, 2025 09:14
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 9853a13 to 605efe4 Compare October 16, 2025 11:52
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 605efe4 to ff5a0d9 Compare October 16, 2025 15:37
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia marked this pull request as ready for review October 16, 2025 15:54
@tolusha
Contributor

tolusha commented Oct 23, 2025

For some reason, my workspace is running (tested on OpenShift):

oc get dw    -A
NAMESPACE   NAME                         DEVWORKSPACE ID             PHASE     INFO
test        failing-poststart-debug-dw   workspaced0882b8ed1fc4c69   Running   Workspace is running


postStartDebugTrapSleepDuration := ""
if workspace.Annotations[constants.DevWorkspaceDebugStartAnnotation] == "true" {
	postStartDebugTrapSleepDuration = workspace.Config.Workspace.ProgressTimeout
Contributor

To be honest, I don't like using ProgressTimeout for this purpose.
On the other hand, I don't have another solution apart from some constant.

Member Author

Actually, I use ProgressTimeout to be consistent with the behavior of the Debug annotation when it fails for the main component.

We do not scale down the failing workspace until the failing timeout has elapsed:

// If debug annotation is present, leave the deployment in place to let users
// view logs.
if workspace.Annotations[constants.DevWorkspaceDebugStartAnnotation] == "true" {
	if isTimeout, err := checkForFailingTimeout(workspace); err != nil {

Inside the checkForFailingTimeout, we're parsing ProgressTimeout:

timeout, err := time.ParseDuration(workspace.Config.Workspace.ProgressTimeout)
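
To make the consistency argument concrete, here is a small Go sketch of a timeout check driven by the same `ProgressTimeout` string. The function `failingTimeoutElapsed` is a hypothetical stand-in written for this comment; the real `checkForFailingTimeout` takes the workspace object and may differ in how it determines the elapsed time.

```go
package main

import (
	"fmt"
	"time"
)

// failingTimeoutElapsed reports whether the configured progress timeout has
// passed, given how long the workspace has been failing.
// Hypothetical stand-in for illustration only.
func failingTimeoutElapsed(progressTimeout string, elapsed time.Duration) (bool, error) {
	timeout, err := time.ParseDuration(progressTimeout)
	if err != nil {
		return false, fmt.Errorf("invalid progressTimeout %q: %w", progressTimeout, err)
	}
	return elapsed >= timeout, nil
}

func main() {
	for _, elapsed := range []time.Duration{2 * time.Minute, 6 * time.Minute} {
		timedOut, err := failingTimeoutElapsed("5m", elapsed)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("elapsed=%s timedOut=%t\n", elapsed, timedOut)
	}
}
```

Using the same duration for the trap sleep means the container stays inspectable for about as long as the deployment is left in place.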

#!/bin/sh
%s
EOF
chmod +x /tmp/poststart.sh
Contributor

@tolusha tolusha Oct 23, 2025

@rohanKanojia
Were you able to test this snippet?
I am not sure if chmod +x will work

Member Author

I really apologize for my mistake 🙏 . This seems to be a leftover from previous attempts.

I'll remove it.

@rohanKanojia
Member Author

For some reason, my workspace is running (tested on OpenShift)

@tolusha : Could you please share which OCP version you were using? I have tested it on CRC 2.53.0 with OpenShift 4.19.3.

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from ff5a0d9 to 67661ea Compare October 25, 2025 10:45
@rohanKanojia
Member Author

@tolusha : I've created these videos based on OpenShift 4.20 via clusterbot

Scenario 1: No PostStart Timeout Configured

dwo-debug-poststart-normal-scenario.mp4

Scenario 2: PostStart Timeout Configured

dwo-debug-poststart-poststart-timeout-configured.mp4

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 67661ea to 99ad59e Compare October 25, 2025 11:43
@tolusha
Contributor

tolusha commented Oct 27, 2025

There is a corner case:
when the script already defines its own trap, the added one is ignored.
I think we can keep it as is.

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 99ad59e to 58c9221 Compare October 29, 2025 04:27

d, err := time.ParseDuration(durationStr)
if err != nil {
	return 0
Collaborator

@dkwon17 dkwon17 Nov 3, 2025

Could we also log the error in this case?

log.Log.Error(err, ...

Member Author

Thanks, I’ve added the log statement as you advised.
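
For readers following the thread, a minimal sketch of the parse-and-log pattern discussed here. The helper name `debugSleepSeconds` is an assumption, and the standard `log` package stands in for the operator's logger (`log.Log.Error(err, ...)` in the suggestion above).

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// debugSleepSeconds converts a duration string such as "5m" into whole
// seconds for the injected `sleep`. It returns 0 (no debug sleep) for an
// empty, invalid, or non-positive duration, and logs the error when parsing
// fails. Hypothetical helper for illustration.
func debugSleepSeconds(durationStr string) int {
	if durationStr == "" {
		return 0
	}
	d, err := time.ParseDuration(durationStr)
	if err != nil {
		log.Printf("failed to parse debug sleep duration %q: %v", durationStr, err)
		return 0
	}
	if d <= 0 {
		return 0
	}
	return int(d.Seconds())
}

func main() {
	for _, s := range []string{"5m", "1h", "", "not-a-duration"} {
		fmt.Printf("%q -> %d\n", s, debugSleepSeconds(s))
	}
}
```

Returning 0 on error keeps the hook behaviour unchanged when the configuration is malformed, while the log entry makes the misconfiguration visible.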

@dkwon17
Collaborator

dkwon17 commented Nov 3, 2025

oc get dw                                                                           
NAME                         DEVWORKSPACE ID             PHASE      INFO
failing-poststart-debug-dw   workspace55bf350cfb754260   Starting   Waiting for workspace deployment

@rohanKanojia it would be great if INFO could state something like "Post start hook failed, sleeping for <progressTimeout>" to let the user know that the failure has been detected, but I think this is a bit difficult to detect. WDYT?

@rohanKanojia
Member Author

@dkwon17 : You're right about this. It would definitely improve the user experience if the DevWorkspace INFO field indicated that the postStart hook failed and the container is sleeping for debugging.

However, as you mentioned, detecting this specific state from DevWorkspace Controller may be tricky since the DevWorkspace Controller doesn't have direct visibility into the postStart execution inside the container.

During the PostStart hook execution, we inject a sleep between the following DevWorkspace states:

(Starting) ---> (if postStart hook failed, inject sleep) ---> (Failed)

When the postStart hook fails, the container enters a sleep state before transitioning to Failed. During this period, the DevWorkspace does not re-enter the reconciliation loop until the sleep duration completes.

I haven't dug deep into it, but here are some ways we might be able to handle this:

Option 1: Patch the DevWorkspace from within the pod via curl

It might also be technically possible for the workspace pod to curl the Kubernetes API directly and patch its own DevWorkspace resource to signal this failure state. However, it depends on how the workspace's ServiceAccount is configured.

⚠️ I'm not sure standard DevWorkspaces would have the necessary RBAC permissions to patch or update their own DevWorkspace resource.

Option 2: Surface the failure by inspecting container state

It might be possible to add logic in the controller to periodically check DevWorkspaces with the debug-start annotation:

  • For each workspace with the annotation, the controller would inspect the associated pod.
  • If exec is available and a sleep process injected by the postStart hook is detected, the controller can immediately update the workspace status to indicate that the postStart hook failed and the container is sleeping for debugging.
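
A rough sketch of the detection step in Option 2, assuming the controller has already obtained `ps -ax` output from the container (how it would exec into the pod is left out). The function `hasDebugSleep` is hypothetical; it looks for a process whose command ends in `sleep <seconds>`, which skips the wrapper shell and `grep` lines that merely mention `sleep` in their arguments.

```go
package main

import (
	"bufio"
	"fmt"
	"path"
	"strconv"
	"strings"
)

// hasDebugSleep reports whether ps-style output contains a process whose last
// two fields are a `sleep` binary and the expected number of seconds.
// Hypothetical helper for illustration only.
func hasDebugSleep(psOutput string, sleepSeconds int) bool {
	seconds := strconv.Itoa(sleepSeconds)
	scanner := bufio.NewScanner(strings.NewReader(psOutput))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		n := len(fields)
		if n >= 2 && path.Base(fields[n-2]) == "sleep" && fields[n-1] == seconds {
			return true
		}
	}
	return false
}

func main() {
	// Shortened version of the ps output shown in the PR description.
	ps := `      2 ?        Ss     0:00 /bin/sh -c { ... sleep 3600' ERR ls idontexist ... } 1>/tmp/poststart-stdout.txt 2>/tmp/poststart-stderr.txt
      7 ?        S      0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 3600
     19 pts/0    S+     0:00 grep sleep`
	fmt.Println("debug sleep detected:", hasDebugSleep(ps, 3600))
}
```

This kind of process matching is brittle (a user command could also run `sleep` with the same argument), which is part of why it is only a possible direction.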

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 58c9221 to 96775f0 Compare November 4, 2025 13:43
// cd <workingDir>
// <commandline>
-func processCommandsWithoutTimeoutFallback(commands []dw.Command) (*corev1.LifecycleHandler, error) {
+func processCommandsWithoutTimeoutFallback(postStartDebugTrapSleepDuration string, commands []dw.Command) (*corev1.LifecycleHandler, error) {
Collaborator

Suggested change
-func processCommandsWithoutTimeoutFallback(postStartDebugTrapSleepDuration string, commands []dw.Command) (*corev1.LifecycleHandler, error) {
+func processCommandsWithoutTimeoutFallback(commands []dw.Command, postStartDebugTrapSleepDuration string) (*corev1.LifecycleHandler, error) {

Nitpick, but could we have the new parameter at the end?

// The killAfterDurationSeconds is hardcoded to 5s within this generated script.
// It conditionally prefixes the user script with the timeout command if available.
-func generateScriptWithTimeout(escapedUserScript string, postStartTimeout string) string {
+func generateScriptWithTimeout(postStartDebugTrapSleepDuration string, escapedUserScript string, postStartTimeout string) string {
Collaborator

Suggested change
-func generateScriptWithTimeout(postStartDebugTrapSleepDuration string, escapedUserScript string, postStartTimeout string) string {
+func generateScriptWithTimeout(escapedUserScript string, postStartTimeout string, postStartDebugTrapSleepDuration string) string {

debugTrapLine := strings.ReplaceAll(strings.TrimSpace(debugTrap), "\n", " ")

dwCommands = append([]string{
	"set -e",
Collaborator

What is the purpose of the set -e here?

Member Author

set -e makes the script exit immediately if any command fails, so the poststart hook stops on errors instead of continuing silently. This ensures failures are caught and trigger the debug trap properly. I had added it as a safeguard to make sure postStart script fails fast.

I checked if we can remove set -e and tested with this DevWorkspace:

apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: dig-fail-debug
  annotations:
    controller.devfile.io/debug-start: "true"
spec:
  started: true
  template:
    components:
      - name: tools
        container:
          image: quay.io/devfile/universal-developer-image:ubi9-latest
          mountSources: false
          command: ["tail"]
          args: ["-f", "/dev/null"]
    commands:
      - id: poststart-wrapper
        exec:
          component: tools
          commandLine: |
            echo "Start"
            wget 'https://wrongexample.com'  # should fail 
            echo "After failure"
    events:
      postStart:
        - poststart-wrapper

When I tested with a version that had set -e removed, I observed that the echo "After failure" command ran after trap sleep, and Kubernetes treated the hook as successful even though wget failed.

$ oc get pods
NAME                                         READY   STATUS    RESTARTS   AGE
workspacef6c14b383c1240e6-7f98c9ffb7-t6m6g   1/1     Running   0          7m54s
$ kubectl exec -it pod/workspacef6c14b383c1240e6-7f98c9ffb7-t6m6g -- /bin/bash
projects $ cat /tmp/poststart-stderr.txt
--2025-11-05 09:53:47--  https://wrongexample.com/
Resolving wrongexample.com (wrongexample.com)... 172.67.177.31, 104.21.83.138, 2606:4700:3033::6815:538a, ...
Connecting to wrongexample.com (wrongexample.com)|172.67.177.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
index.html: Permission denied

Cannot write to 'index.html' (No such file or directory).
projects $ cat /tmp/poststart-stdout.txt
Start
[postStart] failure encountered, sleep for debugging
After failure
projects $ exit
exit

When I tested with set -e, the pod transitioned to PostStartHookError after the debug trap sleep:

oc get pods
NAME                                         READY   STATUS               RESTARTS   AGE
workspace50b3197366ca4c18-84cfdf596c-8rfz5   0/1     PostStartHookError   0          7m43s

…rapping errors

Add an optional debug mechanism for postStart lifecycle hooks. When enabled via the
`controller.devfile.io/debug-start: "true"` annotation, any failure in a postStart command results in the container sleeping for some seconds as per configured progressTimeout, allowing developers time to inspect the container state.

- Added `enableDebugStart` parameter to poststart methods.
- Injects `trap ... sleep` into postStart scripts when debug mode is enabled.
- Includes support for both timeout-wrapped (`postStartTimeout`) and non-timeout lifecycle scripts.

This feature improves debuggability of DevWorkspaces where postStart hooks fail and would otherwise cause container crash/restarts.

Signed-off-by: Rohan Kumar <[email protected]>
@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 96775f0 to a1abac6 Compare November 5, 2025 09:15
…for debug start

When DevWorkspace contains 'controller.devfile.io/debug-start' annotation,
set a different message for DevWorkspace Starting phase to give user indication
that debug start mode is activated and they need to monitor DevWorkspace pod's
logs or exec into it for debugging.

Signed-off-by: Rohan Kumar <[email protected]>
@rohanKanojia
Member Author

rohanKanojia commented Nov 6, 2025

@dkwon17 : I got a suggestion from Anatolii: instead of using a tricky/brittle approach, we can slightly change the message when the debug-start annotation is enabled.

With the new changes, a DevWorkspace created with the debug-start annotation would have a status like this:

NAMESPACE             NAME                    DEVWORKSPACE ID             PHASE      INFO
openshift-operators   failing-post-start-ws   workspace6734b18d651d486d   Starting   Debug mode: failed postStart commands will be trapped; inspect logs/exec to debug
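
The message switch described in this comment can be pictured with a short Go sketch. The function `startingInfoMessage` and the constant name are hypothetical; the two message strings are copied from the `oc get dw` outputs shown in this conversation.

```go
package main

import "fmt"

// debugStartAnnotation is the annotation key discussed in this PR.
const debugStartAnnotation = "controller.devfile.io/debug-start"

// startingInfoMessage picks the INFO text for the Starting phase depending on
// whether debug-start is enabled. Hypothetical helper for illustration.
func startingInfoMessage(annotations map[string]string) string {
	if annotations[debugStartAnnotation] == "true" {
		return "Debug mode: failed postStart commands will be trapped; inspect logs/exec to debug"
	}
	return "Waiting for workspace deployment"
}

func main() {
	fmt.Println(startingInfoMessage(map[string]string{debugStartAnnotation: "true"}))
	fmt.Println(startingInfoMessage(nil))
}
```

This avoids any in-container detection: the message depends only on the annotation, not on whether a postStart command has actually failed yet.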

@openshift-ci

openshift-ci bot commented Nov 6, 2025

@rohanKanojia: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                               Commit    Details   Required   Rerun command
ci/prow/v14-devworkspace-operator-e2e   3a96b2b   link      true       /test v14-devworkspace-operator-e2e


3 participants