
## Meaning

This alert triggers when a virtual machine (VM) is in an unhealthy state
for more than 10 minutes and does not have an associated VMI
(VirtualMachineInstance).

This typically occurs during the initial phases of VM startup, when
OpenShift Virtualization is trying to provision resources, pull images, or
schedule the workload.

**Affected states:**
- `Provisioning` - The VM is preparing resources (DataVolumes, PVCs)
- `Starting` - VM is attempting to start but no VMI exists yet
- `Terminating` - VM is being deleted but without an active VMI
- `Error` - Various scheduling, image, or resource allocation errors
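
To confirm which state the VM reports and whether a VMI exists, a quick check
(a sketch; `printableStatus` is the standard status field on the
VirtualMachine resource):

```bash
# Show the VM's reported state
$ oc get vm <vm-name> -n <namespace> \
  -o jsonpath='{.status.printableStatus}{"\n"}'

# Check whether a corresponding VMI exists; a "NotFound" error matches the
# condition this alert describes
$ oc get vmi <vm-name> -n <namespace>
```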

## Impact

- **Severity:** Warning
- **User impact:** VMs cannot start or are stuck in error states
- **Business impact:** Workloads cannot be deployed, which affects application
availability

## Possible causes

### Resource-related issues
- Insufficient cluster resources (CPU, memory, storage)
- Missing or misconfigured storage classes
- PVC provisioning failures
- DataVolume creation/import failures
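
A quick scan for stuck storage objects (a sketch; the `grep` filters simply
hide healthy objects):

```bash
# PVCs that are not yet Bound
$ oc get pvc -n <namespace> | grep -v Bound

# DataVolumes that have not reached the Succeeded phase
$ oc get datavolume -n <namespace> | grep -v Succeeded
```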

### Image and registry issues
- Container image pull failures for containerDisk volumes
- Registry authentication problems
- Network connectivity issues to image registries
- Missing or corrupted VM disk images
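
If registry authentication is the suspect, check which pull secrets the
namespace actually references (a sketch; secret names vary per cluster):

```bash
# Pull secrets attached to the default service account
$ oc get sa default -n <namespace> -o jsonpath='{.imagePullSecrets}{"\n"}'

# All registry-credential secrets in the namespace
$ oc get secrets -n <namespace> \
  --field-selector type=kubernetes.io/dockerconfigjson
```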

### Scheduling and node issues
- No schedulable nodes available (all nodes cordoned/unschedulable)
- Missing node capabilities, such as KVM or GPU devices, on available nodes
- A mismatch between requested and available CPU models
- Node selector constraints cannot be satisfied
- Taints and tolerations preventing scheduling
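
To spot cordoned nodes and taints quickly (a sketch; `spec.unschedulable` is
a supported node field selector):

```bash
# Nodes marked unschedulable (cordoned)
$ oc get nodes --field-selector spec.unschedulable=true

# Taints on each node
$ oc get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```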

### Configuration issues
- Invalid VM specifications (malformed YAML, unsupported features)
- Missing required Secrets or ConfigMaps
- Incorrect resource requests/limits
- Network configuration errors
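
Server-side validation catches many specification errors before the VM is
ever scheduled (a sketch, assuming the VM manifest is available as a file; the
API server runs KubeVirt's admission webhooks without persisting anything):

```bash
# Validate the manifest against the live API server
$ oc apply -f <vm-manifest>.yaml --dry-run=server
```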

## Diagnosis

1. Check VM status and events
```bash
# Get VM details and status
$ oc get vm <vm-name> -n <namespace> -o yaml

# Check VM events for error messages
$ oc describe vm <vm-name> -n <namespace>

# Look for related events in the namespace
$ oc get events -n <namespace> --sort-by='.lastTimestamp'
```

2. Verify resource availability
```bash
# Check node resources and schedulability
$ oc get nodes -o wide
$ oc describe nodes

# Check storage classes and provisioners
$ oc get storageclass
$ oc get pv,pvc -n <namespace>

# For DataVolumes (if using)
$ oc get datavolume -n <namespace>
$ oc describe datavolume <dv-name> -n <namespace>
```

3. Check image availability (for containerDisk)
```bash
# If using containerDisk, verify image accessibility from the affected node
# Start a debug session on the node hosting the VM (or a representative node)
$ oc debug node/<node-name> -it --image=busybox

# Inside the debug pod, check which container runtime is used
$ ps aux | grep -E "(containerd|dockerd|crio)"

# For CRI-O/containerd clusters use crictl to pull the image
$ crictl pull <vm-disk-image>

# For Docker-based clusters (less common)
$ docker pull <vm-disk-image>

# Exit the debug session when done
$ exit

# Check image pull secrets if required
$ oc get secrets -n <namespace>
```

4. Verify OpenShift Virtualization configuration
```bash
# Discover the KubeVirt installation namespace
$ export NAMESPACE="$(oc get kubevirt -A -o custom-columns="":.metadata.namespace)"

# Check KubeVirt CR conditions (expect Available=True)
$ oc get kubevirt -n "$NAMESPACE" \
-o jsonpath='{range .items[*].status.conditions[*]}{.type}={.status}{"\n"}{end}'

# Or check a single CR named 'kubevirt'
$ oc get kubevirt kubevirt -n "$NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="Available")].status}'

# Verify virt-controller is running
$ oc get pods -n "$NAMESPACE" \
-l kubevirt.io=virt-controller

# Check virt-controller logs for errors
# Replace <virt-controller-pod> with a pod name from the list above
$ oc logs -n "$NAMESPACE" <virt-controller-pod>

# Verify virt-handler is running
$ oc get pods -n "$NAMESPACE" \
-l kubevirt.io=virt-handler -o wide

# Check virt-handler logs for errors (daemonset uses per-node pods)
# Replace <virt-handler-pod> with a pod name from the list above
$ oc logs -n "$NAMESPACE" <virt-handler-pod>
```

5. Review VM specification
Inspect the VM's spec for common misconfigurations, such as scheduling
constraints, resource requests, volume sources, and network settings.

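A quick way to pull the spec sections that most often carry mistakes
(a sketch; the path assumes the standard VirtualMachine template layout):

```bash
# Review the full VM object
$ oc get vm <vm-name> -n <namespace> -o yaml | less

# Scheduling, resources, volumes, and networks live under .spec.template.spec
$ oc get vm <vm-name> -n <namespace> -o jsonpath='{.spec.template.spec}'
```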

## Mitigation

### Resource issues
- **Scale up the cluster** if the issue is insufficient resources
- **Create missing storage classes** or configure default storage
- **Resolve PVC/DataVolume failures**:
```bash
$ oc get pvc -n <namespace>
$ oc describe pvc <pvc-name> -n <namespace>
```
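
If the cluster simply lacks a default storage class, marking a working one as
default is often the quickest fix (a sketch; `<storage-class-name>` is a
placeholder):

```bash
# Mark a storage class as the cluster default
$ oc patch storageclass <storage-class-name> \
  -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
```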

### Image issues
- **Verify image accessibility**:
```bash
# Validate from the node
$ oc debug node/<node-name> -it --image=busybox

# Inside the debug pod, detect runtime and pull
$ ps aux | grep -E "(containerd|dockerd|crio)"

# For CRI-O/containerd clusters:
$ crictl pull <image-name>

# For Docker-based clusters (less common):
$ docker pull <image-name>

$ exit
```
- **Configure image pull secrets** if needed:
```bash
$ oc create secret docker-registry <secret-name> \
--docker-server=<registry-url> \
--docker-username=<username> \
--docker-password=<password>
```

### Scheduling issues
- **Review VM scheduling constraints** and relax if too restrictive:
- nodeSelector, affinity, and tolerations
- Required CPU model, host devices, or features

- **Verify that node taints and tolerations** allow scheduling:
- Ensure the VM tolerates node taints that apply to target nodes

- **Ensure that nodes have required capabilities**:
- KVM availability, CPU features, GPU, SR-IOV, or storage access
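
One way to confirm which nodes KubeVirt itself considers usable (a sketch;
the `kubevirt.io/schedulable` label is set by virt-handler on nodes that can
run VMs):

```bash
# Nodes that virt-handler has marked as able to run VMs (KVM present, etc.)
$ oc get nodes -l kubevirt.io/schedulable=true

# Labels advertising CPU models and features on a specific node
$ oc get node <node-name> -o jsonpath='{.metadata.labels}{"\n"}'
```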

- If nodes were intentionally cordoned for maintenance, **uncordon** when
appropriate:
```bash
$ oc uncordon <node-name>
```

### Configuration issues
- **Fix VM specification errors** based on the `oc describe` output:
```bash
# Edit VM specification directly
$ oc edit vm <vm-name> -n <namespace>

# Or patch specific fields
$ oc patch vm <vm-name> -n <namespace> --type='merge' \
-p='{"spec":{"template":{"spec":{"domain":{"resources": \
{"requests":{"memory":"2Gi"}}}}}}}'
```
- **Create missing secrets/configmaps**:
```bash
$ oc create secret generic <secret-name> \
--from-literal=key=value
```
- **Adjust resource requests** if they exceed node capacity

### Emergency workarounds
- **Restart the VM** to apply specification changes:
```bash
# Restart the VM to pick up spec changes
$ virtctl restart <vm-name> -n <namespace>
```
- **Scale down non-critical workloads** temporarily if resource
constraints exist
- **Change storage class** if PVC provisioning fails:
```bash
# Check current storage class status
$ oc get storageclass
$ oc describe storageclass <current-storage-class>

# Look for PVC provisioning errors
$ oc describe pvc <pvc-name> -n <namespace>

# If seeing "no volume provisioner" or similar errors,
# specify a working storage class in VM spec:
```
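
A hedged sketch of that override, assuming the VM uses `dataVolumeTemplates`
with the `storage` API (adjust the index and path to match the actual spec):

```bash
# Point the first DataVolume template at a known-good storage class
$ oc patch vm <vm-name> -n <namespace> --type='json' \
  -p='[{"op":"add","path":"/spec/dataVolumeTemplates/0/spec/storage/storageClassName","value":"<working-storage-class>"}]'
```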

## Prevention

- **Resource planning:**
- Monitor cluster resource utilization
- Set appropriate guest resources in the VM's domain spec
- Plan storage capacity and provisioning

- **Image management:**
- Use local image registries where possible to reduce latency
- Configure DataVolume import methods appropriately (for example, `http`,
`registry`, or PVC clone sources)
- Pre-pull critical containerDisk images to nodes only if using the node
import method (see the sketch below)
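
A minimal pre-pull sketch, assuming CRI-O nodes and cluster-admin access
(wrap this in a DaemonSet for more than a handful of nodes):

```bash
# Pull the disk image onto every node that KubeVirt marks schedulable
$ for node in $(oc get nodes -l kubevirt.io/schedulable=true -o name); do
    oc debug "$node" -- chroot /host crictl pull <vm-disk-image>
  done
```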

- **Monitoring:**
- Set up alerts for cluster resource exhaustion
- Monitor storage provisioner health
- Track VM startup success rates

- **Testing:**
- Validate VM templates in development environments
- Test VM deployments after cluster changes
- Regularly verify image accessibility
## Escalation

Escalate to the cluster administrator if:
- You are unable to access system logs for further diagnosis
- You do not have enough permissions to run the diagnosis or mitigation steps

## Related alerts

- `VirtControllerDown` - May indicate controller issues preventing
VM processing
- Node- and capacity-related alerts that indicate no nodes are available
for VM scheduling

If you cannot resolve the issue, log in to the
[Red Hat Customer Portal](https://access.redhat.com) and open a support
case, attaching the artifacts gathered during the diagnosis
procedure.