diff --git a/alerts/openshift-virtualization-operator/VirtualMachineStuckInUnhealthyState.md b/alerts/openshift-virtualization-operator/VirtualMachineStuckInUnhealthyState.md
index cb0fce62..ef713835 100644
--- a/alerts/openshift-virtualization-operator/VirtualMachineStuckInUnhealthyState.md
+++ b/alerts/openshift-virtualization-operator/VirtualMachineStuckInUnhealthyState.md
@@ -2,7 +2,7 @@
 
 ## Meaning
 
-This alert triggers when a VirtualMachine (VM) has been in an unhealthy state
+This alert triggers when a virtual machine (VM) is in an unhealthy state
 for more than 10 minutes and does not have an associated VMI
 (VirtualMachineInstance).
 
@@ -12,41 +12,41 @@ initial phases of VM startup when OpenShift Virtualization is trying to
 provision resources, pull images, or schedule the workload.
 
-**Affected States:**
-- `Provisioning` - Resources (DataVolumes, PVCs) are being prepared
+**Affected states:**
+- `Provisioning` - The VM is preparing resources (DataVolumes, PVCs)
 - `Starting` - VM is attempting to start but no VMI exists yet
 - `Terminating` - VM is being deleted but without an active VMI
-- `Error` states - Various scheduling, image, or resource allocation errors
+- `Error` - Various scheduling, image, or resource allocation errors
 
 ## Impact
 
 - **Severity:** Warning
-- **User Impact:** VMs cannot start or are stuck in error states
-- **Business Impact:** Workloads cannot be deployed, which affects application
+- **User impact:** VMs cannot start or are stuck in error states
+- **Business impact:** Workloads cannot be deployed, which affects application
   availability
 
-## Possible Causes
+## Possible causes
 
-### Resource-Related Issues
+### Resource-related issues
 - Insufficient cluster resources (CPU, memory, storage)
 - Missing or misconfigured storage classes
 - PVC provisioning failures
 - DataVolume creation/import failures
 
-### Image and Registry Issues
+### Image and registry issues
 - Container image pull failures for containerDisk volumes
 - Registry authentication problems
 - Network connectivity issues to image registries
 - Missing or corrupted VM disk images
 
-### Scheduling and Node Issues
+### Scheduling and node issues
 - No schedulable nodes available (all nodes cordoned/unschedulable)
 - Insufficient resources like KVM/GPU on available nodes
 - A mismatch between requested and available CPU models
 - Node selector constraints cannot be satisfied
 - Taints and tolerations preventing scheduling
 
-### Configuration Issues
+### Configuration issues
 - Invalid VM specifications (malformed YAML, unsupported features)
 - Missing required Secrets or ConfigMaps
 - Incorrect resource requests/limits
@@ -54,86 +54,86 @@ pull images, or schedule the workload.
 
 ## Diagnosis
 
-### 1. Check VM Status and Events
-```bash
-# Get VM details and status
-oc get vm <vm_name> -n <namespace> -o yaml
+1. Check VM status and events
+   ```bash
+   # Get VM details and status
+   $ oc get vm <vm_name> -n <namespace> -o yaml
 
-# Check VM events for error messages
-oc describe vm <vm_name> -n <namespace>
+   # Check VM events for error messages
+   $ oc describe vm <vm_name> -n <namespace>
 
-# Look for related events in the namespace
-oc get events -n <namespace> --sort-by='.lastTimestamp'
-```
+   # Look for related events in the namespace
+   $ oc get events -n <namespace> --sort-by='.lastTimestamp'
+   ```
 
-### 2. Verify Resource Availability
-```bash
-# Check node resources and schedulability
-oc get nodes -o wide
-oc describe nodes
+2. Verify resource availability
+   ```bash
+   # Check node resources and schedulability
+   $ oc get nodes -o wide
+   $ oc describe nodes
 
-# Check storage classes and provisioners
-oc get storageclass
-oc get pv,pvc -n <namespace>
+   # Check storage classes and provisioners
+   $ oc get storageclass
+   $ oc get pv,pvc -n <namespace>
 
-# For DataVolumes (if using)
-oc get datavolume -n <namespace>
-oc describe datavolume -n <namespace>
-```
+   # For DataVolumes (if using)
+   $ oc get datavolume -n <namespace>
+   $ oc describe datavolume -n <namespace>
+   ```
 
-### 3. Check Image Availability (for containerDisk)
-```bash
-# If using containerDisk, verify image accessibility from the affected node
-# Start a debug session on the node hosting the VM (or a representative node)
-oc debug node/<node_name> -it --image=busybox
+3. Check image availability (for containerDisk)
+   ```bash
+   # If using containerDisk, verify image accessibility from the affected node
+   # Start a debug session on the node hosting the VM (or a representative node)
+   $ oc debug node/<node_name> -it --image=busybox
 
-# Inside the debug pod, check which container runtime is used
-ps aux | grep -E "(containerd|dockerd|crio)"
+   # Inside the debug pod, check which container runtime is used
+   $ ps aux | grep -E "(containerd|dockerd|crio)"
 
-# For CRI-O/containerd clusters use crictl to pull the image
-crictl pull <image>
+   # For CRI-O/containerd clusters use crictl to pull the image
+   $ crictl pull <image>
 
-# For Docker-based clusters (less common)
-docker pull <image>
+   # For Docker-based clusters (less common)
+   $ docker pull <image>
 
-# Exit the debug session when done
-exit
+   # Exit the debug session when done
+   $ exit
 
-# Check image pull secrets if required
-oc get secrets -n <namespace>
-```
+   # Check image pull secrets if required
+   $ oc get secrets -n <namespace>
+   ```
 
-### 4. Verify KubeVirt Configuration
-```bash
-# Discover the KubeVirt installation namespace
-export NAMESPACE="$(oc get kubevirt -A -o custom-columns="":.metadata.namespace)"
+4. Verify OpenShift Virtualization configuration
+   ```bash
+   # Discover the KubeVirt installation namespace
+   $ export NAMESPACE="$(oc get kubevirt -A -o custom-columns="":.metadata.namespace)"
 
-# Check KubeVirt CR conditions (expect Available=True)
-oc get kubevirt -n "$NAMESPACE" \
-  -o jsonpath='{range .items[*].status.conditions[*]}{.type}={.status}{"\n"}{end}'
+   # Check KubeVirt CR conditions (expect Available=True)
+   $ oc get kubevirt -n "$NAMESPACE" \
+     -o jsonpath='{range .items[*].status.conditions[*]}{.type}={.status}{"\n"}{end}'
 
-# Or check a single CR named 'kubevirt'
-oc get kubevirt kubevirt -n "$NAMESPACE" \
-  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
+   # Or check a single CR named 'kubevirt'
+   $ oc get kubevirt kubevirt -n "$NAMESPACE" \
+     -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
 
-# Verify virt-controller is running
-oc get pods -n "$NAMESPACE" \
-  -l kubevirt.io=virt-controller
+   # Verify virt-controller is running
+   $ oc get pods -n "$NAMESPACE" \
+     -l kubevirt.io=virt-controller
 
-# Check virt-controller logs for errors
-# Replace <pod_name> with a pod name from the list above
-oc logs <pod_name> -n "$NAMESPACE"
+   # Check virt-controller logs for errors
+   # Replace <pod_name> with a pod name from the list above
+   $ oc logs <pod_name> -n "$NAMESPACE"
 
-# Verify virt-handler is running
-oc get pods -n "$NAMESPACE" \
-  -l kubevirt.io=virt-handler -o wide
+   # Verify virt-handler is running
+   $ oc get pods -n "$NAMESPACE" \
+     -l kubevirt.io=virt-handler -o wide
 
-# Check virt-handler logs for errors (daemonset uses per-node pods)
-# Replace <pod_name> with a pod name from the list above
-oc logs <pod_name> -n "$NAMESPACE"
-```
+   # Check virt-handler logs for errors (daemonset uses per-node pods)
+   # Replace <pod_name> with a pod name from the list above
+   $ oc logs <pod_name> -n "$NAMESPACE"
+   ```
 
-### 5. Review VM Specification
+5. Review VM specification
 
 Inspect the following details in the VM's spec to catch common
 misconfigurations:
 
@@ -158,91 +158,91 @@ misconfigurations:
 
 ## Mitigation
 
-### Resource Issues
-1. **Scale up the cluster** if the issue is insufficient resources
-2. **Create missing storage classes** or configure default storage
-3. **Resolve PVC/DataVolume failures**:
+### Resource issues
+- **Scale up the cluster** if the issue is insufficient resources
+- **Create missing storage classes** or configure default storage
+- **Resolve PVC/DataVolume failures**:
   ```bash
-  oc get pvc -n <namespace>
-  oc describe pvc -n <namespace>
+  $ oc get pvc -n <namespace>
+  $ oc describe pvc -n <namespace>
   ```
 
-### Image Issues
-1. **Verify image accessibility**:
+### Image issues
+- **Verify image accessibility**:
   ```bash
   # Validate from the node
-  oc debug node/<node_name> -it --image=busybox
+  $ oc debug node/<node_name> -it --image=busybox
 
   # Inside the debug pod, detect runtime and pull
-  ps aux | grep -E "(containerd|dockerd|crio)"
+  $ ps aux | grep -E "(containerd|dockerd|crio)"
 
   # For CRI-O/containerd clusters:
-  crictl pull <image>
+  $ crictl pull <image>
 
   # For Docker-based clusters (less common):
-  docker pull <image>
+  $ docker pull <image>
 
-  exit
+  $ exit
   ```
-2. **Configure image pull secrets** if needed:
+- **Configure image pull secrets** if needed:
   ```bash
-  oc create secret docker-registry <secret_name> \
+  $ oc create secret docker-registry <secret_name> \
     --docker-server=<registry_server> \
    --docker-username=<username> \
    --docker-password=<password>
   ```
 
-### Scheduling Issues
-1. **Review VM scheduling constraints** and relax if too restrictive:
+### Scheduling issues
+- **Review VM scheduling constraints** and relax if too restrictive:
   - nodeSelector, affinity, and tolerations
   - Required CPU model, host devices, or features
-2. **Verify that node taints and tolerations** allow scheduling:
+- **Verify that node taints and tolerations** allow scheduling:
   - Ensure the VM tolerates node taints that apply to target nodes
-3. **Ensure that nodes have required capabilities**:
+- **Ensure that nodes have required capabilities**:
   - KVM availability, CPU features, GPU, SR-IOV, or storage access
-4. If nodes were intentionally cordoned for maintenance, **uncordon** when
+- If nodes were intentionally cordoned for maintenance, **uncordon** when
   appropriate:
   ```bash
-  oc uncordon <node_name>
+  $ oc uncordon <node_name>
   ```
 
-### Configuration Issues Resolution
-1. **Fix VM specification errors** based on the *oc describe* output:
+### Configuration issues resolution
+- **Fix VM specification errors** based on the *oc describe* output:
   ```bash
   # Edit VM specification directly
-  oc edit vm <vm_name> -n <namespace>
+  $ oc edit vm <vm_name> -n <namespace>
 
   # Or patch specific fields
-  oc patch vm <vm_name> -n <namespace> --type='merge' \
+  $ oc patch vm <vm_name> -n <namespace> --type='merge' \
     -p='{"spec":{"template":{"spec":{"domain":{"resources": \
    {"requests":{"memory":"2Gi"}}}}}}}'
   ```
-2. **Create missing secrets/configmaps**:
+- **Create missing secrets/configmaps**:
   ```bash
-  oc create secret generic <secret_name> \
+  $ oc create secret generic <secret_name> \
     --from-literal=key=value
   ```
-3. **Adjust resource requests** if they exceed node capacity
+- **Adjust resource requests** if they exceed node capacity
 
-### Emergency Workarounds
+### Emergency workarounds
 - **Restart the VM** to apply specification changes:
   ```bash
   # Restart the VM to pick up spec changes
-  virtctl restart <vm_name> -n <namespace>
+  $ virtctl restart <vm_name> -n <namespace>
   ```
 - **Scale down non-critical workloads** temporarily if resource constraints
   exist
 - **Change storage class** if PVC provisioning fails:
   ```bash
   # Check current storage class status
-  oc get storageclass
-  oc describe storageclass
+  $ oc get storageclass
+  $ oc describe storageclass
 
   # Look for PVC provisioning errors
-  oc describe pvc -n <namespace>
+  $ oc describe pvc -n <namespace>
 
   # If seeing "no volume provisioner" or similar errors,
   # specify a working storage class in VM spec:
@@ -252,12 +252,12 @@ misconfigurations:
 
 ## Prevention
 
-1. **Resource Planning:**
+- **Resource planning:**
   - Monitor cluster resource utilization
   - Set appropriate VM guest resources in the VM domain guest spec.
   - Plan storage capacity and provisioning
 
-2. **Image Management:**
+- **Image management:**
   - Use local image registries where possible to reduce latency
   - Configure DataVolume import methods appropriately:
 
@@ -267,12 +267,12 @@ misconfigurations:
 
   - Pre-pull critical containerDisk images to nodes only if using node
     import method
 
-3. **Monitoring:**
+- **Monitoring:**
   - Set up alerts for cluster resource exhaustion
   - Monitor storage provisioner health
   - Track VM startup success rates
 
-4. **Testing:**
+- **Testing:**
   - Validate VM templates in development environments
   - Test VM deployments after cluster changes
   - Regularly verify image accessibility
 
@@ -286,7 +286,7 @@ Escalate to the cluster administrator if:
 
 - You are unable to access system logs for further diagnosis
 - You do not have enough permissions to run the diagnosis or mitigation steps
 
-## Related Alerts
+## Related alerts
 
 - `VirtControllerDown` - May indicate controller issues
   preventing VM processing
@@ -295,6 +295,6 @@ Escalate to the cluster administrator if:
   for VM scheduling
 
 If you cannot resolve the issue, log in to the
-[Customer Portal](https://access.redhat.com) and open a support
+[Red Hat Customer Portal](https://access.redhat.com) and open a support
 case, attaching the artifacts gathered during the diagnosis procedure.
\ No newline at end of file