From b6a2e0de12604564609eb0f0a673c81134cc2cbf Mon Sep 17 00:00:00 2001 From: hco-bot <71450783+hco-bot@users.noreply.github.com> Date: Wed, 4 Feb 2026 04:57:16 +0000 Subject: [PATCH] Sync CNV runbook VirtualMachineStuckOnNode.md (Updated at 2026-02-03 12:16:09 +0000 UTC) --- .../VirtualMachineStuckOnNode.md | 264 +++++++++--------- 1 file changed, 131 insertions(+), 133 deletions(-) diff --git a/alerts/openshift-virtualization-operator/VirtualMachineStuckOnNode.md b/alerts/openshift-virtualization-operator/VirtualMachineStuckOnNode.md index 146b0a07..6800343d 100644 --- a/alerts/openshift-virtualization-operator/VirtualMachineStuckOnNode.md +++ b/alerts/openshift-virtualization-operator/VirtualMachineStuckOnNode.md @@ -2,60 +2,60 @@ ## Meaning -This alert triggers when a VirtualMachine (VM) with an associated +This alert triggers when a virtual machine (VM) with an associated VirtualMachineInstance (VMI) has been stuck in an unhealthy state for more than 5 minutes on a specific node. The alert indicates that the VM has progressed past initial scheduling and has an active VMI, but is experiencing runtime issues on the assigned node. This typically occurs after the VM has been scheduled to a node but encounters -problems during startup, operation, or shutdown phases. +problems during the startup, operation, or shutdown phases. -**Affected States:** +**Affected states:** - `Starting` - VMI exists but VM is failing to reach running state - `Stopping` - VM is attempting to stop but the process is stuck - `Terminating` - VM is being deleted but the termination process is hanging -- `Error` states - Runtime errors occurring on the node (ErrImagePull, +- `Error` - Runtime errors are occurring on the node (ErrImagePull, ImagePullBackOff, etc.) ## Impact - **Severity:** Warning -- **User Impact:** VMs are unresponsive or stuck in transition states -- **Business Impact:** Running workloads may be disrupted, which affects +- **User impact:** VMs are unresponsive or stuck in transition states +- **Business impact:** Running workloads might be disrupted, which affects application performance and availability - **Node Impact:** Resources may be tied up by unresponsive VMs, which affects other workloads on the same node ## Possible Causes -### Node-Level Issues +### Node-level issues - **Node resource exhaustion** (CPU, memory, storage) - **Container runtime problems** (issues with containerd or CRI-O) - **Insufficient storage** on the node - **Network connectivity issues** from the node - **Node entering the NotReady state** or maintenance mode -### QEMU/KVM Issues +### QEMU/KVM issues - **QEMU process failures** or hangs - **KVM acceleration problems** on the node - **Nested virtualization** configuration issues - **Hardware compatibility** problems -### Image and Storage Issues +### Image and storage issues - **Container image pull failures** specific to the node - **Local image cache corruption** - **PVC mount failures** on the node - **Storage backend connectivity** issues from the node - **Volume attachment timeouts** -### virt-launcher Pod Issues +### virt-launcher pod issues - **virt-launcher pod** stuck in non-ready state - **Pod resource limits** being exceeded - **Security policy violations** (SELinux, AppArmor) - **Networking problems** within the pod -### libvirt/Domain Issues +### libvirt/domain issues - **libvirt daemon** problems on the node - **Domain definition** conflicts or corruption - **Migration failures** (if VM was being migrated) @@ -64,211 +64,209 @@ other workloads on the same 
node
 
 ## Diagnosis
 
-### 1. Check VM and VMI Status
-```bash
-# Get VM details with node information
-oc get vm <vm-name> -n <namespace> -o yaml
-
-# Check VMI status and node assignment
-oc get vmi <vmi-name> -n <namespace> -o yaml
-oc describe vmi <vmi-name> -n <namespace>
-
-# Look for related events
-oc get events -n <namespace> \
-  --field-selector involvedObject.name=<vm-name>
-```
-
-### 2. Examine virt-launcher Pod
-```bash
-# Find the virt-launcher pod for this VM
-oc get pods -n <namespace> -l kubevirt.io/domain=<vm-name>
-
-# Check pod status and events
-oc describe pod <virt-launcher-pod> -n <namespace>
-
-# Check pod logs for errors
-oc logs <virt-launcher-pod> -n <namespace> -c compute
-oc logs <virt-launcher-pod> -n <namespace> -c istio-proxy \
-  # if using Istio
-
-# Optional: Check resource usage for the virt-launcher pod
-oc top pod <virt-launcher-pod> -n <namespace>
-```
-
-### 3. Investigate Node Health
-```bash
-# Check node status and conditions (may require admin
-# permissions)
-oc describe node <node-name>
-
-# Discover the KubeVirt installation namespace
-export NAMESPACE="$(oc get kubevirt -A -o custom-columns="":.metadata.namespace)"
-
-# Check virt-handler on the affected node
-oc get pods -n "$NAMESPACE" -o wide | grep <node-name>
-oc logs <virt-handler-pod> -n "$NAMESPACE"
-```
-
-### 4. Check Storage and Volumes
-```bash
-# Verify PVC status and mounting
-oc get pvc -n <namespace>
-oc describe pvc <pvc-name> -n <namespace>
-
-# Check volume attachments on the node
-oc get volumeattachment | grep <node-name>
-
-# For DataVolumes, check their status
-oc get dv -n <namespace>
-oc describe dv <dv-name> -n <namespace>
-```
-
-### 5. Verify Image Accessibility from Node
-```bash
-# Verify image accessibility from the affected node
-oc debug node/<node-name> -it --image=busybox
-
-# Inside the debug pod, check which container runtime is used
-ps aux | grep -E "(containerd|dockerd|crio)"
-
-# For CRI-O/containerd clusters:
-crictl pull <image>
-
-# For Docker-based clusters (less common):
-docker pull <image>
-
-# Exit the debug session when done
-exit
-```
-
-### 6. Exec into the virt-launcher pod's compute container and inspect
-domains
-```bash
-oc exec -it <virt-launcher-pod> -n <namespace> -c compute \
-  -- virsh list --all | grep <vm-name>
-oc exec -it <virt-launcher-pod> -n <namespace> -c compute \
-  -- virsh dumpxml <domain-name>
-```
+1. Check VM and VMI status:
+   ```bash
+   # Get VM details with node information
+   $ oc get vm <vm-name> -n <namespace> -o yaml
+
+   # Check VMI status and node assignment
+   $ oc get vmi <vmi-name> -n <namespace> -o yaml
+   $ oc describe vmi <vmi-name> -n <namespace>
+
+   # Look for related events
+   $ oc get events -n <namespace> \
+     --field-selector involvedObject.name=<vm-name>
+   ```
+
+2. Examine the virt-launcher pod:
+   ```bash
+   # Find the virt-launcher pod for this VM
+   $ oc get pods -n <namespace> -l kubevirt.io/domain=<vm-name>
+
+   # Check pod status and events
+   $ oc describe pod <virt-launcher-pod> -n <namespace>
+
+   # Check pod logs for errors
+   $ oc logs <virt-launcher-pod> -n <namespace> -c compute
+
+   # If the VM uses Istio, also check the sidecar logs
+   $ oc logs <virt-launcher-pod> -n <namespace> -c istio-proxy
+
+   # Optional: Check resource usage for the virt-launcher pod
+   $ oc top pod <virt-launcher-pod> -n <namespace>
+   ```
+
+3. Investigate node health:
+   ```bash
+   # Check node status and conditions (may require admin
+   # permissions)
+   $ oc describe node <node-name>
+
+   # Discover the KubeVirt installation namespace
+   $ export NAMESPACE="$(oc get kubevirt -A -o custom-columns="":.metadata.namespace)"
+
+   # Check virt-handler on the affected node
+   $ oc get pods -n "$NAMESPACE" -o wide | grep <node-name>
+   $ oc logs <virt-handler-pod> -n "$NAMESPACE"
+   ```
+
+4. Check storage and volumes:
+   ```bash
+   # Verify PVC status and mounting
+   $ oc get pvc -n <namespace>
+   $ oc describe pvc <pvc-name> -n <namespace>
+
+   # Check volume attachments on the node
+   $ oc get volumeattachment | grep <node-name>
+
+   # For DataVolumes, check their status
+   $ oc get dv -n <namespace>
+   $ oc describe dv <dv-name> -n <namespace>
+   ```
+
+5. Verify image accessibility from the node:
+   ```bash
+   # Verify image accessibility from the affected node
+   $ oc debug node/<node-name> -it --image=busybox
+
+   # Inside the debug pod, check which container runtime is used
+   $ ps aux | grep -E "(containerd|dockerd|crio)"
+
+   # For CRI-O/containerd clusters:
+   $ crictl pull <image>
+
+   # For Docker-based clusters (less common):
+   $ docker pull <image>
+
+   # Exit the debug session when done
+   $ exit
+   ```
+
+6. Exec into the virt-launcher pod's compute container and inspect domains:
+   ```bash
+   $ oc exec -it <virt-launcher-pod> -n <namespace> -c compute \
+     -- virsh list --all | grep <vm-name>
+   $ oc exec -it <virt-launcher-pod> -n <namespace> -c compute \
+     -- virsh dumpxml <domain-name>
+   ```
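+
+7. Optionally, check whether other VMIs on the same node are also affected.
+   If several VMIs or virt-launcher pods on the node are unhealthy, the
+   problem is likely node-wide rather than specific to a single VM. For
+   example (exact output columns can vary by version):
+   ```bash
+   # List all VMIs and filter for the affected node
+   $ oc get vmi -A -o wide | grep <node-name>
+
+   # List virt-launcher pods running on the affected node
+   $ oc get pods -A -o wide | grep virt-launcher | grep <node-name>
+   ```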
 
 ## Mitigation
 
 ### Pod-Level Issues
-1. **Restart the virt-launcher pod**:
+- **Restart the virt-launcher pod**:
   ```bash
-  oc delete pod <virt-launcher-pod> -n <namespace>
+  $ oc delete pod <virt-launcher-pod> -n <namespace>
   # The VMI controller will recreate it
   ```
-2. **Check resource constraints**:
+- **Check resource constraints**:
   ```bash
-  oc describe pod <virt-launcher-pod> -n <namespace>
+  $ oc describe pod <virt-launcher-pod> -n <namespace>
   # Look for resource limit violations
   ```
 
 ### Image Issues on Node
-1. **Inspect and, if necessary, clear image cache** on the node:
+- **Inspect and, if necessary, clear image cache** on the node:
   ```bash
   # SSH to the node or start a debug session on the node:
-  oc debug node/<node-name> -it --image=busybox
+  $ oc debug node/<node-name> -it --image=busybox
 
   # Detect which container runtime is in use
-  ps aux | grep -E "(containerd|dockerd|crio)"
+  $ ps aux | grep -E "(containerd|dockerd|crio)"
 
   # List cached images first
   # For CRI-O/containerd clusters:
-  crictl images
+  $ crictl images
   # For Docker-based clusters:
-  docker images
+  $ docker images
 
   # Remove only if a corrupted/stale image is suspected
   # For CRI-O/containerd clusters:
-  crictl rmi <image>
+  $ crictl rmi <image>
   # For Docker-based clusters:
-  docker rmi <image>
+  $ docker rmi <image>
 
-  exit
+  $ exit
   ```
-2. **Force image re-pull**:
+- **Force image re-pull**:
   ```bash
   # Delete and recreate the virt-launcher pod
-  oc delete pod <virt-launcher-pod> -n <namespace>
+  $ oc delete pod <virt-launcher-pod> -n <namespace>
   ```
 
 ### Storage Issues
-1. **Check PVC binding and mounting**:
+- **Check PVC binding and mounting**:
   ```bash
-  oc get pvc -n <namespace>
+  $ oc get pvc -n <namespace>
   # If PVC is stuck, check the storage provisioner
   ```
-2. **Resolve volume attachment issues**:
+- **Resolve volume attachment issues**:
   ```bash
-  oc get volumeattachment
+  $ oc get volumeattachment
   # Delete stuck volume attachments if necessary
-  oc delete volumeattachment <volumeattachment-name>
+  $ oc delete volumeattachment <volumeattachment-name>
   ```
 
 ### Node-Level Issues Resolution
-1. **Drain and uncordon the node** if it is in a bad state:
+- **Drain and uncordon the node** if it is in a bad state:
   ```bash
-  oc drain <node-name> --ignore-daemonsets \
+  $ oc drain <node-name> --ignore-daemonsets \
     --delete-emptydir-data
-  oc uncordon <node-name>
+  $ oc uncordon <node-name>
   ```
-2. **Restart node-level components**:
+- **Restart node-level components**:
   ```bash
   # Restart virt-handler on the node
-  oc delete pod <virt-handler-pod> -n "$NAMESPACE"
+  $ oc delete pod <virt-handler-pod> -n "$NAMESPACE"
   ```
 
 ### VM-Level Resolution
-1. **Force-delete the VMI (triggers creating a new VMI)**:
+- **Force-delete the VMI (this triggers creation of a new VMI)**:
   ```bash
-  oc delete vmi <vmi-name> -n <namespace> --force \
+  $ oc delete vmi <vmi-name> -n <namespace> --force \
     --grace-period=0
   ```
-2. **Migrate the VM to a different node**:
+- **Migrate the VM to a different node**:
   ```bash
-  virtctl migrate <vm-name> -n <namespace>
+  $ virtctl migrate <vm-name> -n <namespace>
   ```
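+- **Monitor the migration**, if one was triggered. `virtctl migrate` creates a
+  VirtualMachineInstanceMigration object whose status shows the progress; for
+  example (exact status fields can vary between versions):
+  ```bash
+  # List migration objects in the VM namespace
+  $ oc get vmim -n <namespace>
+
+  # Inspect the migration state reported on the VMI
+  $ oc get vmi <vmi-name> -n <namespace> -o yaml | grep -A 10 migrationState
+  ```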
 
 ### Emergency Actions
 - **Live migrate** critical VMs away from the problematic node
-- **Force delete** the unresponsive VMI if safe to do so:
+- **Force delete** the unresponsive VMI if it is safe to do so:
   ```bash
-  oc delete vmi <vmi-name> -n <namespace> --force --grace-period=0
+  $ oc delete vmi <vmi-name> -n <namespace> --force --grace-period=0
   ```
 - **Cordon the node** to prevent new VM scheduling while investigating
 
 ## Prevention
 
-1. **Node Health Monitoring:**
+- **Node health monitoring:**
   - Monitor node resource utilization (CPU, memory, storage)
   - Set up alerts for node conditions and taints
   - Perform regular health checks on container runtime
-2. **Resource Management:**
+- **Resource management:**
   - Set appropriate resource requests/limits on VMs
   - Monitor PVC and storage utilization
   - Plan for node capacity and VM density
-3. **Image Management:**
+- **Image management:**
   - Use image pull policies appropriately (Always, IfNotPresent)
   - Pre-pull critical images to nodes
   - Monitor image registry health and connectivity
-4. **Networking:**
+- **Networking:**
   - Ensure stable network connectivity between nodes and storage
   - Monitor DNS resolution and service discovery
   - Validate network policies do not block required traffic
-5. **Regular Maintenance:**
+- **Regular maintenance:**
   - Keep nodes and OpenShift Virtualization components updated
 
 ## Escalation
@@ -293,6 +291,6 @@ Escalate to the cluster administrator if:
 progressed to having VMIs
 
 If you cannot resolve the issue, log in to the
-[Customer Portal](https://access.redhat.com) and open a support
+[Red Hat Customer Portal](https://access.redhat.com) and open a support
 case, attaching the artifacts gathered during the diagnosis
 procedure.
\ No newline at end of file