This alert triggers when a VirtualMachine (VM) with an associated VirtualMachineInstance (VMI) has been stuck in an unhealthy state for more than 5 minutes on a specific node.
The alert indicates that the VM has progressed past initial scheduling and has an active VMI, but is experiencing runtime issues on the assigned node. This typically occurs after the VM has been scheduled to a node but encounters problems during startup, operation, or shutdown phases.
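To see which of the states listed below the VM is currently reporting, a quick check of the status fields is often enough (a minimal check, assuming a KubeVirt version that exposes status.printableStatus on the VirtualMachine):

# Show the VM's reported state (e.g. Starting, Stopping, Terminating)
kubectl get vm <vm-name> -n <namespace> -o jsonpath='{.status.printableStatus}{"\n"}'
# Show the VMI phase (e.g. Scheduled, Running, Failed)
kubectl get vmi <vm-name> -n <namespace> -o jsonpath='{.status.phase}{"\n"}'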
Affected States:
Starting - VMI exists but VM is failing to reach running state
Stopping - VM is attempting to stop but the process is stuck
Terminating - VM is being deleted but the termination process is hanging
Error states - Runtime errors occurring on the node (ErrImagePull, ImagePullBackOff, etc.)

# Get VM details with node information
kubectl get vm <vm-name> -n <namespace> -o yaml
# Check VMI status and node assignment
kubectl get vmi <vm-name> -n <namespace> -o yaml
kubectl describe vmi <vm-name> -n <namespace>
# Look for related events
kubectl get events -n <namespace> \
--field-selector involvedObject.name=<vm-name>
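# Optional: sort the same events by time so the most recent appear last
kubectl get events -n <namespace> --sort-by=.lastTimestamp \
--field-selector involvedObject.name=<vm-name>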
# Find the virt-launcher pod for this VM
kubectl get pods -n <namespace> -l kubevirt.io/domain=<vm-name>
# Check pod status and events
kubectl describe pod <virt-launcher-pod> -n <namespace>
# Check pod logs for errors
kubectl logs <virt-launcher-pod> -n <namespace> -c compute
# If using Istio
kubectl logs <virt-launcher-pod> -n <namespace> -c istio-proxy
# Optional: Check resource usage for the virt-launcher pod
kubectl top pod <virt-launcher-pod> -n <namespace>
# Check node status and conditions (may require admin
# permissions)
kubectl describe node <node-name>
# Discover the KubeVirt installation namespace
export NAMESPACE="$(kubectl get kubevirt -A -o custom-columns="":.metadata.namespace)"
# Check virt-handler on the affected node
kubectl get pods -n "$NAMESPACE" -o wide | grep <node-name>
kubectl logs <virt-handler-pod> -n "$NAMESPACE"
# Verify PVC status and mounting
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
# Check volume attachments on the node
kubectl get volumeattachment | grep <node-name>
# For DataVolumes, check their status
kubectl get dv -n <namespace>
kubectl describe dv <dv-name> -n <namespace>
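# Optional: show just the DataVolume phase and import progress (assumes CDI
# DataVolumes, which report .status.phase and .status.progress)
kubectl get dv <dv-name> -n <namespace> \
-o jsonpath='{.status.phase}{" "}{.status.progress}{"\n"}'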
# Verify image accessibility from the affected node
kubectl debug node/<node-name> -it --image=busybox
# Inside the debug pod, switch to the host filesystem so the host's
# container runtime tools are available
chroot /host
# Check which container runtime is used
ps aux | grep -E "(containerd|dockerd|crio)"
# For CRI-O/containerd clusters:
crictl pull <vm-disk-image>
# For Docker-based clusters (less common):
docker pull <vm-disk-image>
# Exit the chroot shell, then the debug session, when done
exit
exit
kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
-- virsh list --all | grep <vm-name>
kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
-- virsh dumpxml <domain-name>
kubectl delete pod <virt-launcher-pod> -n <namespace>
# The VMI controller will recreate it
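# Optional: watch for the replacement pod using the same label selector as above
kubectl get pods -n <namespace> -l kubevirt.io/domain=<vm-name> -w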
kubectl describe pod <virt-launcher-pod> -n <namespace>
# Look for resource limit violations
# SSH to the node or start a debug session on the node:
kubectl debug node/<node-name> -it --image=busybox
# Switch to the host filesystem so the host's runtime tools are available
chroot /host
# Detect which container runtime is in use
ps aux | grep -E "(containerd|dockerd|crio)"
# List cached images first
# For CRI-O/containerd clusters:
crictl images
# For Docker-based clusters:
docker images
# Remove only if a corrupted/stale image is suspected
# For CRI-O/containerd clusters:
crictl rmi <problematic-image>
# For Docker-based clusters:
docker rmi <problematic-image>
# Exit the chroot shell, then the debug session
exit
exit
# Delete and recreate the virt-launcher pod
kubectl delete pod <virt-launcher-pod> -n <namespace>
kubectl get pvc -n <namespace>
# If PVC is stuck, check the storage provisioner
kubectl get volumeattachment
# Delete stuck volume attachments if necessary
kubectl delete volumeattachment <attachment-name>
# Drain the node: evicts its pods, ignoring DaemonSets and deleting emptyDir data
kubectl drain <node-name> --ignore-daemonsets \
--delete-emptydir-data
# Mark the node schedulable again once it is healthy
kubectl uncordon <node-name>
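# Optional: confirm the node is Ready and schedulable again after uncordoning
kubectl get node <node-name>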
# Restart virt-handler on the node
kubectl delete pod <virt-handler-pod> -n "$NAMESPACE"
# Force delete the stuck VMI (skips graceful shutdown)
kubectl delete vmi <vm-name> -n <namespace> --force \
--grace-period=0
# Alternatively, live migrate the VM to a healthy node (requires the VM to be live-migratable)
virtctl migrate <vm-name> -n <namespace>
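# Optional: if a migration was triggered, follow its progress via the
# VirtualMachineInstanceMigration objects in the namespace
kubectl get virtualmachineinstancemigrations -n <namespace>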
# If migration is not possible, force delete the VMI as a last resort
kubectl delete vmi <vm-name> -n <namespace> --force --grace-period=0
Escalate to the cluster administrator if:
OrphanedVirtualMachineInstances - May indicate virt-handler problems on nodes
VirtHandlerDown - Related to virt-handler pod failures
VirtualMachineStuckInUnhealthyState - For VMs that haven't progressed to having VMIs

If you cannot resolve the issue, see the following resources: