This alert fires when a VirtualMachine with an associated VMI (VirtualMachineInstance) has been stuck in an unhealthy state for more than 5 minutes on a specific node.
The alert indicates that a VirtualMachine has progressed past initial scheduling and has an active VMI, but is experiencing runtime issues on the assigned node. This typically occurs after the VM has been scheduled to a node but encounters problems during startup, operation, or shutdown phases.
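One quick way to see which VMs and VMIs are affected and where they run, assuming permission to list KubeVirt resources cluster-wide:
# List VMs and VMIs across all namespaces; the VMI output includes the
# node each instance is scheduled on
kubectl get vm -A
kubectl get vmi -A -o wide
# Narrow the listing to the node named in the alert
kubectl get vmi -A -o wide | grep <node-name>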
Affected States:
- Starting - VMI exists but VM is failing to reach running state
- Stopping - VM is attempting to stop but the process is stuck
- Terminating - VM is being deleted but the termination process is hanging
- Error states - Runtime errors occurring on the node (ErrImagePull, ImagePullBackOff, etc.)
# Get VM details with node information
kubectl get vm <vm-name> -n <namespace> -o yaml
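To quickly confirm which of the affected states the VM reports, the printableStatus field can be read directly (this assumes a KubeVirt version that populates status.printableStatus):
# Show only the VM's reported state (Starting, Stopping, Terminating, ...)
kubectl get vm <vm-name> -n <namespace> \
-o jsonpath='{.status.printableStatus}{"\n"}'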
# Check VMI status and node assignment
kubectl get vmi <vm-name> -n <namespace> -o yaml
kubectl describe vmi <vm-name> -n <namespace>
# Look for related events
kubectl get events -n <namespace> \
--field-selector involvedObject.name=<vm-name>
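If the event stream is noisy, restricting it to warnings and sorting by time can make the relevant entries easier to spot (standard kubectl options):
# Show only warning events, newest last
kubectl get events -n <namespace> --field-selector type=Warning \
--sort-by=.lastTimestamp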
# Find the virt-launcher pod for this VM
kubectl get pods -n <namespace> -l kubevirt.io/domain=<vm-name>
# Check pod status and events
kubectl describe pod <virt-launcher-pod> -n <namespace>
# Check pod logs for errors
kubectl logs <virt-launcher-pod> -n <namespace> -c compute
# If using Istio, also check the istio-proxy sidecar
kubectl logs <virt-launcher-pod> -n <namespace> -c istio-proxy
# Optional: Check resource usage for the virt-launcher pod
kubectl top pod <virt-launcher-pod> -n <namespace>
# Check node status and conditions (may require admin
# permissions)
kubectl describe node <node-name>
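To focus on just the node conditions (Ready, DiskPressure, MemoryPressure, PIDPressure) without reading the full describe output:
# Print only the Conditions section of the node description
kubectl describe node <node-name> | grep -A 10 "Conditions:"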
# Discover the KubeVirt installation namespace
export NAMESPACE="$(kubectl get kubevirt -A -o custom-columns="":.metadata.namespace)"
# Check virt-handler on the affected node
kubectl get pods -n "$NAMESPACE" -o wide | grep <node-name>
kubectl logs <virt-handler-pod> -n "$NAMESPACE"
# Verify PVC status and mounting
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
# Check volume attachments on the node
kubectl get volumeattachment | grep <node-name>
# For DataVolumes, check their status
kubectl get dv -n <namespace>
kubectl describe dv <dv-name> -n <namespace>
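If a DataVolume is stuck importing, the CDI worker pod usually holds the useful error message; CDI normally prefixes these pods with importer- (the pod name placeholder below is an assumption based on that convention):
# Find CDI importer/clone/upload pods related to the DataVolume
kubectl get pods -n <namespace> | grep -E "importer|cdi-upload|clone"
# Inspect the worker pod logs for pull or conversion errors
kubectl logs <importer-pod-name> -n <namespace>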
# Verify image accessibility from the affected node
kubectl debug node/<node-name> -it --image=busybox
# The debug pod mounts the node's root filesystem at /host; chroot into
# it so the node's own runtime tools (crictl/docker) are available
chroot /host
# Inside the chroot, check which container runtime is used
ps aux | grep -E "(containerd|dockerd|crio)"
# For CRI-O/containerd clusters:
crictl pull <vm-disk-image>
# For Docker-based clusters (less common):
docker pull <vm-disk-image>
# Exit the chroot and then the debug session when done
exit
exit
# Check the libvirt domain for this VM inside the virt-launcher pod
kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
-- virsh list --all | grep <vm-name>
kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
-- virsh dumpxml <domain-name>
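If only the domain's run state is needed, virsh domstate gives a one-line answer (KubeVirt names libvirt domains <namespace>_<vm-name>):
# Show the domain's current state (running, paused, shut off, ...)
kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
-- virsh domstate <domain-name>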
kubectl delete pod <virt-launcher-pod> -n <namespace>
# The VMI controller will recreate it
kubectl describe pod <virt-launcher-pod> -n <namespace>
# Look for resource limit violations
# SSH to the node or start a debug session on the node:
kubectl debug node/<node-name> -it --image=busybox
# The node's root filesystem is mounted at /host; chroot into it to use
# the node's runtime tools
chroot /host
# Detect which container runtime is in use
ps aux | grep -E "(containerd|dockerd|crio)"
# List cached images first
# For CRI-O/containerd clusters:
crictl images
# For Docker-based clusters:
docker images
# Remove only if a corrupted/stale image is suspected
# For CRI-O/containerd clusters:
crictl rmi <problematic-image>
# For Docker-based clusters:
docker rmi <problematic-image>
# Exit the chroot and the debug session when done
exit
exit
# Delete and recreate the virt-launcher pod
kubectl delete pod <virt-launcher-pod> -n <namespace>
# Check PVC binding if storage issues are suspected
kubectl get pvc -n <namespace>
# If PVC is stuck, check the storage provisioner
kubectl get volumeattachment
# Delete stuck volume attachments if necessary
kubectl delete volumeattachment <attachment-name>
# If the node itself is unhealthy, drain it to evict workloads, then
# uncordon it once the node recovers
kubectl drain <node-name> --ignore-daemonsets \
--delete-emptydir-data
kubectl uncordon <node-name>
# Restart virt-handler on the node
kubectl delete pod <virt-handler-pod> -n "$NAMESPACE"
# If the VMI is stuck terminating, force its deletion
kubectl delete vmi <vm-name> -n <namespace> --force \
--grace-period=0
# Alternatively, live migrate the VM to another node
virtctl migrate <vm-name> -n <namespace>
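If the migration path is taken, progress can be followed through the VirtualMachineInstanceMigration object that virtctl creates (vmim is its short name) and the migrationState field on the VMI:
# Watch the migration object and the VMI's migration state
kubectl get vmim -n <namespace>
kubectl get vmi <vm-name> -n <namespace> -o yaml | grep -A 10 migrationState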
# If migration is not possible, force delete the VMI so that it can be
# recreated on a healthy node
kubectl delete vmi <vm-name> -n <namespace> --force \
--grace-period=0
Escalate to the cluster administrator if the steps above do not resolve the issue.
Related alerts:
- OrphanedVirtualMachineInstances - May indicate virt-handler problems on nodes
- VirtHandlerDown - Related to virt-handler pod failures
- VirtualMachineStuckInUnhealthyState - For VMs that haven't progressed to having VMIs
If you cannot resolve the issue, see the following resources: