
VirtualMachineStuckOnNode

Meaning

This alert triggers when a virtual machine (VM) with an associated VirtualMachineInstance (VMI) has been stuck in an unhealthy state for more than 5 minutes on a specific node.

The alert indicates that the VM has progressed past initial scheduling and has an active VMI, but is experiencing runtime issues on the assigned node. These issues typically arise during the startup, operation, or shutdown phase, after the VM has already been placed on the node.
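
For example, you can list all VMIs together with their phase and assigned node to see which workloads the alert is referring to (a quick check; the exact output columns can vary by KubeVirt version):

  $ kubectl get vmi -A -o wide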

Affected states:

Impact

Possible Causes

Node-level issues

QEMU/KVM issues

Image and storage issues

virt-launcher pod issues

libvirt/domain issues

Diagnosis

  1. Check VM and VMI status:
     # Get VM details with node information
     $ kubectl get vm <vm-name> -n <namespace> -o yaml
    
     # Check VMI status and node assignment
     $ kubectl get vmi <vm-name> -n <namespace> -o yaml
     $ kubectl describe vmi <vm-name> -n <namespace>
    
     # Look for related events
     $ kubectl get events -n <namespace> \
       --field-selector involvedObject.name=<vm-name>
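
     # Optionally, summarize the VMI phase, assigned node, and conditions
     # in one call (field names assume the standard VMI status schema)
     $ kubectl get vmi <vm-name> -n <namespace> \
       -o jsonpath='{.status.phase} {.status.nodeName} {.status.conditions}'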
    
  2. Examine virt-launcher pod:
     # Find the virt-launcher pod for this VM
     $ kubectl get pods -n <namespace> -l kubevirt.io/domain=<vm-name>
    
     # Check pod status and events
     $ kubectl describe pod <virt-launcher-pod> -n <namespace>
    
     # Check pod logs for errors
     $ kubectl logs <virt-launcher-pod> -n <namespace> -c compute
     # If using Istio, also check the sidecar proxy logs
     $ kubectl logs <virt-launcher-pod> -n <namespace> -c istio-proxy
    
     # Optional: Check resource usage for the virt-launcher pod
     $ kubectl top pod <virt-launcher-pod> -n <namespace>
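
     # If the compute container has restarted, its previous logs often
     # hold the original error
     $ kubectl logs <virt-launcher-pod> -n <namespace> -c compute --previous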
    
  3. Investigate node health:
     # Check node status and conditions (may require admin
     # permissions)
     $ kubectl describe node <node-name>
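
     # Summarize the node conditions (Ready, MemoryPressure, DiskPressure, ...)
     $ kubectl get node <node-name> -o \
       jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'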
    
     # Discover the KubeVirt installation namespace
     $ export NAMESPACE="$(kubectl get kubevirt -A -o custom-columns="":.metadata.namespace)"
    
     # Check virt-handler on the affected node
     $ kubectl get pods -n "$NAMESPACE" -o wide | grep <node-name>
     $ kubectl logs <virt-handler-pod> -n "$NAMESPACE"
    
  4. Check storage and volumes:
     # Verify PVC status and mounting
     $ kubectl get pvc -n <namespace>
     $ kubectl describe pvc <pvc-name> -n <namespace>
    
     # Check volume attachments on the node
     $ kubectl get volumeattachment | grep <node-name>
    
     # For DataVolumes, check their status
     $ kubectl get dv -n <namespace>
     $ kubectl describe dv <dv-name> -n <namespace>
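
     # If a DataVolume import is still running, check its CDI importer pod
     # (importer pod names usually start with "importer-")
     $ kubectl get pods -n <namespace> | grep importer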
    
  5. Verify image accessibility from node:
     # Start a debug pod on the affected node
     $ kubectl debug node/<node-name> -it --image=busybox
    
     # Inside the debug pod, the node's root filesystem is mounted at
     # /host; chroot into it so the node's own runtime tools are available
     $ chroot /host
    
     # Check which container runtime is in use
     $ ps aux | grep -E "(containerd|dockerd|crio)"
    
     # For CRI-O/containerd clusters:
     $ crictl pull <vm-disk-image>
    
     # For Docker-based clusters (less common):
     $ docker pull <vm-disk-image>
    
     # Exit the chroot, then the debug session, when done
     $ exit
     $ exit
    
  6. Exec into the virt-launcher pod's compute container and inspect domains:
     $ kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
       -- virsh list --all | grep <vm-name>
     $ kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
       -- virsh dumpxml <domain-name>
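
     # The domain name is usually <namespace>_<vm-name>; optionally check
     # the domain state and the reason it entered that state
     $ kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
       -- virsh domstate <domain-name> --reason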
    

Mitigation

Pod-Level Issues
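
If the virt-launcher pod is stuck or crash-looping, a common remediation is to delete it so that the controller recreates it. This is a hedged sketch, not the only option: deleting the pod stops the guest, and it is only restarted automatically if the VM's run strategy (for example Always) allows it.

  $ kubectl delete pod <virt-launcher-pod> -n <namespace>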

Image Issues on Node
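
If the disk image cannot be pulled on the node, identify the image the VM references and pre-pull it from the node debug session shown in the Diagnosis section. A hedged sketch; the jsonpath below assumes the VM uses a containerDisk volume:

  $ kubectl get vm <vm-name> -n <namespace> \
    -o jsonpath='{.spec.template.spec.volumes[*].containerDisk.image}'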

Storage Issues
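
If the disk PVC is not Bound, or its volume cannot be attached to the node, the PVC and VolumeAttachment events usually point at the failing provisioner or CSI driver. A hedged sketch; coordinate with the storage administrator before deleting any storage objects:

  $ kubectl describe pvc <pvc-name> -n <namespace>
  $ kubectl describe volumeattachment <volumeattachment-name>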

Node-Level Issues Resolution
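
If the node itself is unhealthy, prevent new workloads from landing on it and move the VM elsewhere. A hedged sketch, assuming virtctl is installed and the VM supports live migration:

  # Stop scheduling new workloads onto the affected node
  $ kubectl cordon <node-name>

  # Live-migrate the VM to another node
  $ virtctl migrate <vm-name> -n <namespace>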

VM-Level Resolution
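
If the node and storage look healthy, restarting the VM often clears a stuck VMI. A hedged sketch, assuming virtctl is installed and a guest restart is acceptable:

  $ virtctl restart <vm-name> -n <namespace>

  # Or stop and start explicitly
  $ virtctl stop <vm-name> -n <namespace>
  $ virtctl start <vm-name> -n <namespace>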

Emergency Actions
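
If the VM does not respond to a normal stop, it can be forced down. These actions terminate the guest immediately and risk data loss, so treat them as a last resort. A hedged sketch; flag support can vary by virtctl version:

  # Force-stop the VM without a grace period
  $ virtctl stop <vm-name> -n <namespace> --force --grace-period 0

  # As a last resort, delete the VMI directly
  $ kubectl delete vmi <vm-name> -n <namespace>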

Prevention

Escalation

Escalate to the cluster administrator if:

If you cannot resolve the issue, see the following resources: