VirtualMachineStuckOnNode

Meaning

This alert triggers when a VirtualMachine (VM) with an associated VirtualMachineInstance (VMI) has been stuck in an unhealthy state for more than 5 minutes on a specific node.

The alert indicates that the VM has progressed past initial scheduling and has an active VMI, but is experiencing runtime issues on the assigned node. This typically occurs after the VM has been scheduled to a node but encounters problems during startup, operation, or shutdown phases.
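
To see which VMs and VMIs are currently affected, and on which nodes
they run, it can help to list them cluster-wide (vm and vmi are the
KubeVirt short names; the exact columns may vary slightly by version):

kubectl get vm -A
kubectl get vmi -A -o wide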

Affected States:

Impact

Possible Causes

Node-Level Issues

QEMU/KVM Issues

Image and Storage Issues

virt-launcher Pod Issues

libvirt/Domain Issues

Diagnosis

1. Check VM and VMI Status

# Get VM details with node information
kubectl get vm <vm-name> -n <namespace> -o yaml

# Check VMI status and node assignment
kubectl get vmi <vm-name> -n <namespace> -o yaml
kubectl describe vmi <vm-name> -n <namespace>

# Look for related events
kubectl get events -n <namespace> \
  --field-selector involvedObject.name=<vm-name>
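
# Optional: print just the assigned node and current phase from the
# VMI status
kubectl get vmi <vm-name> -n <namespace> \
  -o jsonpath='{.status.nodeName}{"\t"}{.status.phase}{"\n"}'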

2. Examine virt-launcher Pod

# Find the virt-launcher pod for this VM
kubectl get pods -n <namespace> -l kubevirt.io/domain=<vm-name>

# Check pod status and events
kubectl describe pod <virt-launcher-pod> -n <namespace>

# Check pod logs for errors
kubectl logs <virt-launcher-pod> -n <namespace> -c compute
# If using Istio, also check the sidecar proxy container
kubectl logs <virt-launcher-pod> -n <namespace> -c istio-proxy

# Optional: Check resource usage for the virt-launcher pod
kubectl top pod <virt-launcher-pod> -n <namespace>
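
# Optional: summarize per-container restart counts and readiness
kubectl get pod <virt-launcher-pod> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.ready}{"\n"}{end}'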

3. Investigate Node Health

# Check node status and conditions (may require admin
# permissions)
kubectl describe node <node-name>

# Discover the KubeVirt installation namespace
export NAMESPACE="$(kubectl get kubevirt -A -o custom-columns="":.metadata.namespace)"

# Check virt-handler on the affected node
kubectl get pods -n "$NAMESPACE" -o wide | grep <node-name>
kubectl logs <virt-handler-pod> -n "$NAMESPACE"
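
# Optional: print only the node conditions to spot pressure or
# readiness problems
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'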

4. Check Storage and Volumes

# Verify PVC status and mounting
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# Check volume attachments on the node
kubectl get volumeattachment | grep <node-name>

# For DataVolumes, check their status
kubectl get dv -n <namespace>
kubectl describe dv <dv-name> -n <namespace>
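
# Optional: list attachments with their node, PV, and attach status
kubectl get volumeattachment \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached \
  | grep <node-name>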

5. Verify Image Accessibility from Node

# Verify image accessibility from the affected node
kubectl debug node/<node-name> -it --image=busybox

# Inside the debug pod, check which container runtime is used
ps aux | grep -E "(containerd|dockerd|crio)"

# The busybox image does not ship the runtime CLIs, so run the
# host's binaries through the node root filesystem mounted at /host
# For CRI-O/containerd clusters:
chroot /host crictl pull <vm-disk-image>

# For Docker-based clusters (less common):
chroot /host docker pull <vm-disk-image>

# Exit the debug session when done
exit
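
# Back on your workstation, image pull problems also surface as
# events on the virt-launcher pod
kubectl get events -n <namespace> \
  --field-selector involvedObject.name=<virt-launcher-pod> \
  | grep -iE "pull|image"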

6. Inspect libvirt Domains in the virt-launcher Pod

kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
  -- virsh list --all | grep <vm-name>
kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
  -- virsh dumpxml <domain-name>
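
# If the domain exists but is not running, virsh can also report the
# state together with the reason it entered that state
kubectl exec -it <virt-launcher-pod> -n <namespace> -c compute \
  -- virsh domstate <domain-name> --reason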

Mitigation

Pod-Level Issues

  1. Restart the virt-launcher pod:
    kubectl delete pod <virt-launcher-pod> -n <namespace>
    # The VMI controller will recreate it
    
  2. Check resource constraints:
    kubectl describe pod <virt-launcher-pod> -n <namespace>
    # Look for resource limit violations
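    # If the pod has restarted, the last termination reason often
    # explains why (for example OOMKilled); a minimal check using
    # core Pod API fields:
    kubectl get pod <virt-launcher-pod> -n <namespace> \
      -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'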
    

Image Issues on Node

  1. Inspect and, if necessary, clear the image cache on the node:
    # SSH to the node or start a debug session on the node:
    kubectl debug node/<node-name> -it --image=busybox
    
    # Detect which container runtime is in use
    ps aux | grep -E "(containerd|dockerd|crio)"
    
    # In a debug pod, prefix the runtime CLI with "chroot /host" to
    # use the host's binaries; over SSH, run the commands directly.
    # List cached images first
    # For CRI-O/containerd clusters:
    crictl images
    # For Docker-based clusters:
    docker images
    
    # Remove only if a corrupted/stale image is suspected
    # For CRI-O/containerd clusters:
    crictl rmi <problematic-image>
    # For Docker-based clusters:
    docker rmi <problematic-image>
    
    exit
    
  2. Force an image re-pull:
    # Delete the virt-launcher pod; the VMI controller recreates it
    # and pulls the image again according to the image pull policy
    kubectl delete pod <virt-launcher-pod> -n <namespace>
    

Storage Issues

  1. Check PVC binding and mounting:
    kubectl get pvc -n <namespace>
    # If PVC is stuck, check the storage provisioner
    
  2. Resolve volume attachment issues:
    kubectl get volumeattachment
    # Delete stuck volume attachments if necessary
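    # Before deleting, confirm the attachment is actually stale; the
    # VolumeAttachment status records attach/detach errors
    # (storage.k8s.io/v1 fields)
    kubectl get volumeattachment <attachment-name> \
      -o jsonpath='{.spec.source.persistentVolumeName}{"\t"}{.status.attached}{"\t"}{.status.detachError.message}{"\n"}'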
    kubectl delete volumeattachment <attachment-name>
    

Node-Level Issues Resolution

  1. Drain the node if it is in a bad state, and make it schedulable
     again once the underlying problem is fixed:
    kubectl drain <node-name> --ignore-daemonsets \
      --delete-emptydir-data
    # After the node issue is resolved:
    kubectl uncordon <node-name>
    
  2. Restart node-level components:
    # Restart virt-handler on the node ($NAMESPACE is the KubeVirt
    # installation namespace discovered during diagnosis)
    kubectl delete pod <virt-handler-pod> -n "$NAMESPACE"
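    # Verify a replacement pod comes back Running on the node; the
    # kubevirt.io=virt-handler label is the one applied by standard
    # KubeVirt deployments (adjust if yours differs)
    kubectl get pods -n "$NAMESPACE" -l kubevirt.io=virt-handler \
      -o wide | grep <node-name>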
    

VM-Level Resolution

  1. Force-delete the VMI (the VM controller will create a new one):
    kubectl delete vmi <vm-name> -n <namespace> --force \
      --grace-period=0
    
  2. Migrate the VM to a different node:
    virtctl migrate <vm-name> -n <namespace>
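    # Optionally follow the resulting migration object; vmim is the
    # short name for VirtualMachineInstanceMigration
    kubectl get vmim -n <namespace> -w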
    

Emergency Actions

Prevention

  1. Node Health Monitoring:
    • Monitor node resource utilization (CPU, memory, storage)
    • Set up alerts for node conditions and taints
    • Perform regular health checks on container runtime
  2. Resource Management:
    • Set appropriate resource requests/limits on VMs
    • Monitor PVC and storage utilization
    • Plan for node capacity and VM density
  3. Image Management:
    • Use image pull policies appropriately (Always, IfNotPresent)
    • Pre-pull critical images to nodes
    • Monitor image registry health and connectivity
  4. Networking:
    • Ensure stable network connectivity between nodes and storage
    • Monitor DNS resolution and service discovery
    • Validate network policies do not block required traffic
  5. Regular Maintenance:
    • Keep nodes and KubeVirt components updated

Escalation

Escalate to the cluster administrator if:

If you cannot resolve the issue, see the following resources: