
VirtualMachineStuckOnNode

Meaning

This alert fires when a VirtualMachine with an associated VMI (VirtualMachineInstance) has been stuck in an unhealthy state for more than 5 minutes on a specific node.

The alert indicates that the VirtualMachine has progressed past initial scheduling and has an active VMI, but is experiencing runtime issues on its assigned node. This typically happens when the VM has been scheduled to a node and then runs into problems during its startup, operation, or shutdown phase.
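
To see at a glance which VMIs are unhealthy and which node they are assigned to, the VMI status fields can be listed directly. This is a minimal sketch using standard kubectl output formatting; the alert itself is driven by KubeVirt metrics, so treat this only as a quick approximation.

kubectl get vmi -A -o custom-columns=\
NAMESPACE:.metadata.namespace,NAME:.metadata.name,\
NODE:.status.nodeName,PHASE:.status.phase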

Affected States:

Impact

Possible Causes

Node-Level Issues

QEMU/KVM Issues

Image and Storage Issues

virt-launcher Pod Issues

libvirt/Domain Issues

Diagnosis

1. Check VM and VMI Status

# Get VM details with node information
kubectl get vm <vm-name> -n <namespace> -o yaml

# Check VMI status and node assignment
kubectl get vmi <vm-name> -n <namespace> -o yaml
kubectl describe vmi <vm-name> -n <namespace>

# Look for related events
kubectl get events -n <namespace> \
  --field-selector involvedObject.name=<vm-name>
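
# A condensed view of the VMI's conditions (a sketch; same data as the
# describe output above, just summarized with jsonpath)
kubectl get vmi <vm-name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.message}{"\n"}{end}'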

2. Examine virt-launcher Pod

# Find the virt-launcher pod for this VM
kubectl get pods -n <namespace> -l kubevirt.io/domain=<vm-name>

# Check pod status and events
kubectl describe pod <virt-launcher-pod> -n <namespace>

# Check pod logs for errors
kubectl logs <virt-launcher-pod> -n <namespace> -c compute
# If using Istio, also check the sidecar:
kubectl logs <virt-launcher-pod> -n <namespace> -c istio-proxy

# Optional: check resource usage for the virt-launcher pod (requires metrics-server)
kubectl top pod <virt-launcher-pod> -n <namespace>

3. Investigate Node Health

# Check node status and conditions (may require admin
# permissions)
kubectl describe node <node-name>

# Discover the KubeVirt installation namespace
export NAMESPACE="$(kubectl get kubevirt -A -o custom-columns="":.metadata.namespace)"

# Check virt-handler on the affected node
kubectl get pods -n "$NAMESPACE" -o wide | grep <node-name>
kubectl logs <virt-handler-pod> -n "$NAMESPACE"

4. Check Storage and Volumes

# Verify PVC status and mounting
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# Check volume attachments on the node
kubectl get volumeattachment | grep <node-name>

# For DataVolumes, check their status
kubectl get dv -n <namespace>
kubectl describe dv <dv-name> -n <namespace>

5. Verify Image Accessibility from Node

# Verify image accessibility from the affected node
kubectl debug node/<node-name> -it --image=busybox

# Inside the debug pod, the node's root filesystem is mounted at /host;
# switch to it so the node's own tools (ps, crictl, docker) are available
chroot /host

# Check which container runtime is in use
ps aux | grep -E "(containerd|dockerd|crio)"

# For CRI-O/containerd clusters:
crictl pull <vm-disk-image>

# For Docker-based clusters (less common):
docker pull <vm-disk-image>

# Exit the chroot, then the debug session, when done
exit
exit

6. Inspect libvirt Domains in the virt-launcher Pod

kubectl exec <virt-launcher-pod> -n <namespace> -c compute \
  -- virsh list --all | grep <vm-name>
kubectl exec <virt-launcher-pod> -n <namespace> -c compute \
  -- virsh dumpxml <domain-name>
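
If the domain is listed but not running, virsh can also report why it is in its current state. A small follow-up sketch, assuming the same compute-container access as above:

kubectl exec <virt-launcher-pod> -n <namespace> -c compute \
  -- virsh domstate <domain-name> --reason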

Mitigation

Pod-Level Issues

  1. Restart virt-launcher pod:
    kubectl delete pod <virt-launcher-pod> -n <namespace>
    # A new virt-launcher pod is created automatically as long as the VM is set to run
    
  2. Check resource constraints:
    kubectl describe pod <virt-launcher-pod> -n <namespace>
    # Look for resource limit violations
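    # A quick check for recently terminated containers (e.g. OOMKilled);
    # a sketch using jsonpath over the pod's containerStatuses
    kubectl get pod <virt-launcher-pod> -n <namespace> \
      -o jsonpath='{range .status.containerStatuses[*]}{.name}={.lastState.terminated.reason}{"\n"}{end}'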
    

Image Issues on Node

  1. Inspect and, if necessary, clear image cache on the node:
    # Start a debug session on the node (or SSH to it):
    kubectl debug node/<node-name> -it --image=busybox
    
    # Switch to the node's root filesystem so its runtime tools are available
    chroot /host
    
    # Detect which container runtime is in use
    ps aux | grep -E "(containerd|dockerd|crio)"
    
    # List cached images first
    # For CRI-O/containerd clusters:
    crictl images
    # For Docker-based clusters:
    docker images
    
    # Remove an image only if it is suspected to be corrupted or stale
    # For CRI-O/containerd clusters:
    crictl rmi <problematic-image>
    # For Docker-based clusters:
    docker rmi <problematic-image>
    
    # Exit the chroot, then the debug session
    exit
    exit
    
  2. Force image re-pull:
    # Delete and recreate the virt-launcher pod
    kubectl delete pod <virt-launcher-pod> -n <namespace>
    

Storage Issues

  1. Check PVC binding and mounting:
    kubectl get pvc -n <namespace>
    # If PVC is stuck, check the storage provisioner
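    # Hedged sketch: identify the PVC's StorageClass and its provisioner,
    # then verify the provisioner's pods are healthy (names vary by backend)
    kubectl get pvc <pvc-name> -n <namespace> \
      -o jsonpath='{.spec.storageClassName}{"\n"}'
    kubectl get storageclass <storage-class-name> \
      -o jsonpath='{.provisioner}{"\n"}'
    kubectl get pods -A | grep -i <provisioner-name>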
    
  2. Resolve volume attachment issues:
    kubectl get volumeattachment
    # Delete stuck volume attachments if necessary
    kubectl delete volumeattachment <attachment-name>
    

Node-Level Issues Resolution

  1. Drain the node if it is in a bad state, then uncordon it once it is healthy again:
    kubectl drain <node-name> --ignore-daemonsets \
      --delete-emptydir-data
    # After the node has recovered:
    kubectl uncordon <node-name>
    
  2. Restart node-level components:
    # Restart virt-handler on the node
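    # Locate the virt-handler pod on the affected node first (a sketch
    # assuming the standard kubevirt.io=virt-handler label)
    kubectl get pods -n "$NAMESPACE" -l kubevirt.io=virt-handler \
      --field-selector spec.nodeName=<node-name>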
    kubectl delete pod <virt-handler-pod> -n "$NAMESPACE"
    

VM-Level Resolution

  1. Force-delete the VMI (the VM controller creates a new VMI if the VM is still set to run):
    kubectl delete vmi <vm-name> -n <namespace> --force \
      --grace-period=0
    
  2. Migrate VM to different node:
    virtctl migrate <vm-name> -n <namespace>
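    # Optional follow-up (a sketch): watch the migration object created by
    # virtctl; VirtualMachineInstanceMigration is abbreviated vmim
    kubectl get vmim -n <namespace> -w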
    

Emergency Actions

Prevention

  1. Node Health Monitoring:
    • Monitor node resource utilization (CPU, memory, storage)
    • Set up alerts for node conditions and taints
    • Regular health checks on container runtime
  2. Resource Management:
    • Set appropriate resource requests/limits on VMs (see the sketch after this list)
    • Monitor PVC and storage utilization
    • Plan for node capacity and VM density
  3. Image Management:
    • Use image pull policies appropriately (Always, IfNotPresent)
    • Pre-pull critical images to nodes
    • Monitor image registry health and connectivity
  4. Networking:
    • Ensure stable network connectivity between nodes and storage
    • Monitor DNS resolution and service discovery
    • Validate network policies don’t block required traffic
  5. Regular Maintenance:
    • Keep nodes and KubeVirt components updated
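
As an example of the resource-management point above (item 2), explicit requests and limits can be set on the VM's guest domain. This is a minimal sketch that assumes the standard KubeVirt VirtualMachine schema (spec.template.spec.domain.resources); adjust the values to the workload.

kubectl patch vm <vm-name> -n <namespace> --type=merge \
  -p '{"spec":{"template":{"spec":{"domain":{"resources":{
        "requests":{"cpu":"1","memory":"2Gi"},
        "limits":{"cpu":"2","memory":"2Gi"}}}}}}}'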

Escalation

Escalate to the cluster administrator if:

If you cannot resolve the issue, see the following resources: