VirtualMachineStuckInUnhealthyState

Meaning

This alert triggers when a virtual machine (VM) is in an unhealthy state for more than 10 minutes and does not have an associated VMI (VirtualMachineInstance).

The alert indicates that a VM is experiencing early-stage lifecycle issues before a VMI can be successfully created. This typically occurs during the initial phases of VM startup when KubeVirt is trying to provision resources, pull images, or schedule the workload.
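As a quick first check, the following sketch lists VMs with their printable status and verifies whether a VMI exists for a given VM. It assumes `kubectl` access to a cluster with the KubeVirt CRDs installed; `<vm-name>` and `<namespace>` are placeholders:

```shell
# List all VMs with their printable status; a VM stuck before VMI creation
# typically shows a non-Running status here
kubectl get vm -A \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.printableStatus

# Check whether a VMI exists for a given VM (the VMI shares the VM's name)
kubectl get vmi <vm-name> -n <namespace> || echo "no VMI found for <vm-name>"
```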

Affected states:

Impact

Possible causes

Image and registry issues

Scheduling and node issues

Configuration issues

Diagnosis

  1. Check VM status and events
     # Get VM details and status
     $ kubectl get vm <vm-name> -n <namespace> -o yaml
    
     # Check VM events for error messages
     $ kubectl describe vm <vm-name> -n <namespace>
    
     # Look for related events in the namespace
     $ kubectl get events -n <namespace> --sort-by='.lastTimestamp'
    
  2. Verify resource availability
     # Check node resources and schedulability
     $ kubectl get nodes -o wide
     $ kubectl describe nodes
    
     # Check storage classes and provisioners
     $ kubectl get storageclass
     $ kubectl get pv,pvc -n <namespace>
    
     # For DataVolumes (if using)
     $ kubectl get datavolume -n <namespace>
     $ kubectl describe datavolume <dv-name> -n <namespace>
    
  3. Check image availability (for containerDisk)
     # If using containerDisk, verify image accessibility from the affected node
     # Start a debug session on the node hosting the VM (or a representative node)
     $ kubectl debug node/<node-name> -it --image=busybox
    
     # Inside the debug pod, switch to the host filesystem so the host's
     # runtime tools (crictl, docker) are on the PATH
     $ chroot /host
    
     # Check which container runtime is used
     $ ps aux | grep -E "(containerd|dockerd|crio)"
    
     # For CRI-O/containerd clusters use crictl to pull the image
     $ crictl pull <vm-disk-image>
    
     # For Docker-based clusters (less common)
     $ docker pull <vm-disk-image>
    
     # Exit the chroot, then the debug session, when done
     $ exit
     $ exit
    
     # Check image pull secrets if required
     $ kubectl get secrets -n <namespace>
    
  4. Verify KubeVirt configuration
     # Discover the KubeVirt installation namespace
     $ export NAMESPACE="$(kubectl get kubevirt -A -o jsonpath='{.items[0].metadata.namespace}')"
    
     # Check KubeVirt CR conditions (expect Available=True)
     $ kubectl get kubevirt -n "$NAMESPACE" \
       -o jsonpath='{range .items[*].status.conditions[*]}{.type}={.status}{"\n"}{end}'
    
     # Or check a single CR named 'kubevirt'
     $ kubectl get kubevirt kubevirt -n "$NAMESPACE" \
       -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
    
     # Verify virt-controller is running
     $ kubectl get pods -n "$NAMESPACE" \
       -l kubevirt.io=virt-controller
    
     # Check virt-controller logs for errors
     # Replace <virt-controller-pod> with a pod name from the list above
     $ kubectl logs -n "$NAMESPACE" <virt-controller-pod>
    
     # Verify virt-handler is running
     $ kubectl get pods -n "$NAMESPACE" \
       -l kubevirt.io=virt-handler -o wide
    
     # Check virt-handler logs for errors (daemonset uses per-node pods)
     # Replace <virt-handler-pod> with a pod name from the list above
     $ kubectl logs -n "$NAMESPACE" <virt-handler-pod>
    
  5. Review VM specification
     Inspect the following details in the VM’s spec to catch common misconfigurations:
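The spec review above can be scripted. This sketch pulls out the fields that most often cause early-lifecycle failures; the field paths assume a standard KubeVirt VirtualMachine spec, and `<vm-name>`/`<namespace>` are placeholders:

```shell
# Requested resources (an impossible memory/CPU request prevents scheduling)
kubectl get vm <vm-name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.domain.resources}{"\n"}'

# Volumes (a typo in a PVC or DataVolume name leaves the VM pending)
kubectl get vm <vm-name> -n <namespace> \
  -o jsonpath='{range .spec.template.spec.volumes[*]}{.name}{"\t"}{@}{"\n"}{end}'

# Scheduling constraints (a nodeSelector or affinity that matches no node)
kubectl get vm <vm-name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
```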

Mitigation

Resource issues

Image issues

Scheduling issues

Configuration issues resolution

Emergency workarounds

Prevention

Escalation

Escalate to the cluster administrator if:

If you cannot resolve the issue, see the following resources: