
VirtLauncherPodsStuckFailed

Meaning

A large number of virt-launcher Pods remain in the Failed phase.

Condition: cluster:kubevirt_virt_launcher_failed:count >= 200 holds for 10 minutes. Virt-launcher Pods host virtual machine (VM) workloads, and mass failures can indicate migration loops, image/network/storage issues, or control-plane regressions.
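
If you want a quick spot check without querying Prometheus, you can approximate what the recording rule counts directly with kubectl. This is only an approximation, not the rule's exact expression:

     # Rough spot check: count Failed virt-launcher Pods cluster-wide
     $ kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed --no-headers | wc -l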

Impact

VMs and the cluster control plane might be affected: the affected VMs might be unavailable or fail to start, and a large backlog of Failed Pods adds load to the API server, etcd, and monitoring.

Diagnosis

  1. Confirm scope and distribution:
     cluster:kubevirt_virt_launcher_failed:count
    
     count by (namespace) (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1)
    

     count by (node) (
       (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1)
       * on(pod) group_left(node) kube_pod_info{pod=~"virt-launcher-.*", node!=""}
     )
    
     topk(5, count by (reason) (kube_pod_container_status_last_terminated_reason{pod=~"virt-launcher-.*"} == 1))
    
  2. Sample failed Pods and events:
     # List a few failed virt-launcher pods cluster-wide
     $ kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed --no-headers | head -n 20
    
     # Inspect events for a representative pod (image, CNI, or storage errors)
     $ kubectl -n <namespace> describe pod <virt-launcher-pod> | sed -n '/Events/,$p'
    
  3. Check for migration storms (a per-VMI count sketch follows this list):
     $ kubectl get vmim -A
    
  4. Check the KubeVirt control-plane component logs for error spikes:
     $ NAMESPACE="$(kubectl get kubevirt -A -o jsonpath='{.items[0].metadata.namespace}')"
    
     $ kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-controller --tail=200
    
     $ kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-handler --tail=200
    
  5. Check the infrastructure for common causes such as node pressure, registry access, CNI/data-plane errors, and storage attach/mount failures (see the sketch below).
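
A minimal sketch of such infrastructure checks; the event grep patterns are illustrative only, so adjust them to your environment:

     # Node health (NotReady, resource pressure)
     $ kubectl get nodes
    
     # Recent warning events pointing at image, sandbox, or storage problems
     $ kubectl get events -A --field-selector=type=Warning --sort-by=.lastTimestamp \
         | grep -Ei 'imagepull|failedcreatepodsandbox|failedmount|failedattach' | tail -n 20
    
     # PVCs that are not Bound (possible storage trouble)
     $ kubectl get pvc -A --no-headers | grep -v Bound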

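To spot a migration loop from step 3, count migration objects per VMI; this sketch assumes jq is available:

     # VMIs with the most migration objects are the likely loopers
     $ kubectl get vmim -A -o json \
         | jq -r '.items[] | "\(.metadata.namespace)/\(.spec.vmiName)"' \
         | sort | uniq -c | sort -rn | head
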
Mitigation

  1. Reduce the blast radius:

    • Migration loop: cancel in-flight migrations, scoping as needed (a scoped sketch follows these steps).
      # Cancel all in-flight migrations cluster-wide (requires a kubectl version that supports -A with delete)
      $ kubectl delete vmim --all -A
      
    • Coordinate with noisy tenants; pause offending workloads if necessary.
  2. Clean up failed Pods (relieves pressure on the API server, etcd, and monitoring):

     $ kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed --no-headers \
         | while read -r ns pod _; do kubectl -n "$ns" delete pod "$pod"; done
    
  3. Resolve root cause:

    • Image issues: fix registry access, credentials, or tags; re-run affected workloads.
    • Network/CNI: fix CNI/data-plane errors; confirm new Pods start cleanly.
    • Storage: resolve attach/mount failures; verify PVC/VolumeSnapshot health.
    • KubeVirt regression: roll forward or back to a known-good version and retry.
  4. Validate resolution (alert clears):

     cluster:kubevirt_virt_launcher_failed:count
    

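As noted in step 1, the migration cancellation can be scoped instead of run cluster-wide; a minimal sketch, with <namespace> and <migration-name> as placeholders:

     # Cancel in-flight migrations in a single namespace only
     $ kubectl -n <namespace> delete vmim --all
    
     # Or cancel one specific migration object
     $ kubectl -n <namespace> delete vmim <migration-name>
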
Ensure that the failed count drops and stays below the threshold, that new virt-launcher Pods start successfully, and that VMIs are healthy.
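
To confirm the recording rule value itself, you can query the cluster's Prometheus API. This sketch assumes a kube-prometheus-style stack with a prometheus-k8s Service in the monitoring namespace and jq installed; adjust the namespace, Service, and port for your environment:

     # Port-forward the Prometheus API locally (Service name and namespace are assumptions)
     $ kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
    
     # The alert clears once this value stays below 200
     $ curl -s 'http://localhost:9090/api/v1/query' \
         --data-urlencode 'query=cluster:kubevirt_virt_launcher_failed:count' | jq '.data.result'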

If you cannot resolve the issue, see the following resources: