A large number of virt-launcher Pods remain in the Failed phase.
Condition: `cluster:kubevirt_virt_launcher_failed:count >= 200` remains true for at least 10 minutes.
Virt-launcher Pods host virtual machine (VM) workloads, and mass failures can indicate
migration loops, image/network/storage issues, or control-plane regressions.
VMs and the cluster control plane might be affected.

Check the current value of the alert metric:

```promql
cluster:kubevirt_virt_launcher_failed:count
```
Break down the failed Pods by namespace:

```promql
count by (namespace) (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1)
```
Break down the failed Pods by node (`kube_pod_status_phase` does not carry a node label, so join with `kube_pod_info`):

```promql
count by (node) (
  (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1)
  * on (namespace, pod) group_left (node) kube_pod_info
)
```
Identify the most common last-termination reasons for virt-launcher containers:

```promql
topk(5, count by (reason) (kube_pod_container_status_last_terminated_reason{pod=~"virt-launcher-.*"} == 1))
```
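To run these queries from the command line instead of the monitoring UI, here is a minimal sketch, assuming the cluster monitoring Prometheus runs in the `openshift-monitoring` namespace and exposes the `prometheus-operated` Service (adjust the namespace, Service name, and any required authentication for your environment):

```bash
# Port-forward the monitoring Prometheus locally (namespace and Service name are assumptions).
$ kubectl -n openshift-monitoring port-forward svc/prometheus-operated 9090:9090 &

# Evaluate the alert metric through the Prometheus HTTP API; the same pattern
# works for the per-namespace, per-node, and per-reason queries above.
$ curl -s http://localhost:9090/api/v1/query \
    --data-urlencode 'query=cluster:kubevirt_virt_launcher_failed:count' | jq '.data.result'
```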
```bash
# List a few failed virt-launcher Pods cluster-wide.
$ kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed --no-headers | head -n 20

# Inspect events for a representative Pod (image, CNI, storage, and other errors).
$ kubectl -n <namespace> describe pod <virt-launcher-pod> | sed -n '/Events/,$p'
```
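To get an overview instead of describing Pods one at a time, here is a sketch that prints the tail of the Events section for every failed virt-launcher Pod (the number of lines kept per Pod is arbitrary):

```bash
# Print the last few events for each failed virt-launcher Pod.
$ kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers \
  | while read -r ns pod; do
      echo "== ${ns}/${pod} =="
      kubectl -n "$ns" describe pod "$pod" | sed -n '/Events/,$p' | tail -n 3
    done
```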
Check for stuck or looping migrations and review the KubeVirt control-plane logs:

```bash
# List VirtualMachineInstanceMigration objects across all namespaces.
$ kubectl get vmim -A

# Find the namespace where KubeVirt is installed.
$ NAMESPACE="$(kubectl get kubevirt -A -o jsonpath='{.items[0].metadata.namespace}')"

# Review recent virt-controller and virt-handler logs.
$ kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-controller --tail=200
$ kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-handler --tail=200
```
Pay particular attention to repeated ImagePullBackOff events, which point to image or registry problems.

Reduce the blast radius:
```bash
# Cancel all in-flight migrations if a migration loop is driving the failures.
$ kubectl delete vmim --all -A
```
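Deleting every migration object is disruptive; if only a subset is looping, a more targeted sketch (the namespace is a placeholder) first reviews the migration phases and then removes migrations in a single affected namespace:

```bash
# Review in-flight migrations and their phases before removing anything.
$ kubectl get vmim -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase

# Remove migrations only in one affected namespace.
$ kubectl -n <namespace> delete vmim --all
```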
Clean up failed Pods (this relieves pressure on the API server, etcd, and monitoring):

```bash
# Delete every failed virt-launcher Pod in its own namespace.
$ kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers \
  | while read -r ns pod; do kubectl delete pod -n "$ns" "$pod"; done
```
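To preview the cleanup before running it, the same loop can be executed with a client-side dry run (nothing is deleted):

```bash
# Show which Pods would be deleted without actually deleting them.
$ kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers \
  | while read -r ns pod; do kubectl delete pod -n "$ns" "$pod" --dry-run=client; done
```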
Resolve the root cause identified during diagnosis, for example migration loops, image pull failures, or network and storage issues.
Validate the resolution (the alert should clear):

```promql
cluster:kubevirt_virt_launcher_failed:count
```

Ensure that the failed count drops and stays below the threshold, that new virt-launcher Pods start successfully, and that VMIs are healthy.
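To watch recovery from the CLI, a sketch that re-checks the metric (reusing the port-forward from the diagnosis sketch above) and confirms that VMIs and virt-launcher Pods are healthy:

```bash
# Re-evaluate the alert metric every 30 seconds.
$ watch -n 30 "curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=cluster:kubevirt_virt_launcher_failed:count' | jq '.data.result'"

# Confirm VMIs are running and no virt-launcher Pods remain outside the Running phase.
$ kubectl get vmi -A
$ kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase!=Running
```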
If you cannot resolve the issue, see the following resources: