No running virt-controller pod has been detected for 10 minutes.
The alert fires when the expression
cluster:kubevirt_virt_controller_pods_running:count == 0 holds for
the rule's for duration of 10 minutes. The recording rule counts pods in
the Running phase whose names match virt-controller-.*.
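As a rough cross-check outside Prometheus, a similar count can be approximated with kubectl. This is a minimal sketch, not the recording rule itself: it assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, ...), and the helper name count_running_virt_controllers is hypothetical and not part of KubeVirt.

```shell
# Hypothetical helper (not part of KubeVirt): reads pod listings on stdin
# and counts lines whose second column (the pod name in
# `kubectl get pods -A` output) starts with "virt-controller-".
count_running_virt_controllers() {
  awk '$2 ~ /^virt-controller-/ { n++ } END { print n + 0 }'
}

# Usage against a live cluster, listing only pods in the Running phase:
#   kubectl get pods -A --no-headers --field-selector=status.phase=Running \
#     | count_running_virt_controllers
```

A result of 0 is consistent with the condition the alert evaluates; any other value suggests that at least one matching pod is currently in the Running phase.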
In newer versions of KubeVirt, the alert expression is reworked to
surface additional diagnostic labels (pod, reason) when a
container waiting reason is available. If your alert includes these
labels, see step 1 of the diagnosis below.
Any actions related to virtual machine (VM) lifecycle management fail. This notably includes launching a new virtual machine instance (VMI) or shutting down an existing VMI.
Check the alert labels:
If the alert includes a reason label (for example,
CrashLoopBackOff, ErrImagePull, ImagePullBackOff), it
directly identifies why virt-controller is down. The pod
label identifies the affected pod. Skip to
Mitigation for the matching root cause. If these
labels are not present, continue with the steps below.
Set the NAMESPACE environment variable:
$ export NAMESPACE="$(kubectl get kubevirt -A \
-o custom-columns="":.metadata.namespace)"
Check the status of the virt-controller deployment:
$ kubectl -n $NAMESPACE get deploy virt-controller -o yaml
Check the virt-controller deployment details for issues such
as crashing pods or image pull failures:
$ kubectl -n $NAMESPACE describe deploy virt-controller
Check the status of the virt-controller pods:
$ kubectl -n $NAMESPACE get pods \
-l kubevirt.io=virt-controller
Review the logs of the virt-controller pods:
$ kubectl -n $NAMESPACE logs -l kubevirt.io=virt-controller \
--previous
$ kubectl -n $NAMESPACE logs -l kubevirt.io=virt-controller
Check for issues such as nodes in a NotReady state:
$ kubectl get nodes
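If the alert does not carry a reason label, the container waiting reason can usually be read from the pod status directly. The sketch below assumes NAMESPACE is set as in the earlier step. Both function names are hypothetical helpers, and the suggested next steps only restate the checks already listed in this runbook.

```shell
# Print each virt-controller pod name followed by its container waiting
# reason, if any (the second field is empty when no container is waiting).
list_waiting_reasons() {
  kubectl -n "$NAMESPACE" get pods -l kubevirt.io=virt-controller \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}

# Hypothetical helper: map a waiting reason to the check this runbook
# suggests for it.
suggest_next_step() {
  case "$1" in
    CrashLoopBackOff)
      echo "container is crashing: review the pod logs, including --previous" ;;
    ErrImagePull|ImagePullBackOff)
      echo "image pull failure: review the deployment and pod events" ;;
    "")
      echo "no waiting reason: check the deployment and node status" ;;
    *)
      echo "unrecognized reason: review the pod description and logs" ;;
  esac
}
```

For example, after running list_waiting_reasons, pass the reported reason to the second helper: suggest_next_step ImagePullBackOff.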
Try to identify the root cause and resolve the issue. Common causes include:
The virt-controller container is crashing repeatedly. Check the pod logs for the root cause (panic, OOM, misconfiguration).
No virt-controller pods exist. Check whether the deployment has been scaled to zero, deleted, or blocked by resource constraints.
Nodes are in a NotReady state or under resource pressure.
If you cannot resolve the issue, see the following resources: