This alert fires when a VM has experienced more than 5 non-recoverable guest
OS panics in the last 24 hours. The alert is based on the
kubevirt_vmi_guest_os_panic_total metric, which tracks panic events for all
panic types (pvpanic, hyper-v, s390, etc.).
A non-recoverable guest OS panic indicates that the guest kernel or operating
system crashed and was unable to recover on its own (e.g. via kdump). The
crash is detected via the pvpanic device (Linux and Windows) or the Hyper-V
enlightenment mechanism (Windows), and reported by QEMU/libvirt to KubeVirt.
In the current KubeVirt default configuration the VM is destroyed on panic,
but with a RunStrategy of Always it will be automatically restarted.
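Whether a panicked VM comes back on its own depends on its run strategy, so it can help to check that first. The sketch below assumes the run strategy is stored in the VirtualMachine's spec.runStrategy field. Both helper names are hypothetical, and only the Always case described above is treated as auto-restarting.

```shell
# Hypothetical helper: read the run strategy of a VM (args: namespace, VM name).
# Assumes the value lives in spec.runStrategy of the VirtualMachine object.
get_run_strategy() {
  kubectl get vm -n "$1" "$2" -o jsonpath='{.spec.runStrategy}'
}

# Hypothetical helper: succeeds only for the Always strategy, the one case
# this runbook describes as restarting automatically after a panic.
restarts_after_panic() {
  [ "$1" = "Always" ]
}

# Example (needs cluster access; namespace and VM name are placeholders):
# restarts_after_panic "$(get_run_strategy my-namespace my-vm)" \
#   && echo "VM is restarted automatically after a panic"
```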
The VMI enters the Failed phase when a non-recoverable panic is detected.
With RunStrategy: Always, it is automatically restarted.
Confirm the alert labels (namespace, name) in Alertmanager or the
monitoring console and set variables for the following steps:
export NAMESPACE="<alert namespace label>"
export VM_NAME="<alert name label>"
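Because every later command depends on these two variables, a small guard can catch an empty value or a placeholder that was never replaced. This is a minimal sketch; the function name check_alert_vars is hypothetical and not part of any KubeVirt tooling.

```shell
# Hypothetical guard: fail early when NAMESPACE or VM_NAME is empty or still
# holds an unreplaced "<...>" placeholder.
check_alert_vars() {
  for value in "$NAMESPACE" "$VM_NAME"; do
    case "$value" in
      ""|"<"*)
        echo "NAMESPACE and VM_NAME must be set from the alert labels" >&2
        return 1
        ;;
    esac
  done
}

# Example: stop before running any diagnosis command with bad input.
# check_alert_vars || exit 1
```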
Check the VMI phase and events:
kubectl get vmi -n $NAMESPACE $VM_NAME -o wide
kubectl describe vmi -n $NAMESPACE $VM_NAME
Look for GuestPanicked or Stopped events.
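On a busy namespace the describe output can be long, so one option is to list events for the object and keep only the two reasons named above. The sketch assumes the events are recorded against an object whose name matches $VM_NAME; the helper name filter_panic_events is hypothetical.

```shell
# Hypothetical helper: keep only input lines that mention the event reasons
# named in this step.
filter_panic_events() {
  grep -E 'GuestPanicked|Stopped'
}

# Example (needs cluster access): list events for the VMI, oldest first.
# kubectl get events -n "$NAMESPACE" \
#   --field-selector involvedObject.name="$VM_NAME" \
#   --sort-by=.lastTimestamp | filter_panic_events
```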
Inspect the guest OS panic metric (includes panic type and bugcheck code):
kubevirt_vmi_guest_os_panic_total{namespace="$NAMESPACE", name="$VM_NAME"}
Check the alert expression to see the number of panics in the last 24 hours:
sum by (namespace, name) (increase(kubevirt_vmi_guest_os_panic_total{namespace="$NAMESPACE", name="$VM_NAME"}[24h]))
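The same expression can also be run from a terminal through the Prometheus HTTP query API. The sketch below builds the query string and compares a returned value with the threshold of 5 stated at the top of this runbook. The PROM_URL variable and both function names are assumptions for illustration, and how you reach Prometheus depends on your cluster.

```shell
# Build the 24-hour panic-count query for one VM (args: namespace, VM name).
panic_count_query() {
  printf 'sum by (namespace, name) (increase(kubevirt_vmi_guest_os_panic_total{namespace="%s", name="%s"}[24h]))' "$1" "$2"
}

# Succeeds when the given value exceeds the alert threshold of 5 panics.
# awk is used because increase() can return a non-integer value.
over_threshold() {
  awk -v v="$1" 'BEGIN { exit !(v > 5) }'
}

# Example (PROM_URL is an assumption, e.g. a port-forwarded Prometheus):
# curl -sG "$PROM_URL/api/v1/query" \
#   --data-urlencode "query=$(panic_count_query "$NAMESPACE" "$VM_NAME")"
```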
Review virt-launcher logs on the node where the VMI ran:
POD=$(kubectl get pod -n $NAMESPACE -l kubevirt.io/domain=$VM_NAME -o name | head -n1)
kubectl describe $POD -n $NAMESPACE
kubectl logs $POD -n $NAMESPACE -c compute --previous
kubectl logs $POD -n $NAMESPACE -c compute
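The commands above inspect only the first matching pod, and the --previous call returns an error when the container has no earlier instance. As a hedged alternative, the sketch below loops over every pod matched by the same label selector and tolerates a missing previous instance. The function name collect_launcher_logs is hypothetical.

```shell
# Hypothetical helper: print compute-container logs for every virt-launcher pod
# of a VM (args: namespace, VM name), including the previous container
# instance when one exists.
collect_launcher_logs() {
  for pod in $(kubectl get pod -n "$1" -l "kubevirt.io/domain=$2" -o name); do
    echo "== $pod (previous) =="
    kubectl logs "$pod" -n "$1" -c compute --previous 2>/dev/null \
      || echo "no previous container instance"
    echo "== $pod (current) =="
    kubectl logs "$pod" -n "$1" -c compute
  done
}

# Example (needs cluster access):
# collect_launcher_logs "$NAMESPACE" "$VM_NAME" > launcher-logs.txt
```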
For Windows guests, collect crash dumps or event logs from inside the
guest after the VM is running again. The bugcheck_code label on the metric
provides the Windows BSOD code.
Restart the VM if the VMI is Failed and policy allows.
Ensure a panic device is configured (pvpanic for Linux, hyperv for Windows) to enable reliable crash
detection.
If you cannot resolve the issue, see the following resources: