VMNonRecoverableOSPanic

Meaning

This alert fires when a VM has experienced more than 5 non-recoverable guest OS panics in the last 24 hours. The alert is based on the kubevirt_vmi_guest_os_panic_total metric, which tracks panic events for all panic types (pvpanic, hyper-v, s390, etc.).

A non-recoverable guest OS panic indicates that the guest kernel or operating system crashed and was unable to recover on its own (for example, via kdump). The crash is detected via the pvpanic device (Linux and Windows) or the Hyper-V enlightenment mechanism (Windows) and is reported by QEMU/libvirt to KubeVirt. In the current default KubeVirt configuration, the VM is destroyed on panic; with a RunStrategy of Always, it is then restarted automatically.
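
To see which restart behavior applies to a particular VM, its run strategy can be read from the VM object. The following is a minimal sketch, assuming the VM defines spec.runStrategy; the function name vm_run_strategy is hypothetical, and the command prints nothing if the VM uses the older spec.running field instead.

```shell
# Print the run strategy of a VM (for example "Always").
# vm_run_strategy is an illustrative helper name, not part of KubeVirt.
# Arguments: $1 = namespace, $2 = VM name.
vm_run_strategy() {
  kubectl get vm -n "$1" "$2" -o jsonpath='{.spec.runStrategy}'
}
```

For example, vm_run_strategy "$NAMESPACE" "$VM_NAME" with the variables set as in the Diagnosis steps.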

Impact

Diagnosis

  1. Confirm the alert labels (namespace, name) in Alertmanager or the monitoring console and set variables for the following steps:

    export NAMESPACE="<alert namespace label>"
    export VM_NAME="<alert name label>"
    
  2. Check the VMI phase and events:

    kubectl get vmi -n "$NAMESPACE" "$VM_NAME" -o wide
    kubectl describe vmi -n "$NAMESPACE" "$VM_NAME"

    Look for GuestPanicked or Stopped events.

  3. Inspect the guest OS panic metric (includes panic type and bugcheck code) in the monitoring console, substituting the values of $NAMESPACE and $VM_NAME:

    kubevirt_vmi_guest_os_panic_total{namespace="$NAMESPACE", name="$VM_NAME"}
    
  4. Evaluate the alert expression to see the number of panics in the last 24 hours; the alert fires when the result is greater than 5:

    sum by (namespace, name) (increase(kubevirt_vmi_guest_os_panic_total{namespace="$NAMESPACE", name="$VM_NAME"}[24h]))
    
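  5. Review the virt-launcher pod that runs the VMI and its logs:

    # Find the virt-launcher pod for the VMI
    POD=$(kubectl get pod -n "$NAMESPACE" -l kubevirt.io/domain="$VM_NAME" -o name | head -n1)
    kubectl describe "$POD" -n "$NAMESPACE"
    # Logs from the previous container instance, if one exists
    kubectl logs "$POD" -n "$NAMESPACE" -c compute --previous
    # Logs from the current container instance
    kubectl logs "$POD" -n "$NAMESPACE" -c compute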
    
  6. For Windows guests, collect crash dumps or event logs from inside the guest after the VM is running again. The bugcheck_code label on the metric provides the Windows BSOD code.
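
The PromQL expressions in steps 3 and 4 use shell-style variables, which a Prometheus console does not expand. The sketch below shows one way to build the step 4 query in the shell and send it to the Prometheus HTTP API with curl. PROM_URL and both function names are assumptions for illustration; depending on the cluster, the endpoint may also require authentication that is not shown here.

```shell
# Build the step 4 expression with the namespace and VM name filled in.
# Arguments: $1 = namespace, $2 = VM name.
panic_increase_query() {
  printf 'sum by (namespace, name) (increase(kubevirt_vmi_guest_os_panic_total{namespace="%s", name="%s"}[24h]))' \
    "$1" "$2"
}

# Send the expression to the Prometheus instant-query endpoint.
# PROM_URL is an assumption: set it to a Prometheus base URL you can reach.
query_panic_increase() {
  curl -sS -G "${PROM_URL:?set PROM_URL first}/api/v1/query" \
    --data-urlencode "query=$(panic_increase_query "$1" "$2")"
}
```

With the variables from step 1 set, query_panic_increase "$NAMESPACE" "$VM_NAME" returns the JSON response from Prometheus; a result value greater than 5 matches the alert condition.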

Mitigation

If you cannot resolve the issue, see the following resources: