monitoring

HAControlPlaneDown

Meaning

A control plane node has been detected as not ready for more than 5 minutes.

Impact

When a control plane node is down, it affects the high availability and redundancy of the Kubernetes control plane. This can negatively impact:

Diagnosis

  1. Check the status of all control plane nodes:
    kubectl get nodes -l node-role.kubernetes.io/control-plane=''
    
  2. Get detailed information about the affected node:
    kubectl describe node <node-name>
    
  3. Review system logs on the affected node:
    ssh <node-address>
    
    journalctl -xeu kubelet
    

Mitigation

  1. Check node resources:
    • Verify CPU, memory, and disk usage
       # Check the node's CPU and memory resource usage
       kubectl top node <node-name>
      
       # Check node status conditions for DiskPressure status
       kubectl get node <node-name> -o yaml | jq '.status.conditions[] | select(.type == "DiskPressure")'
      
    • Clear disk space if necessary
    • Restart the kubelet if resource issues are resolved
  2. If the node is unreachable:
    • Verify network connectivity
    • Check physical/virtual machine status
    • Ensure the node has power and is running
  3. If the kubelet is generating errors:
    systemctl status kubelet
    
    systemctl restart kubelet
    
  4. If the node cannot be recovered:
    • If possible, safely drain the node
       kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
      
    • Investigate hardware/infrastructure issues
    • Consider replacing the node if necessary

Additional notes

If you cannot resolve the issue, see the following resources: