HAControlPlaneDown
Meaning
A control plane node has been detected as not ready for more than 5 minutes.
Impact
When a control plane node is down, it affects the high availability and
redundancy of the Kubernetes control plane. This can negatively impact:
- API server availability
- Controller manager operations
- Scheduler functionality
- etcd cluster health (if etcd is co-located)
Diagnosis
- Check the status of all control plane nodes:
kubectl get nodes -l node-role.kubernetes.io/control-plane=''
- Get detailed information about the affected node:
kubectl describe node <node-name>
- Review system logs on the affected node:
Mitigation
- Check node resources:
- Verify CPU, memory, and disk usage
# Check the node's CPU and memory resource usage
kubectl top node <node-name>
# Check node status conditions for DiskPressure status
kubectl get node <node-name> -o yaml | jq '.status.conditions[] | select(.type == "DiskPressure")'
- Clear disk space if necessary
- Restart the kubelet if resource issues are resolved
- If the node is unreachable:
- Verify network connectivity
- Check physical/virtual machine status
- Ensure the node has power and is running
- If the kubelet is generating errors:
systemctl restart kubelet
- If the node cannot be recovered:
- If possible, safely drain the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
- Investigate hardware/infrastructure issues
- Consider replacing the node if necessary
Additional notes
- Maintain at least three control plane nodes for high availability
- Monitor etcd cluster health if the affected node runs etcd
- Document any infrastructure-specific recovery procedures
If you cannot resolve the issue, see the following resources: