monitoring

HAControlPlaneDown

Meaning

A control plane node has been detected as not ready for more than 5 minutes.

Impact

When a control plane node is down, it affects the high availability and redundancy of the Kubernetes control plane. This can negatively impact:

API server availability
Controller manager operations
Scheduler functionality
etcd cluster health (if etcd is co-located)

Diagnosis

Check the status of all control plane nodes:

kubectl get nodes -l node-role.kubernetes.io/control-plane=''

Get detailed information about the affected node:
```
kubectl describe node <node-name>
```

Review system logs on the affected node:

ssh <node-address>

journalctl -xeu kubelet

Mitigation

Check node resources:

Verify CPU, memory, and disk usage

 # Check the node's CPU and memory resource usage
 kubectl top node <node-name>

 # Check node status conditions for DiskPressure status
 kubectl get node <node-name> -o yaml | jq '.status.conditions[] | select(.type == "DiskPressure")'

Clear disk space if necessary
Restart the kubelet if resource issues are resolved

If the node is unreachable:
- Verify network connectivity
- Check physical/virtual machine status
- Ensure the node has power and is running

If the kubelet is generating errors:

systemctl status kubelet

systemctl restart kubelet

If the node cannot be recovered:
- If possible, safely drain the node
```
 kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
- Investigate hardware/infrastructure issues
- Consider replacing the node if necessary

Additional notes

Maintain at least three control plane nodes for high availability
Monitor etcd cluster health if the affected node runs etcd
Document any infrastructure-specific recovery procedures

If you cannot resolve the issue, see the following resources:

This site is open source. Improve this page.