Pods are not recreated after Kubernetes node failure


We recently had a short outage caused by pods not being recreated on other nodes after the node they were running on became unresponsive. The cluster runs Kubernetes 1.6, and according to the documentation, this behavior is expected in some cases:

“If the Status of the Ready condition is “Unknown” or “False” for longer than the pod-eviction-timeout, an argument passed to the kube-controller-manager, all of the Pods on the node are scheduled for deletion by the Node Controller. The default eviction timeout duration is five minutes. In some cases when the node is unreachable, the apiserver is unable to communicate with the kubelet on it. The decision to delete the pods cannot be communicated to the kubelet until it re-establishes communication with the apiserver. In the meantime, the pods which are scheduled for deletion may continue to run on the partitioned node.

In versions of Kubernetes prior to 1.5, the node controller would force delete these unreachable pods from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. One can see these pods which may be running on an unreachable node as being in the “Terminating” or “Unknown” states. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on it to be deleted from the apiserver, freeing up their names.”
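
The timing rule described in the quoted documentation can be sketched roughly as follows. This is only an illustrative model, not the node controller's actual code: the function names are hypothetical, and the five-minute value is the documented default for pod-eviction-timeout.

```python
from datetime import datetime, timedelta

# Default quoted in the documentation; the real value comes from the
# pod-eviction-timeout argument of the kube-controller-manager.
POD_EVICTION_TIMEOUT = timedelta(minutes=5)


def should_schedule_pod_deletion(ready_status, last_transition, now,
                                 timeout=POD_EVICTION_TIMEOUT):
    """Return True once Ready has been "Unknown" or "False" for longer than the timeout."""
    if ready_status not in ("Unknown", "False"):
        return False
    return now - last_transition > timeout


def pod_state_after_timeout(kubelet_reachable):
    """Hypothetical helper: what an operator may observe in 1.5 and higher.

    If the kubelet cannot be reached, the deletion cannot be confirmed, so the
    pod object stays in the apiserver instead of being force deleted.
    """
    return "Deleted" if kubelet_reachable else "Terminating or Unknown"


if __name__ == "__main__":
    went_unknown_at = datetime(2017, 1, 1, 12, 0, 0)
    six_minutes_later = went_unknown_at + timedelta(minutes=6)
    print(should_schedule_pod_deletion("Unknown", went_unknown_at, six_minutes_later))
    print(pod_state_after_timeout(kubelet_reachable=False))
```

The second helper is the part that bit us: once the timeout has passed on a partitioned node, the pods are only marked for deletion, and they remain in that state until the kubelet reconnects or the node object is deleted by hand.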

To mitigate the risk of running into this problem again, I don’t see any solution other than upgrading the cluster to Kubernetes 1.8, which has an alpha feature that can potentially solve this issue. Simply put, it is taint-based evictions.

For example, the node controller automatically taints a node as unreachable when the status of its Ready condition becomes Unknown. All pods on that node that don’t tolerate this taint are then evicted immediately.
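
To make the idea concrete, here is a minimal sketch of how the eviction decision could be modeled. It is a simplification under stated assumptions, not Kubernetes code: the classes and function names are hypothetical, the taint key is the one I believe was used in the 1.8 era and may differ in other versions, and details such as tolerationSeconds or tolerations with an empty key are left out.

```python
from dataclasses import dataclass
from typing import List

# Assumed taint key, for illustration only; the exact key depends on the
# Kubernetes version.
UNREACHABLE_TAINT_KEY = "node.alpha.kubernetes.io/unreachable"


@dataclass
class Taint:
    key: str
    effect: str  # e.g. "NoExecute"
    value: str = ""


@dataclass
class Toleration:
    key: str
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # an empty effect matches any taint effect


@dataclass
class Pod:
    name: str
    tolerations: List[Toleration]


def tolerates(toleration, taint):
    """Simplified matching: same key, compatible effect, and value (unless Exists)."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.value == taint.value


def pods_to_evict(pods, node_taints):
    """Return names of pods with at least one untolerated NoExecute taint."""
    no_execute = [t for t in node_taints if t.effect == "NoExecute"]
    evicted = []
    for pod in pods:
        for taint in no_execute:
            if not any(tolerates(tol, taint) for tol in pod.tolerations):
                evicted.append(pod.name)
                break
    return evicted


if __name__ == "__main__":
    unreachable = Taint(key=UNREACHABLE_TAINT_KEY, effect="NoExecute")
    pods = [
        Pod("web", []),
        Pod("agent", [Toleration(key=UNREACHABLE_TAINT_KEY, operator="Exists",
                                 effect="NoExecute")]),
    ]
    print(pods_to_evict(pods, [unreachable]))
```

In this model only the pod without a matching toleration is selected for eviction, which is the behavior described above.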

Version 1.8 introduces an alpha feature that automatically creates taints that represent conditions. To enable this behavior, pass an additional feature gate flag --feature-gates=...,TaintNodesByCondition=true to the API server, controller manager, and scheduler. When TaintNodesByCondition is enabled, the scheduler ignores conditions when considering a Node; instead it looks at the Node’s taints and a Pod’s tolerations.
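
Since the flag takes a comma-separated list, the new gate has to be appended to whatever gates are already set rather than replacing them. A small helper along these lines could build the value; the function name and the "SomeOtherGate" entry in the usage example are hypothetical and only there for illustration.

```python
def with_feature_gate(existing, name, enabled=True):
    """Return a --feature-gates value with `name` set, keeping the other gates."""
    gates = {}
    for item in existing.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        gates[key] = value
    # Setting an existing key keeps its position; a new key is appended.
    gates[name] = "true" if enabled else "false"
    return ",".join(f"{key}={value}" for key, value in gates.items())


if __name__ == "__main__":
    # "SomeOtherGate" is a placeholder, not a real feature gate.
    value = with_feature_gate("SomeOtherGate=true", "TaintNodesByCondition")
    print(f"--feature-gates={value}")
```

The same resulting value would then need to be passed to the API server, the controller manager, and the scheduler.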


