How to prevent Pods of the same Deployment from being scheduled on the same node

For some applications we want to avoid having two or more Pods of the same Deployment scheduled on the same node, yet we don't need the workload to be a DaemonSet. Take the cluster autoscaler as an example: we would like to run two replicas, but not on the same node. If both replicas share a node and we drain it while the other nodes lack the capacity to host them, both Pods are offline at once and manual intervention is required to spawn a new node. Here both replicas are running on the same node:

$ kubectl get pods -n autoscaler -o wide
NAME                                                 READY   STATUS    RESTARTS   AGE     IP              NODE                                           NOMINATED NODE   READINESS GATES
autoscaler-aws-cluster-autoscaler-585cc546dd-jc46d   1/1     Running   0          16h     10.103.195.47   ip-10-12-16-10.eu-west-1.compute.internal    <none>           <none>
autoscaler-aws-cluster-autoscaler-585cc546dd-s4j2r   1/1     Running   0          16h     10.103.195.147  ip-10-12-16-10.eu-west-1.compute.internal    <none>           <none>

To do so, we have to configure Pod anti-affinity.

On a Pod, affinity is configured under spec.affinity. On a Deployment it belongs in the Pod template, so the full path is spec.template.spec.affinity.
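As a minimal sketch, this is roughly where the block would sit in a plain Deployment manifest. The name my-app, the app label and the image are hypothetical placeholders, not taken from any real chart:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app              # hypothetical label
  template:
    metadata:
      labels:
        app: my-app            # must match both selectors in this manifest
    spec:
      affinity:                # this is spec.template.spec.affinity
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: my-app
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0   # placeholder image
```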

If we are using a Helm chart we will have to check if it's possible to set it. For example, we can set it for the cluster autoscaler by setting the following values:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: kubernetes.io/hostname
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: aws-cluster-autoscaler

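This means that the podAntiAffinity rule is a hard requirement (requiredDuringSchedulingIgnoredDuringExecution): it is enforced when a Pod is scheduled, but Pods that are already running are not evicted. The rule uses the node label kubernetes.io/hostname as its topology key and groups Pods by the label app.kubernetes.io/name with the value aws-cluster-autoscaler.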

So when the scheduler places a new Pod labeled app.kubernetes.io/name=aws-cluster-autoscaler, it will only select a node whose kubernetes.io/hostname value is not already in use by another Pod of the same group.
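Because the rule is required, a replica that cannot find a free node stays Pending, for example when there are more replicas than eligible nodes. If that is too strict, a softer variant is sketched below. It assumes the chart passes the affinity value through unchanged, and the weight of 100 is an arbitrary choice for illustration:

```yaml
affinity:
  podAntiAffinity:
    # Soft rule: the scheduler prefers to spread the Pods but may co-locate them
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100              # allowed range is 1-100; value chosen for illustration
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: aws-cluster-autoscaler
```

With this variant the scheduler tries to place the Pods on different nodes but will still put them together if no other node fits, so it does not give the same guarantee. The rest of this post keeps the required rule.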

By applying these settings we can see that the autoscaler Pods are no longer scheduled on the same node:

$ kubectl get pods -n autoscaler -o wide
NAME                                                 READY   STATUS    RESTARTS   AGE     IP              NODE                                           NOMINATED NODE   READINESS GATES
autoscaler-aws-cluster-autoscaler-77f6d6cf75-8srd7   1/1     Running   0          15m     10.103.195.19   ip-10-12-16-10.eu-west-1.compute.internal    <none>           <none>
autoscaler-aws-cluster-autoscaler-77f6d6cf75-v6jg5   1/1     Running   0          4m23s   10.103.199.47   ip-10-12-16-144.eu-west-1.compute.internal   <none>           <none>

Posted on 11/08/2021