Kubernetes: How to configure Deployment to evenly spread Pods across availability zones

5 min read | by Jordi Prats

If you run Kubernetes workloads on AWS, you want to make sure Pods are spread across all the available availability zones. To do so we can use podAntiAffinity to tell Kubernetes to avoid scheduling all the Pods of the same Deployment in the same AZ.

To do so, we first have to check the node labels to pick a suitable one:

$ kubectl describe node ip-10-12-100-194.eu-west-1.compute.internal
Name:               ip-10-12-100-194.eu-west-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5a.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eu-west-1
                    failure-domain.beta.kubernetes.io/zone=eu-west-1c
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-12-100-194.eu-west-1.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=m5a.large
                    pti/eks-workers-group-name=system
                    pti/lifecycle=ondemand
                    topology.ebs.csi.aws.com/zone=eu-west-1c
                    topology.kubernetes.io/region=eu-west-1
                    topology.kubernetes.io/zone=eu-west-1c
                    vpc.amazonaws.com/has-trunk-attached=true
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-62db780c1e5000ac0"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
(...)

In fact, there are several labels that we could use; for this example we are going to use topology.kubernetes.io/zone.
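
A quick way to see which zone each node belongs to is to print that label as an extra column with kubectl's -L flag (the zone values will obviously depend on the cluster):

$ kubectl get nodes -L topology.kubernetes.io/zone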

To configure Pod anti-affinity there are two major ways of setting the rules (the corresponding fields are sketched below):

  • Hard: mandatory rules; if they cannot be satisfied, the Pod will not be scheduled
  • Soft: more like a suggestion; the scheduler will try to apply them but can ignore them if needed
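
As a quick orientation, this is a minimal sketch (not a complete manifest) of the two fields that express these rules under podAntiAffinity:

affinity:
  podAntiAffinity:
    # Hard: the Pod stays Pending if the rule cannot be satisfied
    requiredDuringSchedulingIgnoredDuringExecution: [...]
    # Soft: the scheduler tries to honour the rule but can ignore it
    preferredDuringSchedulingIgnoredDuringExecution: [...]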

If we apply a hard anti-affinity rule as follows, using one of the Pod's labels to identify the group of Pods that we want to keep in different AZs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aztest
spec:
  replicas: 1
  selector:
    matchLabels:
      component: aztest
  template:
    metadata:
      labels:
        component: aztest
    spec:
      containers:
      - name: aztest
        image: "alpine:latest"
        command:
        - sleep
        - '24h'
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: component
                operator: In
                values:
                - aztest
            topologyKey: topology.kubernetes.io/zone

If we deploy it, we will be able to create 3 replicas that will be spread across the three different AZs that this region has:

$ kubectl apply -f aztest-hard.yaml 
deployment.apps/aztest created
$ kubectl scale deploy aztest --replicas=3
deployment.apps/aztest scaled
$ kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
aztest-b48c988cc-n7pfz   1/1     Running   0          17s   10.12.100.163    ip-10-12-100-40.eu-west-2.compute.internal     <none>           <none>
aztest-b48c988cc-pf2pb   1/1     Running   0          11s   10.12.101.22     ip-10-12-101-92.eu-west-2.compute.internal     <none>           <none>
aztest-b48c988cc-w2kwh   1/1     Running   0          11s   10.12.102.67     ip-10-12-102-167.eu-west-2.compute.internal    <none>           <none>

However, if we try to spawn a fourth replica, it will stay in Pending state: since there are no more AZs and there is already a Pod in each of them, the scheduler cannot place another Pod, because the podAntiAffinity rule forbids scheduling a Pod in a zone that already has a Pod of this group:

$ kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
aztest-b48c988cc-2hzs9   0/1     Pending   0          7s    <none>           <none>                                         <none>           <none>
aztest-b48c988cc-n7pfz   1/1     Running   0          46s   10.12.100.163    ip-10-12-100-40.eu-west-2.compute.internal     <none>           <none>
aztest-b48c988cc-pf2pb   1/1     Running   0          40s   10.12.101.22     ip-10-12-101-92.eu-west-2.compute.internal     <none>           <none>
aztest-b48c988cc-w2kwh   1/1     Running   0          40s   10.12.102.67     ip-10-12-102-167.eu-west-2.compute.internal    <none>           <none>
$ kubectl describe pod aztest-b48c988cc-2hzs9
Name:           aztest-b48c988cc-2hzs9
Namespace:      test
Priority:       0
Node:           <none>
Labels:         component=aztest
                pod-template-hash=b48c988cc
(...)
Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   19s (x2 over 20s)  default-scheduler   0/14 nodes are available: 11 node(s) didn't match pod affinity/anti-affinity rules, 11 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {pti/role: system}, that the pod didn't tolerate.
  Normal   NotTriggerScaleUp  12s                cluster-autoscaler  pod didn't trigger scale-up: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {pti/role: system}, that the pod didn't tolerate

To be able to schedule more than three Pods while still distributing them across the available AZs, we will have to use a soft affinity rule as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aztest
spec:
  replicas: 1
  selector:
    matchLabels:
      component: aztest
  template:
    metadata:
      labels:
        component: aztest
    spec:
      containers:
      - name: aztest
        image: "alpine:latest"
        command:
        - sleep
        - '24h'
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: component
                  operator: In
                  values:
                  - aztest
              topologyKey: topology.kubernetes.io/zone
            weight: 100

If we try to create four replicas with this configuration, the scheduler will spread them across the available AZs, but when that is not possible it will schedule the extra Pod anyway, ignoring this rule:

$ kubectl apply -f aztest-soft.yaml 
deployment.apps/aztest configured
$ kubectl scale deploy aztest --replicas=4
deployment.apps/aztest scaled
$ kubectl get pods -o wide
NAME                     READY   STATUS        RESTARTS   AGE     IP               NODE                                           NOMINATED NODE   READINESS GATES
aztest-68cb665b7-22f6m   1/1     Running       0          9s      10.12.101.214   ip-10-12-101-244.eu-west-2.compute.internal     <none>           <none>
aztest-68cb665b7-47clb   1/1     Running       0          18s     10.12.101.43    ip-10-12-101-92.eu-west-2.compute.internal      <none>           <none>
aztest-68cb665b7-8qnz6   1/1     Running       0          9s      10.12.103.84    ip-10-12-103-167.eu-west-2.compute.internal     <none>           <none>
aztest-68cb665b7-lj5qm   1/1     Running       0          9s      10.12.102.89    ip-10-12-102-167.eu-west-2.compute.internal     <none>           <none>

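To double-check how the replicas ended up distributed per zone, we can map each Pod's node to its zone label. This is just a one-liner sketch that assumes the component=aztest label used in the examples above:

$ for node in $(kubectl get pods -l component=aztest -o jsonpath='{.items[*].spec.nodeName}'); do \
    kubectl get node $node -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'; \
  done | sort | uniq -c
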
Posted on 28/03/2022