5 min read | by Jordi Prats
If you run Kubernetes workloads on AWS, you want to make sure Pods are spread across all the available availability zones (AZs). To do so we can use podAntiAffinity to tell Kubernetes to avoid deploying all the Pods of the same Deployment on the same AZ.
First, we have to check the node labels to pick a suitable one:
$ kubectl describe node ip-10-12-100-194.eu-west-1.compute.internal
Name:               ip-10-12-100-194.eu-west-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5a.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eu-west-1
                    failure-domain.beta.kubernetes.io/zone=eu-west-1c
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-12-100-194.eu-west-1.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=m5a.large
                    pti/eks-workers-group-name=system
                    pti/lifecycle=ondemand
                    topology.ebs.csi.aws.com/zone=eu-west-1c
                    topology.kubernetes.io/region=eu-west-1
                    topology.kubernetes.io/zone=eu-west-1c
                    vpc.amazonaws.com/has-trunk-attached=true
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-62db780c1e5000ac0"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
(...)
In fact, there are several labels we could use; for this example we are going to use topology.kubernetes.io/zone.
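As a quicker way to check which zone each node belongs to, we can ask kubectl to print that label as an extra column using the -L (--label-columns) flag:
$ kubectl get nodes -L topology.kubernetes.io/zone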
There are two major ways of setting this anti-affinity: a hard rule (requiredDuringSchedulingIgnoredDuringExecution) and a soft rule (preferredDuringSchedulingIgnoredDuringExecution).
Let's start by applying a hard rule as follows, using one of the Pod's labels to identify the group of Pods that we want to keep on different AZs:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aztest
spec:
  replicas: 1
  selector:
    matchLabels:
      component: aztest
  template:
    metadata:
      labels:
        component: aztest
    spec:
      containers:
        - name: aztest
          image: "alpine:latest"
          command:
            - sleep
            - '24h'
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: component
                    operator: In
                    values:
                      - aztest
              topologyKey: topology.kubernetes.io/zone
If we deploy it, we will be able to create 3 replicas that will be spread across the three different AZs that this region has:
$ kubectl apply -f aztest-hard.yaml
deployment.apps/aztest created
$ kubectl scale deploy aztest --replicas=3
deployment.apps/aztest scaled
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aztest-b48c988cc-n7pfz 1/1 Running 0 17s 10.12.100.163 ip-10-12-100-40.eu-west-2.compute.internal <none> <none>
aztest-b48c988cc-pf2pb 1/1 Running 0 11s 10.12.101.22 ip-10-12-101-92.eu-west-2.compute.internal <none> <none>
aztest-b48c988cc-w2kwh 1/1 Running 0 11s 10.12.102.67 ip-10-12-102-167.eu-west-2.compute.internal <none> <none>
However, if we try to spawn a fourth replica, it will stay in Pending state. Since there are no more AZs and each of them already holds a Pod of this group, the podAntiAffinity rule tells the scheduler that it cannot place another Pod in a zone that already has one:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aztest-b48c988cc-2hzs9 0/1 Pending 0 7s <none> <none> <none> <none>
aztest-b48c988cc-n7pfz 1/1 Running 0 46s 10.12.100.163 ip-10-12-100-40.eu-west-2.compute.internal <none> <none>
aztest-b48c988cc-pf2pb 1/1 Running 0 40s 10.12.101.22 ip-10-12-101-92.eu-west-2.compute.internal <none> <none>
aztest-b48c988cc-w2kwh 1/1 Running 0 40s 10.12.102.67 ip-10-12-102-167.eu-west-2.compute.internal <none> <none>
$ kubectl describe pod aztest-b48c988cc-2hzs9
Name: aztest-b48c988cc-2hzs9
Namespace: test
Priority: 0
Node: <none>
Labels: component=aztest
pod-template-hash=b48c988cc
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 19s (x2 over 20s) default-scheduler 0/14 nodes are available: 11 node(s) didn't match pod affinity/anti-affinity rules, 11 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {pti/role: system}, that the pod didn't tolerate.
Normal NotTriggerScaleUp 12s cluster-autoscaler pod didn't trigger scale-up: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {pti/role: system}, that the pod didn't tolerate
To be able to schedule more Pods than there are AZs while still distributing them across the available zones, we will have to use a soft affinity rule as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aztest
spec:
  replicas: 1
  selector:
    matchLabels:
      component: aztest
  template:
    metadata:
      labels:
        component: aztest
    spec:
      containers:
        - name: aztest
          image: "alpine:latest"
          command:
            - sleep
            - '24h'
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: component
                      operator: In
                      values:
                        - aztest
                topologyKey: topology.kubernetes.io/zone
              weight: 100
If we try to create four replicas with this configuration, the scheduler will spread them across the available AZs, but when that is not possible it will still schedule the extra Pod in a zone that already has one, ignoring the preference:
$ kubectl apply -f aztest-soft.yaml
deployment.apps/aztest configured
$ kubectl scale deploy aztest --replicas=4
deployment.apps/aztest scaled
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aztest-68cb665b7-22f6m 1/1 Running 0 9s 10.12.101.214 ip-10-12-101-244.eu-west-2.compute.internal <none> <none>
aztest-68cb665b7-47clb 1/1 Running 0 18s 10.12.101.43 ip-10-12-101-92.eu-west-2.compute.internal <none> <none>
aztest-68cb665b7-8qnz6 1/1 Running 0 9s 10.12.103.84 ip-10-12-103-167.eu-west-2.compute.internal <none> <none>
aztest-68cb665b7-lj5qm 1/1 Running 0 9s 10.12.102.89 ip-10-12-102-167.eu-west-2.compute.internal <none> <none>
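To double-check how the replicas ended up distributed, we can map each Pod's node to its zone label and count Pods per zone. A minimal sketch (reusing the component=aztest label from the manifests above) could be:
$ kubectl get pods -l component=aztest -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
    | xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' \
    | sort | uniq -c
With four replicas and three AZs we would expect one zone to show two Pods and the other two zones one Pod each.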
Posted on 28/03/2022