Fine-tuning Pod scheduling using taints, tolerations and nodeSelector

If we want only a subset of Pods to be schedulable on a given node, we can achieve this using taints and tolerations.

With a taint we tell the scheduler not to place Pods on a node, while a toleration on a Pod allows that Pod to be scheduled there despite the taint.

First we are going to create a taint on a node:

$ kubectl taint nodes minikube-m02 application=example:NoSchedule
node/minikube-m02 tainted
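
For reference, the taint argument has the form key=value:effect. As a rough sketch, this is how the taint would be expected to appear in the Node object itself, for example in the output of kubectl get node minikube-m02 -o yaml (all other fields and any other taints are omitted here):

```yaml
# Fragment of the Node object, not a complete manifest
spec:
  taints:
  - key: application      # the part before "="
    value: example        # the part after "="
    effect: NoSchedule    # the part after ":"
```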

Using kubectl describe node we can see that it has been applied:

$ kubectl describe node minikube-m02

Name:               minikube-m02
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=minikube-m02
                    kubernetes.io/os=linux
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 19 Aug 2021 18:14:37 +0200
Taints:             node.kubernetes.io/not-ready:NoExecute
                    application=example:NoSchedule
                    node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
(...)

We can use a nodeSelector to try to schedule a Pod on this node:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    kubernetes.io/hostname: minikube-m02

But the Pod will remain in Pending state:

$ kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
example   0/1     Pending   0          3s

We can check the reason using kubectl describe pod: the only node that matches the nodeSelector has a taint that the Pod does not tolerate, so the Pod cannot be scheduled there:

$ kubectl describe pod example
Name:         example
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nginx:
    Image:        nginx
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f7bff (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-f7bff:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              application=example
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  11s (x2 over 13s)  default-scheduler  0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {application: example}, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.

We can add a toleration on the Pod for the taint that we have created:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    kubernetes.io/hostname: minikube-m02
  tolerations:
  - key: "application"
    operator: "Equal"
    value: "example"
    effect: "NoSchedule"
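
As a variation, a toleration can match on the key alone. The following is a minimal sketch and not part of the original example: with operator Exists no value is given, so the Pod would tolerate any taint with the key application and the NoSchedule effect, whatever its value.

```yaml
# Alternative to the tolerations block in the Pod spec above
tolerations:
- key: "application"
  operator: "Exists"      # matches any value of the key; "value" must be omitted
  effect: "NoSchedule"
```

The Equal form used above is the stricter choice, since it only tolerates the exact application=example taint.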

If we create this Pod, we can see that it is scheduled on this node, tolerating its taint:

$ kubectl describe pod example
Name:         example
Namespace:    default
Priority:     0
Node:         minikube-m02/192.168.49.3
Start Time:   Thu, 19 Aug 2021 19:01:54 +0200
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nginx:
    Container ID:   
    Image:          nginx
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7z5w8 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-7z5w8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=minikube-m02
Tolerations:                 application=example:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  15s   default-scheduler  Successfully assigned default/example to minikube-m02
  Normal  Pulling    11s   kubelet            Pulling image "nginx"
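
The two node.kubernetes.io tolerations in the output were not declared in our manifest; Kubernetes normally adds them to Pods by default. Written out explicitly, they would look roughly like the sketch below, which is reconstructed from the describe output above rather than copied from the cluster:

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # the Pod stays on the node for 300s after the taint appears, then is evicted
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```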

Posted on 20/08/2021