Drain spot instances that are about to be terminated using the AWS node termination handler

If you run a mixed instances policy on the ASG of your EKS workers, you will want to install the AWS Node Termination Handler so that a node gets drained as soon as AWS notifies that its spot instance is going to be reclaimed.

Installing the termination handler is straightforward:

helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler --namespace termination-handler --create-namespace eks/aws-node-termination-handler

We can customize its settings with a values file; the chart's default values include a description of each one. A good starting point is the following, which enables draining spot instances when a termination notice is received:

enableRebalanceMonitoring: false
enableRebalanceDraining: false
enableScheduledEventDraining: ""
enableSpotInterruptionDraining: "true"
checkASGTagBeforeDraining: false
emitKubernetesEvents: true
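
As a rough sketch of how these values might be applied, assuming they are saved in a file called values.yaml (a hypothetical name) and that the eks repo has already been added. The function names are illustrative only:

```shell
#!/usr/bin/env bash
# Sketch only: function names and the values.yaml file name are hypothetical.

# Print the chart's default values, which describe every setting.
show_nth_defaults() {
  helm show values eks/aws-node-termination-handler
}

# Install or upgrade the release using a custom values file.
install_nth() {
  local values_file="${1:-values.yaml}"
  helm upgrade --install aws-node-termination-handler \
    --namespace termination-handler \
    --values "$values_file" \
    eks/aws-node-termination-handler
}
```

Running show_nth_defaults first and then install_nth values.yaml would review the defaults and apply the overrides.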

To test the termination handler we can install a fake metadata endpoint using amazon-ec2-metadata-mock. We just need to download one of its releases and install it like so:

$ helm install amazon-ec2-metadata-mock amazon-ec2-metadata-mock-1.9.1.tgz -n termination-handler
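
To check that the mock is answering before wiring it to the termination handler, something along these lines could work. It is a sketch that assumes the Service listens on port 1338 and that the mock serves the usual EC2 spot interruption path; the helper names are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Forward local port 1338 to the mock Service (blocks until interrupted).
forward_mock() {
  kubectl port-forward --namespace termination-handler \
    service/amazon-ec2-metadata-mock-service 1338
}

# Ask the mock for the spot interruption notice, using the same metadata
# path that EC2 serves on a real spot instance.
query_mock() {
  curl --silent http://localhost:1338/latest/meta-data/spot/instance-action
}
```

forward_mock would run in one terminal and query_mock in another.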

To point the termination handler at this metadata endpoint we have to add instanceMetadataURL to the values file, pointing to the Service named amazon-ec2-metadata-mock-service in the namespace where we installed it. With the helm install above, the URL looks like this:

instanceMetadataURL: "http://amazon-ec2-metadata-mock-service.termination-handler.svc.cluster.local:1338"
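
Instead of editing the values file, the same setting could be passed on the command line. A minimal sketch, assuming the release already exists under the name used earlier; redeploy_with_mock is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch only: redeploy_with_mock is a hypothetical helper name.

MOCK_URL="http://amazon-ec2-metadata-mock-service.termination-handler.svc.cluster.local:1338"

# Upgrade the existing release, keeping its current values and
# overriding only instanceMetadataURL.
redeploy_with_mock() {
  helm upgrade aws-node-termination-handler \
    --namespace termination-handler \
    --reuse-values \
    --set instanceMetadataURL="$MOCK_URL" \
    eks/aws-node-termination-handler
}
```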

Once we redeploy the termination handler with the new setting, there is a 2-minute wait before the fake termination notice is delivered, so we will be able to see how the handler drains the node in preparation for its termination.
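
To follow the drain while it happens, a couple of helpers like these may be handy. This is a sketch: the helper names are hypothetical, and it assumes the DaemonSet ends up named aws-node-termination-handler, which depends on the release name used:

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# List nodes that have been cordoned (STATUS includes SchedulingDisabled).
cordoned_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /SchedulingDisabled/ {print $1}'
}

# Show recent log lines from one of the termination handler Pods.
# Assumes the DaemonSet is named after the release.
nth_logs() {
  kubectl logs --namespace termination-handler \
    daemonset/aws-node-termination-handler --tail=50
}
```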

The termination handler uses a DaemonSet to spawn one Pod per worker, so if we don't want it running on on-demand instances we can add a small script to the ASG's user_data that detects whether the instance is spot or on-demand and sets a label accordingly:

data "template_file" "user_data_workers" {
  # "<<-" strips the common leading indentation so the shebang starts at column 0
  template = <<-EOF
      #!/bin/bash
      set -o xtrace

      # A spot request matching this instance ID means we are on a spot instance
      LIFECYCLE=$(aws ec2 describe-spot-instance-requests --filters Name=instance-id,Values="$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)" --region "eu-west-1" | jq -r '.SpotInstanceRequests | if length > 0 then "spot" else "ondemand" end')

      # Register the node with the lifecycle label set on the kubelet
      /etc/eks/bootstrap.sh \
        --apiserver-endpoint '${var.cluster_endpoint}' \
        --b64-cluster-ca '${var.cluster_certificate_authority}' \
        --kubelet-extra-args "--read-only-port=10255 --node-labels=node/lifecycle=$LIFECYCLE" \
        '${var.cluster_id}'
    EOF
}

Bear in mind that describe-spot-instance-requests requires you to specify the region you are in, so you'll have to adjust it accordingly. Finally, we can use this label to set a nodeSelector for the termination handler, so the DaemonSet only schedules Pods on the instances that are actually spot instances:

nodeSelector:
  node/lifecycle: "spot"
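
To confirm that the label and the nodeSelector behave as expected, a quick check could look like this. It is a sketch with hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Show the node/lifecycle label as an extra column for every node.
show_lifecycle_labels() {
  kubectl get nodes --label-columns node/lifecycle
}

# List the termination handler Pods along with the node each one runs on.
show_nth_pods() {
  kubectl get pods --namespace termination-handler --output wide
}
```

Every node listed by show_nth_pods should be one that show_lifecycle_labels reports as spot.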

Posted on 29/09/2021