Note from the publisher: This content was updated by Senior DevOps Customer Engineer Ben Van Houten on 9/3/2020.



Amazon’s Auto Scaling groups (ASGs) are, in theory, a great way to scale. The idea is that you give them a desired capacity and the knowledge of how to launch more machines, and they will fully automate spinning your fleet up and down as the desired capacity changes. Unfortunately, in practice, there are a couple of key reasons we can’t use them to manage our CircleCI.com fleet, one of the most important being that the default ASG termination policy kills instances too quickly. Since our instances are running builds for our customers, we can’t simply kill them instantly. We must wait for all builds to finish before we can terminate an instance.

Graceful shutdown, where you must wait for some condition before terminating an instance, isn’t a problem unique to us. But it is a problem that gets solved differently depending on the size of your fleet and your usage pattern. For example, we use a custom system to scale our CircleCI.com fleet because the load on CircleCI.com follows a largely predictable usage pattern based on when our customers are at work around the world, with a little randomness mixed in. The simple metrics-based scaling policies that ASGs provide aren’t quite sufficient to model this.

But, with the release of CircleCI Enterprise, we’ve realized that asking each of our customers to invent their own fully custom scaling system doesn’t make sense when their fleet size and load requirements are often much simpler. So, we wanted to see if there was a way we could create a simple solution that covered the common scaling patterns of most of our customers, while still allowing them to customize things more if the basics weren’t enough.

The CircleCI 1.x solution


At a high level, our solution includes six parts:

  1. An ASG to manage EC2 instances.
  2. An Auto Scaling lifecycle hook that fires when it’s time to shut an instance down, putting the instance into the “Terminating:Wait” state and triggering an SNS notification.
  3. An SNS topic that triggers the Lambda function implementing the lifecycle hook action.
  4. A Lambda function that executes the “nomad node drain -enable” command on the designated node through AWS SSM. This ensures that the Nomad client about to terminate will not pick up any more jobs.
  5. The node is marked as ineligible to prevent new jobs from being queued, and all running jobs on the node are allowed to finish for a graceful termination.
  6. Once the drain command succeeds, the instance is terminated by the ASG.

We use Terraform to manage our CircleCI.com AWS infrastructure, and in most of the scripts we recommend to CircleCI Enterprise customers. We’re huge fans of it because it allows us to declaratively list our infrastructure in code in a way that is briefer and less error-prone than CloudFormation. The resources that follow are listed as Terraform examples, but the approach will still work if you build them manually or with CloudFormation. We’ve made the full Terraform file available if you’d like to try it out for yourself.

Auto Scaling groups

Rather than automating our manual scaling strategies for CircleCI Enterprise customers, we wanted to provide hooks into existing best practices. In the case of AWS, that meant Auto Scaling groups. ASGs provide a lot of flexibility. In their simplest form, they act as very basic fault tolerance, automatically spinning up new machines if any of the machines stop working. They also allow you to attach a scaling schedule for easy time-based scaling, or a scaling policy for reactive scaling based on metrics. For the purpose of this blog post, we’ll assume you already have an ASG set up called nomad_clients_asg.

Auto Scaling lifecycle hooks

But, as we mentioned before, ASGs don’t give you very long to terminate an instance. For us, our graceful shutdown must wait for builds to finish before it can terminate an instance, a process which can take half an hour or more. So, we turn to a relatively unknown addition to ASGs, the Lifecycle Hook.

Lifecycle Hooks allow us to get notifications from Amazon when the ASG performs certain actions. In this case, the one we care about is autoscaling:EC2_INSTANCE_TERMINATING, which tells us when the ASG is trying to terminate an instance. We chose a heartbeat_timeout of 1 hour and a default_result of CONTINUE. Since our graceful shutdown typically takes far less than an hour, this means that if something goes wrong and we’re still running after an hour, Amazon will force-terminate us.

resource "aws_autoscaling_lifecycle_hook" "graceful_shutdown_asg_hook" {
    name = "graceful_shutdown_asg"
    autoscaling_group_name = "${aws_autoscaling_group.nomad_clients_asg.name}"
    default_result = "CONTINUE"
    heartbeat_timeout = 3600
    lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
    notification_target_arn = "${aws_sns_topic.graceful_termination_topic.arn}"
    role_arn = "${aws_iam_role.autoscaling_role.arn}"
}

Amazon SNS notifications

You’ll notice in the example above that the notification_target_arn for the lifecycle hook is an SNS topic. That’s because Amazon needs somewhere to send the termination message. Rather than writing our own endpoint, we decided to let Amazon handle maintaining the state of which instances need to be terminated for us.

resource "aws_sns_topic" "graceful_termination_topic" {
  name = "graceful_termination_topic"
}

The sample code for the SNS topic itself is pretty simple, but you also need to create an IAM role and associated policy that allow the lifecycle hook to publish to the SNS topic (configured above in the role_arn section).

resource "aws_iam_role" "autoscaling_role" {
    name = "autoscaling_role"
    assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "autoscaling.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "lifecycle_hook_autoscaling_policy" {
    name = "lifecycle_hook_autoscaling_policy"
    role = "${aws_iam_role.autoscaling_role.id}"
    policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
          "Sid": "",
          "Effect": "Allow",
          "Action": [
              "sns:Publish"
          ],
          "Resource": [
             "*"
          ]
        }
    ]
}
EOF
}

Using an SNS topic here was the right solution for us because our architecture allows multiple workers to consume the shutdown notifications. If your architecture instead has a leader that can shut down any of its followers, so that there is only a single consumer, you’ll want to look into using an SQS queue (Amazon’s managed message queue service) here instead.

The graceful shutdown

The only part left is the actual consumption from the queue, and the graceful shutdown of the machine. Depending on how you do graceful shutdown, your solution for consuming the queue may look very different from ours. But you’ll still need to know a couple of things about what the notifications in the queue look like.

First, Amazon sends a test notification when you first attach a Lifecycle Hook to make sure the connection works properly:

{
  "AutoScalingGroupName":"example_asg",
  "Service":"AWS Auto Scaling",
  "Time":"2016-02-26T21:06:40.843Z",
  "AccountId":"some-account-id",
  "Event":"autoscaling:TEST_NOTIFICATION",
  "RequestId":"some-request-id-1",
  "AutoScalingGroupARN":"some-arn"
}

You can safely ignore this message in your consumer.

The actual shutdown notifications are the only other piece of data you’ll need to know the shape of, and they look like this:

{
  "AutoScalingGroupName":"example_asg",
  "Service":"AWS Auto Scaling",
  "Time":"2016-02-26T21:09:59.517Z",
  "AccountId":"some-account-id",
  "LifecycleTransition":"autoscaling:EC2_INSTANCE_TERMINATING",
  "RequestId":"some-request-id-2",
  "LifecycleActionToken":"some-token",
  "EC2InstanceId":"i-nstanceId",
  "LifecycleHookName":"graceful_shutdown_asg"
}

Note: This particular solution is only available for those on CircleCI Enterprise 2.0.

How you consume these notifications is entirely up to you. In our case, we just added a Node.js script runner inside of a Lambda function that our customers can customize. We did it this way because we wanted to give our customers something pre-packaged, and easy to configure.

AWS Lambda function and SSM agent document

Our Lambda function uses the AWS JavaScript SDK to leverage the AWS Systems Manager (SSM) Agent and communicate with instances in the specified Auto Scaling group. The SSM Agent is Amazon software that can be installed and configured on an EC2 instance; if you are using our base enterprise setup Terraform script, CircleCI uses Amazon Linux AMIs that come with the SSM Agent pre-installed.

To continue with our solution, you will need to upload an SSM document, which defines the actions you want Systems Manager to perform on your AWS resources, in this case our Nomad clients. For this solution the SSM document looks like the following, which executes the nomad drain command and then checks that the node has become ineligible for scheduling:

{
   "schemaVersion": "1.2",
   "description": "Draining Node",
   "parameters":{
     "nodename":{
       "type":"String",
       "description":"Specify the Node name to drain"
     }
   },
   "runtimeConfig": {
     "aws:runShellScript": {
       "properties": [
         {
           "id": "0.aws:runShellScript",
           "runCommand": [
             "#!/bin/bash",
             "nomad node drain -enable -self -y",
             "isEligible=$(nomad node-status -self -json | jq '.SchedulingEligibility | contains (\"ineligible\")')",
             "if [ \"${isEligible}\" = \"true\" ]; then exit 0; else exit 129; fi"
           ]
         }
       ]
     }
   }
 }

You can save the file as drain-document.json and execute the below command to create it in SSM:

aws ssm create-document --content "file://drain-document.json" --name "CircleCiDrainNodes" --document-type "Command"

Lastly, you will need to create a Lambda function to process the lifecycle hook. Download the Lambda code to your local workstation and install its dependencies. While installing, make sure you are using the same Node.js version as your Lambda runtime, and customize the Lambda script to match your environment.

Zip the code and create a Lambda function (or you can refer to our zip file here).
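If you want to write your own consumer instead, the core message handling can be sketched in a few lines of Node.js. The function and variable names below are illustrative, not CircleCI’s packaged script, and the actual AWS SDK calls (SSM SendCommand, Auto Scaling CompleteLifecycleAction) are indicated only in comments so the parsing logic stands alone:

```javascript
// Sketch of lifecycle-notification handling for a custom Lambda consumer.
// Names are illustrative; AWS SDK calls are left as comments.

function parseLifecycleMessage(snsEvent) {
  // SNS delivers the Auto Scaling notification as a JSON string in the record body
  const message = JSON.parse(snsEvent.Records[0].Sns.Message);

  // Amazon sends autoscaling:TEST_NOTIFICATION when the hook is first
  // attached; there is nothing to drain, so skip it.
  if (message.Event === "autoscaling:TEST_NOTIFICATION") {
    return null;
  }

  // Only act on termination notifications from our hook
  if (message.LifecycleTransition !== "autoscaling:EC2_INSTANCE_TERMINATING") {
    return null;
  }

  return {
    instanceId: message.EC2InstanceId,
    token: message.LifecycleActionToken,
    hookName: message.LifecycleHookName,
    asgName: message.AutoScalingGroupName,
  };
}

// Parameters for CompleteLifecycleAction, which tells the ASG the drain
// finished so the instance can terminate without waiting out the full
// heartbeat timeout.
function completeActionParams(target) {
  return {
    AutoScalingGroupName: target.asgName,
    LifecycleHookName: target.hookName,
    LifecycleActionToken: target.token,
    LifecycleActionResult: "CONTINUE",
  };
}

// In the real handler you would then: run the CircleCiDrainNodes SSM document
// against target.instanceId with ssm.sendCommand, wait for it to succeed, and
// call autoscaling.completeLifecycleAction(completeActionParams(target)).
```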

Conclusion

Auto Scaling groups have long been a great choice for managing scaling because they offer such flexibility in how you scale. And, with the addition of lifecycle hooks, they also give you flexibility in how you terminate, allowing for graceful shutdown.

That meant that when we were looking for a way to let CircleCI Enterprise customers scale their fleets with minimal Ops team overhead, ASGs with lifecycle hooks were a great choice. They offer a plug-and-play solution that delivers immediate value, while still allowing our customers advanced customization in how they actually scale their fleets.