Background

Amazon’s Auto Scaling groups (ASGs) are, in theory, a great way to scale. The idea is that you give them a desired capacity and the knowledge of how to launch more machines, and they fully automate spinning your fleet up and down as the desired capacity changes. Unfortunately, in practice, there are a couple of key reasons that we can’t use them to manage our CircleCI.com fleet. One of the most important is that the default ASG termination policy kills instances too quickly. Since our instances are running builds for our customers, we can’t simply kill them instantly; we must wait for all builds to finish before we can terminate an instance.

Graceful shutdown, where you must wait for some condition before terminating an instance, isn’t a problem unique to us. But it is a problem that gets solved differently depending on the size of your fleet and your usage pattern. For example, we use a custom system to scale our CircleCI.com fleet because the load on CircleCI.com follows a largely predictable usage pattern based on when our customers are at work around the world, with a little randomness mixed in. The simple metrics-based scaling policies that ASGs provide aren’t quite sufficient to model this.

But, with the recent release of CircleCI Enterprise, we’ve realized that asking each of our customers to invent their own fully custom scaling system doesn’t make sense when their fleet size and load requirements are often much simpler. So, we wanted to see if there was a way we could create a simple solution that covered the common scaling patterns of most of our customers, while still allowing them to customize things more if the basics weren’t enough.

The solution

At a high level, our solution includes four parts:

  1. An ASG to manage EC2 instances.
  2. An Auto Scaling lifecycle hook to publish notifications when it’s time to shut an instance down.
  3. An SQS queue to hold those notifications until we’re ready to gracefully shut down the machine.
  4. A worker to process the queue at a regular interval and send graceful shutdown commands to the appropriate instances. By default this is installed on the master machine, but in principle it could run anywhere, even as an AWS Lambda function.

We use Terraform for managing our CircleCI.com AWS infrastructure, and in most of the scripts we recommend to CircleCI Enterprise customers. We’re huge fans of it because it allows us to declaratively list our infrastructure in code in a way that is briefer and less error-prone than CloudFormation. The resources that follow are listed as Terraform examples, but the approach will still work if you want to build it manually or with CloudFormation. We’ve made the full Terraform file available if you’d like to try it out for yourself.

Auto Scaling Groups

Rather than automating our manual scaling strategies for CircleCI Enterprise customers, we wanted to provide hooks into existing best practices. In the case of AWS, that meant Auto Scaling groups. ASGs provide a lot of flexibility. In their simplest form they act as very basic fault tolerance, automatically spinning up new machines if any of the existing machines stop working. They also allow you to attach a scaling schedule for easy time-based scaling, or a scaling policy for reactive scaling based on metrics. For the purpose of this blog post, we’ll assume you already have an ASG set up called example_asg.

Auto Scaling Lifecycle Hooks

But, as we mentioned before, ASGs don’t give you very long to terminate an instance. Our graceful shutdown must wait for builds to finish before it can terminate an instance, a process which can take half an hour or more. So, we turn to a relatively unknown addition to ASGs, the Lifecycle Hook.

Lifecycle Hooks allow us to get notifications from Amazon when the ASG performs certain actions. In this case, the one we care about is autoscaling:EC2_INSTANCE_TERMINATING, which tells us when the ASG is trying to terminate an instance. We chose a heartbeat_timeout of 1 hour (3600 seconds) and a default_result of CONTINUE. Our graceful shutdown typically takes far less than an hour, so if something goes wrong and the instance is still running when the timeout expires, Amazon will force-terminate it.

resource "aws_autoscaling_lifecycle_hook" "graceful_shutdown_asg_hook" {
    name = "graceful_shutdown_asg"
    autoscaling_group_name = "${aws_autoscaling_group.example_asg.name}"
    default_result = "CONTINUE"
    heartbeat_timeout = 3600
    lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
    notification_target_arn = "${aws_sqs_queue.graceful_termination_queue.arn}"
    role_arn = "${aws_iam_role.autoscaling_role.arn}"
}

Amazon SQS Queues

You’ll notice in the example above that the notification_target_arn for the Lifecycle Hook is an SQS queue. That’s because Amazon needs somewhere to send the termination message. Rather than writing our own endpoint, we decided to let Amazon keep track of which instances need to be terminated.

resource "aws_sqs_queue" "graceful_termination_queue" {
  name = "graceful_termination_queue"
}

The sample code for the queue itself is pretty simple, but you also need to create an IAM role and an associated policy that allow the Lifecycle Hook to publish to the SQS queue (the role is what the role_arn attribute above refers to).

resource "aws_iam_role" "autoscaling_role" {
    name = "autoscaling_role"
    assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "autoscaling.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "lifecycle_hook_autoscaling_policy" {
  name = "lifecycle_hook_autoscaling_policy"
  role = "${aws_iam_role.autoscaling_role.id}"
  policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": [
                "sqs:GetQueueUrl",
                "sqs:SendMessage"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
EOF
}

Using a queue here was the right solution for us because our master can shut down any of its slaves, so we only have a single consumer. If your architecture needs multiple workers to consume the shutdown notifications, then you’ll want to look into using SNS (Amazon’s pub-sub as a service) here instead.

The Graceful Shutdown

The only part left is the actual consumption from the queue, and the graceful shutdown of the machine. Depending on how you do graceful shutdown, your solution for consuming the queue may look very different from ours. But you’ll still need to know a couple of things about what the notifications in the queue look like.

First, Amazon sends a test notification when you first attach a Lifecycle Hook to make sure the connection works properly:

{  
  "AutoScalingGroupName":"example_asg",
  "Service":"AWS Auto Scaling",
  "Time":"2016-02-26T21:06:40.843Z",
  "AccountId":"some-account-id",
  "Event":"autoscaling:TEST_NOTIFICATION",
  "RequestId":"some-request-id-1",
  "AutoScalingGroupARN":"some-arn"
}

You can safely ignore this message in your consumer.

The actual shutdown notifications are the only other piece of data you’ll need to know the shape of, and they look like this:

{  
  "AutoScalingGroupName":"example_asg",
  "Service":"AWS Auto Scaling",
  "Time":"2016-02-26T21:09:59.517Z",
  "AccountId":"some-account-id",
  "LifecycleTransition":"autoscaling:EC2_INSTANCE_TERMINATING",
  "RequestId":"some-request-id-2",
  "LifecycleActionToken":"some-token",
  "EC2InstanceId":"i-nstanceId",
  "LifecycleHookName":"graceful_shutdown_asg"
}
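
To make the two message shapes concrete, here is a minimal Python sketch of how a consumer might tell them apart. It is not the script we ship; the function name parse_notification and the keys of the returned dictionary are our own illustrative choices, and it assumes the message body is exactly the JSON shown above.

```python
import json

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"
TEST_EVENT = "autoscaling:TEST_NOTIFICATION"


def parse_notification(body):
    """Return shutdown details for a termination notice, or None otherwise.

    `body` is the raw JSON string of one queue message. The name of this
    function and the keys it returns are hypothetical, chosen for the example.
    """
    message = json.loads(body)
    # The one-off test notification carries an "Event" field and no instance.
    if message.get("Event") == TEST_EVENT:
        return None
    # Anything that is not a termination notice is ignored as well.
    if message.get("LifecycleTransition") != TERMINATING:
        return None
    return {
        "instance_id": message["EC2InstanceId"],
        "asg_name": message["AutoScalingGroupName"],
        "hook_name": message["LifecycleHookName"],
        "token": message["LifecycleActionToken"],
    }
```

Passing the test notification returns None, while passing the termination notification returns the instance ID along with the group name, hook name, and token that identify the pending lifecycle action.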

How you consume these notifications is entirely up to you. In our case, we just added a bash script wrapped inside a Docker container that our customers can install on the master machine. We did it this way because we wanted to give our customers something pre-packaged and easy to install.
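
If you would rather write the consumer in a general-purpose language, one pass over the queue might look roughly like the Python sketch below. This is an illustration under several assumptions, not a port of our bash script: process_queue_once and the graceful_shutdown callback are hypothetical names, sqs and autoscaling are expected to behave like boto3's SQS and Auto Scaling clients, and explicitly completing the lifecycle action (so the ASG does not wait out the full heartbeat timeout) is a choice made for this example.

```python
import json

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def process_queue_once(sqs, autoscaling, queue_url, graceful_shutdown):
    """Handle one batch of notifications; return the instance IDs shut down.

    `sqs` and `autoscaling` are assumed to expose the same methods as the
    boto3 clients of those names. `graceful_shutdown` is a hypothetical
    callback that takes an instance ID and returns once its builds are done.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    handled = []
    # The "Messages" key is absent when the queue is empty.
    for raw in response.get("Messages", []):
        notice = json.loads(raw["Body"])
        if notice.get("LifecycleTransition") == TERMINATING:
            graceful_shutdown(notice["EC2InstanceId"])
            # Tell the ASG it may proceed with terminating the instance.
            autoscaling.complete_lifecycle_action(
                AutoScalingGroupName=notice["AutoScalingGroupName"],
                LifecycleHookName=notice["LifecycleHookName"],
                LifecycleActionToken=notice["LifecycleActionToken"],
                LifecycleActionResult="CONTINUE",
            )
            handled.append(notice["EC2InstanceId"])
        # Test notifications and unrecognized messages are dropped unhandled.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=raw["ReceiptHandle"])
    return handled
```

With boto3 installed, the two clients would come from boto3.client("sqs") and boto3.client("autoscaling"), and the function would be called on whatever interval suits your fleet. A shutdown that runs for a long time also needs the queue's visibility timeout to exceed the shutdown time, otherwise the message will be redelivered; that tuning is left out of the sketch.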

Conclusion

Auto Scaling groups have long been a great choice for managing scaling, because they offer such flexibility in how you scale. And, with the addition of Lifecycle Hooks, they also provide you flexibility in how you terminate, allowing for graceful shutdown.

That meant that when we were looking for a way to let CircleCI Enterprise customers scale their fleets with minimal Ops team overhead, ASGs with Lifecycle Hooks were a great choice. They gave us a plug-and-play solution that delivered immediate value, while still allowing our customers advanced customization in how they actually scale their fleet.