Our Outrageous Decision to Orchestrate Nomad with Kubernetes

“Write less code.”

Rob Zuber, CTO of CircleCI, stepped back to observe this directive, written in faded scarlet on a whiteboard. Like all important directives, this one was pithy and counterintuitive; how could we possibly hope to build CircleCI 2.0 by doing less?

If you’ve read this blog post about our release philosophy, you’ll remember that replatforming was a pretty frightening prospect for us. One of the reasons it was so frightening was that we were starting from scratch; there’s a real danger in overdesigning a product when you do that. But there’s also an opportunity to take stock of all the new options — the great work that’s been done to advance the toolset. This story isn’t really about creating the prototype of CircleCI 2.0: it’s about scaling that prototype. It’s about how we, the SRE team, took on a big project with little time. Our mission: get our platform to grow up by making the new CircleCI highly available (HA). And we had to move fast because every day spent not continuously delivering features was causing us substantial pain.

This is a prequel to the release of CircleCI 2.0. We’re going to show you how we turned a proof of concept into a truth of concept — without writing more code and tooling than we needed. We accomplished this feat of brevity through the aid of two players: Nomad and Kubernetes.

Why We Used Nomad

On a glorious spring day in Toronto, the platform team proposed a new build system, more widely known as CircleCI 2.0. They knew they needed a proper scheduling solution for all those jobs, so they spread the options across the table, like ducks. Mesos, Nomad, Kubernetes… a band of mercenary orchestration systems, each vying for attention.

We had some Requirements, though, and that made the decision a little easier.

A Good Scheduler is Hard to Find

We run people’s code. And we run people’s code on the same server as other people’s code. So, we had to bake proper isolation into CircleCI from the beginning; it couldn’t have been tacked on at the last minute. There were only a few ways we could do isolation in a reasonable way, and the one we picked was LXC.

Over the years, we’ve gotten pretty adept at managing the nuances of LXC and its tools: we know what output they generate, and we understand the edge cases. The platform team wanted to capitalize on that expertise by picking a tool that shared many of these qualities, which left them with Mesos and Nomad.

Why Not Mesos?

At first, Mesos seemed like the clear winner: it’s battle-tested and operates at scale at companies like Twitter and Airbnb. But part of its ability to support these huge organizations rests on some substantial overhead; specifically, it requires frameworks like Marathon or Chronos. Depending on the framework, you’ll get a different flavor of Mesos: Marathon focuses on services, while Chronos specializes in cron jobs. The platform team needed a framework designed for batch jobs, and Mesos didn’t have a great solution in place.

Furthermore, operating a Mesos fleet is A Whole Thing, requiring extra setup and engineers we couldn’t spare. The out-of-the-box config was insecure for untrusted code, and it didn’t offer much of the isolation we wanted. Don’t get us wrong: Mesos is an exceptional tool, but it’s also an exceptionally beefy tool. The platform team didn’t have the time to fence off the beast and opted for something a little tamer.

Write less code.

Nomad, No Problems (almost)

Nomad is much leaner and more user-friendly. It’s wonderfully fast at scheduling jobs, has a great range of containerization options, and also boasts a fairly cooperative API. All of this is key when trying to pay that flexibility forward to our customers. Additionally, Nomad was still evolving, so we had some say in its direction; we issued patches while making the new build system and pushed the tool to its limits.

Operating Nomad did have some special considerations. Rolling out updates to Nomad servers required keeping quorum intact, which made most provisioning tools risky. Our initial approach was to go into each server to update and restart it. There were also cases where the Nomad servers would become unresponsive; someone would have to detect that, access the server and manually restart the Nomad agent. This level of manual intervention was obviously not scalable and factored heavily into…

Why We Used Kubernetes

In its larval form, CircleCI 2.0 wasn’t equipped to deal with customers at scale: all the new core services ran on one lonely server. This was a single point of failure and would have led to outages and engineer tears — not really acceptable for a mature product.

So our task was to take everything that had been built into The Product (including Nomad), automate it, scale it, and make everything HA. This was a hard requirement for exiting the prototype phase and entering our customers’ lives. Since this was already an uncomfortably long project, our decision process boiled down to a few factors:

Maturity, Community, and Experience

There were several candidates for CircleCI 2.0’s orchestration system (a distinctly separate concern from scheduling jobs): Docker Swarm, Mesos, ECS, Nomad, and Kubernetes. Many of these were still in their infancy and not quite ready for prime time. Kubernetes was a different story: it was stable, used in several production environments, and had a deliciously active community. Heck, even we had used it in previous jobs, so we weren’t plunging into unknown territory.

We can’t stress enough how important this is. The lifespan and adoption of a tool give us confidence that others have encountered similar problems and (perhaps) know how to solve them. We want to be able to go to an IRC channel, present an issue, and hear someone suggest a solution, or at least know about the issue. Maybe they haven’t completely fixed it, but getting visibility across an entire system is much less intimidating when you have an army of thinkers.

And this was one of the main reasons we didn’t use Mesos: so much of that production experience is hidden behind corporate firewalls. No one on our team knew how to use it, and we wouldn’t have been able to fill in those knowledge gaps without a community discussing the gritty details.

The allure of a shiny New Thing is dangerously tantalizing, but we resisted that pull and opted for a tool already in use across other production environments. The bigger the community, the more likely it is that we’ll find our answers within that community; a thousand companies using a tool will find edge cases much faster than ten companies will.

One Size Should Not Fit All

Some clever readers may be thinking, “Hey, you already figured out how to use Nomad; why not save some time and leverage your knowledge by using it again to orchestrate the whole product? Wouldn’t that reduce complexity?”

Fair questions. While Nomad was great for “single-purpose” scheduling of build containers, Kubernetes had an arsenal of features that other orchestration systems lacked, including robust health checking, configurable deployment strategies, distributed networking, secrets, and operations tooling.

We’ve also got a policy of strict isolation between our own infrastructure and our customers’ infrastructure. While this does reduce leverage, it means we’re free to choose the right tool for the job. We had some tight constraints that led to choosing Nomad, but we didn’t let that decision trigger any tunnel vision — we never felt obligated to use Nomad for everything.

Instead, we took a breath and a hard look at our goals. We wanted the system we chose to be purposeful and composable, and that meant spending time exploring other options. And this is something we could only do because we’re at a certain scale. In a smaller shop, it would make sense to have either a Nomad expert or a Kubernetes expert — but not both.

But we’re large enough that we can let teams pick the right tools for the job. And we’re also large enough that we’re required to pick the right tools for the job: when you’re running a large-scale distributed system, every little bit counts.

It Handled Nomad’s Quirks

Kubernetes helped us address Nomad’s need for manual intervention. If we wanted to update our version of Nomad, we could push it to Kubernetes, which would start by updating only the first server. If that failed, Kubernetes would stop the deploy and keep the remaining four servers running on the old version, so quorum stayed intact and everything could be safely rolled back.
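To make that behavior concrete, here is a minimal Python sketch of a one-at-a-time rollout that halts on the first failure. It is a toy simulation, not how Kubernetes implements rolling updates: the server names, version strings, and the `try_update` callback are all hypothetical. The quorum arithmetic is the standard majority rule, floor(n / 2) + 1, applied to an assumed cluster of five servers (the one being updated plus the remaining four).

```python
from typing import Callable, Dict, List, Optional, Tuple


def quorum_size(server_count: int) -> int:
    """Majority needed by a consensus cluster: floor(n / 2) + 1."""
    return server_count // 2 + 1


def rolling_update(
    versions: Dict[str, str],
    new_version: str,
    try_update: Callable[[str, str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Update servers one at a time and stop at the first failure.

    Returns (names updated successfully, name that failed or None).
    Servers that were never reached keep their old version.
    """
    updated: List[str] = []
    for name in list(versions):
        if not try_update(name, new_version):
            return updated, name  # halt the rollout; leave the rest alone
        versions[name] = new_version
        updated.append(name)
    return updated, None


# Hypothetical five-server cluster, all on the old version.
cluster = {f"nomad-{i}": "v1" for i in range(5)}


def always_fails(name: str, version: str) -> bool:
    """Stand-in for a broken release: every update attempt fails."""
    return False


updated, failed = rolling_update(cluster, "v2", always_fails)
still_serving = [name for name in cluster if name != failed]

print("quorum needed:", quorum_size(len(cluster)))
print("servers untouched:", len(still_serving))
print("quorum intact:", len(still_serving) >= quorum_size(len(cluster)))
```

In this sketch a bad release costs at most one server, which is why stopping after the first failure matters so much for a cluster that depends on a majority.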

As for the servers that wedged and stopped responding, Kubernetes’ health checks proved invaluable. The readiness probe could check a given HTTP endpoint and confirm that an updated server was accessible before the update moved on, and the liveness probe could check that a server is… live, and restart it if not. So Kubernetes gave us autonomous, self-actualized containers that could auto-heal if they became unhealthy.
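As a rough illustration of what such probes look like, the Python sketch below builds the probe portion of a Kubernetes container spec and prints it as JSON. The field names (`readinessProbe`, `livenessProbe`, `httpGet`, `periodSeconds`, `failureThreshold`) are standard Kubernetes ones, but everything else is an assumption made for the example: the container name, the placeholder image, the timing values, and the choice of Nomad's `/v1/status/leader` endpoint on its default HTTP port. It is not our actual configuration.

```python
import json
from typing import Any, Dict

# Assumptions for illustration only.
NOMAD_HTTP_PORT = 4646             # Nomad's default HTTP port
HEALTH_PATH = "/v1/status/leader"  # one Nomad HTTP API endpoint that could be probed


def http_probe(path: str, port: int, period_seconds: int,
               failure_threshold: int) -> Dict[str, Any]:
    """Build an HTTP GET probe using Kubernetes' probe field names."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


container = {
    "name": "nomad-server",                # hypothetical container name
    "image": "example/nomad:placeholder",  # hypothetical image
    # Readiness: the server only counts as ready once the endpoint responds.
    "readinessProbe": http_probe(HEALTH_PATH, NOMAD_HTTP_PORT, 5, 3),
    # Liveness: repeated failures cause the container to be restarted.
    "livenessProbe": http_probe(HEALTH_PATH, NOMAD_HTTP_PORT, 10, 3),
}

print(json.dumps(container, indent=2))
```

The two probes hit the same endpoint here for simplicity; in practice the liveness check is often given more slack than the readiness check so that a slow server is taken out of rotation before it is restarted.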

Write Less Code, Use More Tools

So, using Kubernetes to orchestrate Nomad really isn’t so outrageous. It’s a matter of perspective: once we landed on Nomad as the solution to our scheduling problem, we considered it to be part of the product; Kubernetes was the solution for scaling the product.

This distinction is important because it affects how we approached each problem. As soon as Nomad fulfilled its destiny as a job scheduler, we stopped treating it as an orchestration system; instead, it just became one of several components in CircleCI 2.0. And that meant we could tackle the problem of making CircleCI 2.0 highly available separately from the problem of scheduling jobs.

All of this was in the spirit of writing less code. By breaking these problems into digestible pieces, we were able to pick the right tool for each scenario, instead of picking one orchestration system and using it everywhere. It’s a liberating feeling, and we highly recommend it. The glorious conclusion to this story will be the cathartic deletion of all the custom orchestration code that the original CircleCI relies on, and we very much look forward to that day.