Note from the publisher: Read our updated post here: How CircleCI processes over 30 million builds per month.
Originally posted on Stackshare.
CircleCI is a platform for continuous integration and delivery. Thousands of engineers trust us to run tests and deploy their code, so they can focus on building great software. That trust rests on a solid stack of software that we use to keep people shipping and delivering value to their users.
As CTO at CircleCI, I help make the big technical decisions and keep our teams happy and out of trouble. Before this, I was CTO of Copious, where I learned a lot of important lessons about tech in service of building a consumer marketplace. I like snowboarding, Funkadelic, and viscous cappuccino.
Engineers are people. People work better in small groups. So we’ve divided our team into several functional units, inspired by Spotify’s pods. We’re much smaller, so we’ve adapted their ideas to meet our needs, while maintaining the core principle that each team has the resources they need to implement a feature across the stack.
But we think of these teams as more of a guideline than actual rules, so folks are free to move around if it means they’ll be more engaged in the work. Flexibility is a key value at CircleCI: it has to be, with the majority of our engineers working remotely across multiple time zones. To keep everyone on the same page, we use Zoom for videoconferencing and screensharing and update statuses in Pingboard to keep track of who’s “in the office”.
We use JIRA to create consistency in our processes across teams. This consistency lets us stay more nimble if engineers ever need or want to switch teams. We use GitHub for version control and Slack for Giphy control. In addition to chat, we use Slack-based integrations with tools like Hubot, PagerDuty, and Looker to give us central access to many day-to-day tasks.
But you didn’t come here to read about how many Slack channels we have (241), you’re here to read about…
Most of CircleCI is written in Clojure. It’s been this way since almost the beginning. While there were some early spikes in Rails, the passion of a sole developer won out; by the time CircleCI was released to the market, it was written entirely in Clojure and has been at our platform’s core ever since.
Our frontend used to be in CoffeeScript, but when Om made a single-page ClojureScript application viable, we opted for consistency and unification. This choice wasn’t that hard to make, given how much we enjoy using Clojure. Having a lingua franca also helps reduce overhead when engineers want to move between layers of the stack.
That doesn’t mean we won’t sharpen other tools when warranted. The build agent for our recently launched 2.0 platform is written in Go, which lets us quickly inject a multi-platform static binary into environments where we can’t lean on a bunch of dependencies. We also use Go for CLI tools where static dependency compilation and fast start-up are more important than our love of Clojure.
But as we pull microservices out of our monolith, Clojure remains our weapon of choice. We’ve already got over ten microservices, and that number is growing rapidly. A major part of this velocity stems from using Clojure, which ensures developers can rapidly move between teams and projects without climbing a huge learning curve.
Our web app’s UI is written in ClojureScript. Specifically, we’re using the framework Om, a ClojureScript interface to Facebook’s React. This is currently in some flux, since we’re upgrading to Om Next, an Om reboot which fixes a lot of its quirks. You can read more about why we’re so excited in this deep dive by one of our engineers, Petra Jaros.
Two Pools, Both Alike in Dignity
There are two major pools of machines: the first hosts our own services — the systems that serve our site, manage jobs, send notifications, etc. These services are deployed within Docker containers orchestrated in Kubernetes. In 2012, this configuration wasn’t really an option. As functional programmers, though, we were big believers in immutable infrastructure, so we went all in on baking AMIs and rolling them on code changes.
However, rounding boot times and charges to the hour made using full VMs slow and expensive; rolling deploys in Docker with Kubernetes is much more efficient. Kubernetes’ ecosystem and toolchain made it an obvious choice for our fairly statically-defined processes: the rate of change of job types or how many we need in our internal stack is relatively low.
On the other hand, our customers’ jobs are changing constantly. It’s challenging to dynamically predict demand, what types of jobs we’ll be running, and the resource requirements of each of those jobs. We found that Nomad excelled in this area. With a fast, flexible scheduler built-in, Nomad distributes customer jobs across our second pool of machines, reserved specifically for scheduling purposes. In the Write Less Code, Use More Tools blog post, we highlight why we use Nomad for a fast scheduling solution.
While we did evaluate both Kubernetes and Nomad to do All These Things, neither tool was optimized for such an all-inclusive job. And we treat Nomad’s scheduling role as more a piece of our software stack than as a part of the management or ops layer. So we use Kubernetes to manage the Nomad servers.
We’ve also recently started using Helm to make it easier to deploy new services into Kubernetes. We’ve had to build a couple small services to string the full CD process together with Helm, while also keeping Kubernetes locked down — but the results have been great. We create a chart (i.e. package) for each service. This lets us easily roll back new software and gives us an audit trail of what was installed or upgraded.
For the last five years, we’ve run our infrastructure on AWS. It started simply because our architecture was simple but evolved into a necessarily complex stack of Linked Accounts, VPCs, Security Groups, and everything else AWS offers to help partition and restrict resources. We’re also running across multiple regions. Our deep investment in AWS led to increasing assumptions in our code about how the software was being managed.
When we introduced CircleCI Enterprise (our on-prem offering), we started supporting a number of different deployment models. We also started separating ourselves further from the system by packaging our code in Docker containers and using cloud-agnostic Kubernetes to manage resources and distribution.
With a much lower level of vendor lock-in, we’ve gained the flexibility to push part of our workload to Google Cloud Platform (GCP) when it suits us. We chose GCP because it’s particularly well-suited for short-lived VMs. Today, if you use our machine executor to run a job, it will run in GCP. This executor type allocates a full VM for tasks that need it.
We’ve also wrapped GCP in a VM service that preallocates machines, then tears everything down once you’re finished. Using an entire VM means you have full control over a much faster machine. We’re pretty happy with this architecture since it smooths out future forays into other platforms: we can just drop in the Go build agent and be on our merry way.
Communication with Frontend
When the frontend needs to talk to the backend, it does so via a dedicated tier of API hosts. These API hosts are also managed by Kubernetes, albeit in a separate cluster to increase isolation. Nearly all our APIs are public, which means we’re using the same interfaces available to our customers. The value of dogfooding your APIs can’t be overstated: it’s enabled us to keep the APIs clean and spot errors before our users find them.
If you’re interacting with our web application, then all of your requests are hitting the API hosts. The majority of our authentication is handled via OAuth from GitHub or Bitbucket. Once you’ve authenticated, you can also generate an API token to get programmatic access to everything we expose in the UI.
Our API hosts once accepted webhooks from GitHub and Bitbucket, but we’ve recently extracted that into its own service. Using a cleanly-separated service that dumps hooks into RabbitMQ allows us to more easily respond to a large array of operational issues. When version control system (VCS) providers are recovering from their own issues, we’ve seen massive spikes in hooks. Now we’re well equipped to deal with that.
Our primary datastore is MongoDB. We made this decision in CircleCI’s early days — lured like so many others by the simplicity of “schemaless” storage and rapid iteration. Having peaked at over 10TB of bloated storage in MMAP, along with painful, outage-inducing DB-level locks in Mongo 2.4, we’re happy to see progress being made in WiredTiger. Our operations have greatly improved, but we’re still suffering from a legacy of poorly-enforced schemas on a dataset too large to clean efficiently.
So we’re retreating to the structure of PostgreSQL. We’ve got a great opportunity for this migration as we build microservices with their own datastores. We’re also using Redis to cache data we’d never store permanently, as well as to rate-limit our requests to partners’ APIs (like GitHub).
When we’re dealing with large blobs of immutable data (logs, artifacts, and test results), we store them in Amazon S3. We’re well beyond the scale where we could just dump this kind of stuff in a database. We handle any side-effects of S3’s eventual consistency model within our code to ensure that we deal with user requests correctly while writes are in process.
A Build is Born
When we process a webhook from GitHub/Bitbucket telling us that a user pushed some new code, we use the information to create a new build or workflow representation in our datastores, then queue it for processing. In order to get promoted out of this first queue, the organization needs to have enough capacity in its plan to run the build/workflow.
If you’re a customer using all your containers, no new builds or workflows are runnable until enough containers free up. When that happens, we’ll pass the definition of the work to be performed to Nomad, which is responsible for allocating hardware for the work’s duration.
Running the Build
The gritty details of processing a build are executed by the creatively named build agent. It parses configuration, executes commands, and synthesizes actions that create artifacts and test results. Most builds run in a Docker container, or set of containers, which is defined by the customer for a completely tailored execution environment.
The build agent streams the results of its work over gRPC to the output processor, a secure facade that understands how to write to all our internal systems. This facade approach allows our 1.0 and 2.0 platforms to coexist.
In order to get this live streaming data to your browser, we use WebSockets managed by Pusher. We also use this channel to deliver state change notifications to the browser, e.g. when a build completes. We also store small segments temporarily in Redis while we collect enough to write permanently to S3.
A Hubot Postscript
We have added very little to the CoffeeScript Hubot application – just enough to allow it to talk to our Hubot workers. The hubot workers implement our operational management functionality and expose it to Hubot so we can get chat integration for free. We’ve also tailored the authentication and authorization code of Hubot to meet the needs of roles within our team.
For larger tasks, we’ve got an internal CLI written in Go that talks to the same API as Hubot, giving access to the same functionality we have in Slack, with the addition of scripting, piping, and all of our favorite Unix tools. When the Hubot worker recognizes the CLI is in use, it logs the commands to Slack to maintain visibility of operational changes.
Analytics & Monitoring
Our primary source of monitoring and alerting is Datadog. We’ve got prebuilt dashboards for every scenario and integration with PagerDuty to manage routing any alerts. We’ve definitely scaled past the point where managing dashboards is easy, but we haven’t had time to invest in figuring out their more anomalous features. Nor the willingness to trust that it will just work for us. We capture any unhandled exceptions with Rollbar and, if we realize one will keep happening, we quickly convert the metrics to point back to Datadog, to keep Rollbar as clean as possible. We’re also using LaunchDarkly to safely deploy new and/or incomplete features behind feature flags.
We use Segment to consolidate all of our trackers, the most important of which goes to Amplitude to analyze user patterns. However, if we need a more consolidated view, we push all of our data to our own data warehouse running Postgres; this is available for rapid analytics and dashboard creation through Looker. Many engineers who want to do their own analysis use tools they’re comfortable with, which includes
awk but also Pandas and R.
One of the great things about being a CI/CD company is that we get to practice what we preach. Instead of long dry spells between releases, we push several changes per day to keep our feedback loops short and our codebase clean. We’re small enough that we can move quickly, but large enough that our teams have the resources they need.
This is our stack today. As our customers deal with more complex problems, we’ll adapt and adopt new tools to deal with emerging tech. It’s all very exciting, and we can’t wait to see what the future holds.
While we wait for the future, though, there’s no reason you should be waiting for good code. Start building on CircleCI today and ship your code faster. Or come work with us and help us ship our own code faster.