CircleCI is a platform for continuous delivery. This means (among other things) we’re building serious distributed systems: dozens of servers running thousands of builds across hundreds of “container” hosts, coordinating between all the moving parts, and taking care of all the low-level details so that you have the simplest, fastest continuous integration and deployment possible.

In mid-March, we made some enormous infrastructural advances, starting with allowing a single build to use containers spanning several servers (previously, a parallel build’s containers all had to be co-located on a single server.) This, in turn, allowed us to crank up the parallelism: several of our largest customers went from running their builds across 4 containers to running them across 12 containers, over the course of a couple of weeks. Fantastic!

The features worked well in and of themselves, but the sudden growth put a lot of strain on our system as a whole, and turned several problems that might otherwise have stayed small into huge ones. We spent the last week of March and the first few weeks of April in almost full-time firefighting. Here’s what happened…

issue #0 - dying servers

As is often the case for us, the first thing we (and our users) noticed was a queue backlog. Internally, we observed “stuck” builds, which our coordination layer thought were running but which were not actually running on any server. We also saw builds failing when the containers they were using suddenly became unreachable.

Very quickly we realized that our servers were dying. If the server which died was the “owner” of a particular build, that build became stuck in the coordination system. If, on the other hand, the dead server was just a slave doing work on behalf of some builds, those builds would fail because they couldn’t reach it anymore.

The stuck builds confused our concurrency logic, blocking lots of builds from running even though they should have been. Our scaling code also relies on being able to get an accurate high-level view of what’s running from the coordination layer, so it wasn’t making good decisions. And finally, our scaling code relies on working build infrastructure to deploy new servers! Lots of our attempts to scale up the cluster were getting stuck… just like everyone else’s builds.

For about a day, we mitigated the problems by hand while investigating. Manual mitigation involved starting many, many, many servers by hand to work around the scaling failures (and overcompensate for crashing.) We also built ourselves some crude tools for detecting and clearing stuck builds.
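A crude stuck-build detector is essentially a diff between two views of the world. A minimal sketch of the idea (the build IDs and file handling are stand-ins, not our real tooling):

```shell
# Compare the builds the coordination layer thinks are running against
# the builds that live servers actually report. Lines unique to the
# first (sorted) list are the stuck builds.
coordinator_view=$(mktemp)
server_reports=$(mktemp)
# In production these lists come from the coordination layer and from
# polling every live server; these values are stand-ins.
printf 'build-17\nbuild-42\nbuild-99\n' > "$coordinator_view"
printf 'build-17\nbuild-99\n'           > "$server_reports"
comm -23 "$coordinator_view" "$server_reports"   # -> build-42 (stuck)
```

The real tool also had to clear the stuck entries out of the coordination layer, which this sketch leaves out.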

The first round of analysis quickly showed we were having kernel panics, causing the boxes to reboot (we do not set up our servers to automatically rejoin the cluster on reboot: we prefer to stick to the well-worn path of starting a new machine.) Our monitoring wasn’t catching these well because the servers very quickly came back “up”. To make sure we’d get alerted when a system reboots, we added heartbeat-based monitoring of the actual server process.

And then the other shoe dropped: the cluster leader died. The cluster leader in our system is very simply the most recently booted server: its job is to decide when the cluster should scale up or down. When the leader died, early in the morning Pacific time, the cluster became static, and completely failed to scale up for the weekday morning demand.

After a mad scramble to get the cluster back on its feet, we added a deadman’s switch, for the leader role only.
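The mechanism is simple enough to sketch in a few lines of shell (the heartbeat-file layout and the 30-second threshold are illustrative; the real switch lives in our coordination code):

```shell
# Deadman's switch sketch: the leader touches a heartbeat file on every
# cycle; a watcher declares it dead if the heartbeat goes stale.
MAX_AGE=30   # seconds without a heartbeat before we act (illustrative)
is_stale() {
  now=$(date +%s)
  beat=$(stat -c %Y "$1" 2>/dev/null || stat -f %m "$1")  # GNU, then BSD
  [ $((now - beat)) -gt "$MAX_AGE" ]
}
hb=$(mktemp)
touch "$hb"    # a fresh heartbeat, as the leader would write
if is_stale "$hb"; then
  echo "leader dead: promote the next most recently booted server"
else
  echo "leader alive"
fi
```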

Once we’d bought ourselves breathing room with our changes, it only took a day to reproduce, isolate and fix the problem.

The speedy fix was, frankly, a lucky guess. There were some gnarly error messages in the system error logs which pointed roughly at LVM snapshot problems. (LVM is the Linux logical volume manager, which among other things provides point-in-time snapshot volumes.) We reproduced the issue by creating the aptly-named ‘circleci/dummy-disk-filler’ project, and running its tests many times concurrently. The lucky guess was in the isolation step: the first thing we removed was a vaguely suspicious command: resize2fs, which we use to adjust the filesystem size to match the space available in the underlying logical volume. Lo and behold, removing it completely fixed the problem!

The fix we eventually shipped actually just moved the resize2fs command. Instead of:

lvcreate image -> lvcreate snapshot -> resize2fs

it went:

lvcreate image -> resize2fs -> lvcreate snapshot

(which makes a lot more sense, actually.)
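In LVM terms, the change looked roughly like this (the volume group, volume names, and sizes here are illustrative, not our real configuration, and this obviously needs root on a box with LVM set up):

```shell
# before: resize the filesystem after snapshotting (panic-prone)
lvcreate -L 40G -n image vg0
lvcreate -s -L 5G -n container0 vg0/image
resize2fs /dev/vg0/container0

# after: resize the origin's filesystem first, then snapshot it
lvcreate -L 40G -n image vg0
resize2fs /dev/vg0/image
lvcreate -s -L 5G -n container0 vg0/image
```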

In the process of torturing the disks, we found and fixed several other defects in the LVM configuration, one of which was a kernel-panicking bug in slightly different circumstances!

issue #1 - slowclones

This bug manifested as – you guessed it – a severely backlogged queue, and a huge spike in the number of concurrently-running builds, but without a simultaneous spike in load. We were running a ton of builds… but they weren’t doing any work.

We quickly traced the problem to extremely slow git clones from GitHub. In particular, we found that clones of large repos were taking an extremely long time (hours) in git-upload-pack. They eventually seemed to finish successfully, though.

The first time this happened, it caused a bunch of nasty cascading failures. There was (at the time) no timeout wrapped around the git commands which we use to configure a project for building. Similarly, the ‘cancel’ functionality didn’t work during this stage, so people couldn’t “work around” the issue themselves by canceling their slow builds. But the worst thing was that our scaling infrastructure relied on GitHub’s good health, because each new server clones the code from GitHub – so we were almost completely unable to scale to the huge demand for (basically idle) build capacity. GitHub’s support suspected a bug with older git clients, and suggested that we upgrade our git clients to the latest version…

And then it went away.

In the lull between crises, we added timeouts and cancel-ability to the “configure the build” step, made some minor adjustments to the scaling algorithm, and upgraded git. We also started working on non-git-backed deployment of our own code, to keep our scaling working without GitHub (and to speed up our deploys, in general.)
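The timeout itself can be sketched with coreutils `timeout` (the wrapper name and the stand-in command are ours for illustration; the real limit and plumbing live in our build code):

```shell
# Wrap a configure-step command in a hard timeout. `timeout` exits with
# status 124 when the command overruns, which we turn into a fast,
# clean failure instead of an hours-long hang.
run_with_timeout() {
  secs=$1; shift
  timeout "$secs" "$@"     # e.g. run_with_timeout 600 git clone "$url" .
}
rc=0
run_with_timeout 1 sleep 5 || rc=$?   # `sleep` stands in for a slow clone
[ "$rc" -eq 124 ] && echo "command timed out; fail the build fast"
```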

When this problem came back, we did better: builds were slow but could time out (or be canceled) much more quickly, so the demand for idle build capacity was lower. However, we still couldn’t scale effectively without being able to clone our own repo, so there was still a big queue backlog.

During this outage we were able to trace the problem much more deeply. We discovered that packet loss between our servers (in EC2 US East) and GitHub (in Rackspace) was 10-30%, and we started working closely with some folks at GitHub and AWS to figure out what was causing the packet loss, and why packet loss caused slowclones.

And then it went away!

We made two quick, barely substantiated changes: we lowered the MTU on our servers to 1400, to match the manually-discovered path MTU between us and GitHub, to rule out any chance of a PMTUD issue. We also tweaked our git clone commands to try to be a bit less network-intensive (e.g. we made our shallow clones shallower, and cranked the compression settings.) Neither seemed to help.
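For the record, those two mitigations looked roughly like this (the interface name and exact flags are illustrative, and the MTU change needs root):

```shell
# match the manually discovered path MTU to GitHub, ruling out PMTUD
ip link set dev eth0 mtu 1400

# cheaper clones on a lossy network: less history, more compression
git clone --depth 1 "$REPO_URL" project
git config --global core.compression 9
```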

The problem came back a third time, and we got some really odd stuff out of tcpdump, in particular a pattern of very slow retries of lost segments:

17:57:00.959562 IP (tos 0x0, ttl 64, id 46011, flags [DF], length 52)
    circleci.43171 > ack 95967, win 1584, length 0,
    options [nop,nop,TS val 1407819]
17:57:41.433879 IP (tos 0x8, ttl 48, id 29061, flags [DF], length 1500)
    > circleci.43171: seq 90175:91623, ack 5768, win 27,
    length 1448, options [nop,nop,TS val 4131799216 ecr 1407818]
17:57:41.433926 IP (tos 0x0, ttl 64, id 46012, flags [DF], length 64)
    circleci.43171 > ack 95967, win 1611, length 0,
    options [nop,nop,TS val 1417937,nop,nop,sack 1 {90175:91623}]
17:59:02.385413 IP (tos 0x8, ttl 48, id 29062, flags [DF], length 1500)
    > circleci.43171: seq 90175:91623, ack 5768, win 27,
    length 1448, options [nop,nop,TS val 4131819454 ecr 1407818]

Then, we shipped our new deployment code. The slowclones have come and gone a few times since then, and we cope with them as well as can be expected: we scale up fast to handle the spike in builds, but only the people whose projects are cloning slowly are impacted – the global queue doesn’t back up.

GitHub support and infrastructure folks have also been great: they acknowledge the packet loss each time, and resolve it as quickly as they can. However, we still don’t really know the root cause: we know that slowclones are triggered by bad network conditions, but that’s not a fully satisfactory explanation for why a 30-second clone should take 2+ hours.

interlude - bunk git upgrade

The git client upgrade turned out badly. The morning after it shipped, we were informed by our customers that we’d screwed something up, and most of their builds were failing. We immediately reverted the bad code!

This turned out to be a nice, straightforward bug. We isolated it easily: between git 1.7 and 1.8, the default behavior of shallow clones changed from --no-single-branch to --single-branch. In our system, this had the effect of breaking all non-master builds: we’d do a shallow clone and then try to reset to the correct commit, but it wouldn’t be there.
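The change is easy to demonstrate end-to-end with a throwaway repository (the paths, branch name, and identity below are made up for the demo, which assumes a reasonably modern git):

```shell
# Under newer git, `clone --depth` implies --single-branch, so commits on
# other branches are simply absent afterwards. Passing --no-single-branch
# restores the old fetch-all-branches behavior our reset step relied on.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/origin"
git -C "$tmp/origin" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m 'initial commit'
git -C "$tmp/origin" branch feature
git clone -q --depth 1 --no-single-branch "file://$tmp/origin" "$tmp/clone"
git -C "$tmp/clone" rev-parse -q --verify origin/feature >/dev/null \
  && echo 'origin/feature is present'
```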

We fixed our stuff (including adding a non-master-branch end-to-end test!) and rolled forward again the next day without incident. Unfortunately, it didn’t help at all with the slowclones.

issue #2 - machines vanishing

After we’d resolved the kernel panics, while we were fighting slowclones, we also started seeing more and more of our servers vanishing. The symptoms were similar to those we saw during the kernel panic incident: “stuck” builds and spurious failed builds, and cascading effects on the queue.

The first round of analysis found that these boxes seemed to just “wedge” or deadlock: they didn’t panic or reboot or scream into the console. They just stopped responding to incoming requests, and any logged-in sessions became instantly unresponsive. After a reboot, the logs all appeared to stop at the same moment, with no complaints evident in any logs. Monitoring showed no dips, blips or spikes in memory, disk, network, cpu, etc at the time.

Without a lead, we put a lot of effort into mitigation. In particular, we extended the deadman’s switch so that it cleaned up nicely when a server vanished, and we parallelized our scaling code to be able to spin up way more servers simultaneously.

By the time those fixes were deployed, we’d spotted a pattern in the logs, even though it made no sense: 100% of the vanishing boxes were running a “restore cache” action. During “restore cache” we download a .tar.gz file, and then pipe its contents into tar -xzf - processes inside each container, over ssh. Pretty innocuous stuff!
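The data path is easy to sketch locally, with the ssh hop elided since that’s the only part specific to our setup (file names here are made up):

```shell
# "Restore cache" in miniature: a .tar.gz is streamed into `tar -xzf -`.
# In production the pipe crosses ssh into each container:
#   cat cache.tgz | ssh $container 'tar -xzf -'
set -e
work=$(mktemp -d)
mkdir "$work/cache"
echo 'cached artifact' > "$work/cache/dep.txt"
tar -czf "$work/cache.tgz" -C "$work" cache      # the downloaded cache
mkdir "$work/container"
cat "$work/cache.tgz" | tar -xzf - -C "$work/container"
cat "$work/container/cache/dep.txt"              # -> cached artifact
```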

In seeking a repro, we managed to make a box “vanish”, with identical symptoms, by overcommitting part of the LVM stack. This was very suspicious, since it was problems with the LVM stack which had caused our previous kernel panics, so we focused almost exclusively on that part of our system while attempting to reproduce this bug.

We managed to get a semi-reliable repro in our staging environment by hammering on the “restore cache” code paths – but it was very slow and (since it wasn’t reliable) difficult to interpret results.

We made a lot of barely-substantiated changes while we tried to pin this bug down: we fiddled with the initialization of our LVM stack, serialized most of the container initialization steps, and added a bit of swap to the servers. Several times, we thought we’d found the smoking gun… but in the end, none of these tweaks helped.

We also serialized cache restoration: in a parallel build, we had been piping the tarball contents into all containers in parallel. Switching this to one container at a time vastly improved things in production. It was still pretty broken, and much much slower, but we had a bit of much-needed breathing space.

We spoke with friends, talked to support at AWS, and scoured the internet for bug reports and workarounds that might conceivably be related to our issue. We got a lot of advice, a lot of tips, and a lot of leads. We must have tried a hundred permutations of the core LVM stack. Along the way, we identified several scary, possibly related-looking issues in various components, and upgraded just about everything in the vain hope that it was Someone Else’s Problem. None of that worked (though the LVM upgrade, in particular, was something we’d been putting off for a while, and gave us some new superpowers.)

Finally, three hair-pulling weeks into this bug-hunt, we reached the point of going back to re-check all the conclusions we had made so far. Very quickly, an experiment found that the boxes hadn’t been deadlocking at all: they were just dropping off the network! Argh! There was a lot of (╯°□°)╯︵ ┻━┻ in hipchat…

Within an hour of this revelation, and focusing on the network instead of the LVM stack, we had a simple, reliable repro script:

  1. start a fresh cc2.8xlarge instance with the AMI we use for our servers

  2. run our system initialization on it

  3. run 20 concurrent copies of cat /dev/zero | ssh $PUBLIC_IP 'cat > /dev/null'

These steps dropped the box off the network within seconds, every time.
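Expanded out, step 3 is just the following (with PUBLIC_IP being the fresh instance’s address; this obviously needs a live instance and ssh access, so it’s shown for illustration only):

```shell
for i in $(seq 1 20); do
  cat /dev/zero | ssh "$PUBLIC_IP" 'cat > /dev/null' &
done
wait
```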

As soon as we could reproduce the issue reliably, we started to remove things from our system init to isolate the combination of factors that broke the network. Two hours after that, we had a fix in production.

So… remember when we were fighting slowclones, and we tweaked the MTU down to 1400? Well, when we took out that part of our system initialization, the problem went away.

What an anti-climax :(

We sent some repro scripts and data upstream for other people to investigate why something as innocuous as changing the MTU would cause complete networking collapse… but we haven’t chased these up. We have no answers… only a working system :P

interlude - schejulure bug

At 12:11am on the first Sunday after we started fighting the vanishing servers bug, we were alerted by a slew of very strange errors. The deadman’s switch had misfired! Servers were constantly pushing each other out of the coordination system, even though they were fine.

We tracked down the bug to schejulure, a simple scheduling library we use: it was using 0-based day of the week, whereas the underlying time library used 1-based day of the week. We forked their code, fixed the bug, switched to our fork… and things went back to normal.

We even filed a pull request, so now everyone else’s scheduled tasks will also run on Sundays :)


This was a brutal month, but on the whole, it worked out well. We weathered the storm, and our infrastructure is massively more robust in the face of failures and odd load patterns than it was in March. We have better monitoring and alerting. We’ve refined our outage communication processes, so our customers hear about problems from us, not the other way around.

Most importantly, we now have the all-purpose (shitstorm) emoticon in hipchat.

We want to apologize to everyone whose builds failed or ran slowly during this time. We also really want to thank those customers who said nice, supportive things – it seems many of you had similar problems during your growth, and your support was really appreciated during our most difficult days of desperate bug-hunting.

There’s still a lot of work to do. Here’s hoping we don’t have to use that emoticon too often!