Batter Up

Like most programmers, I spend a fair amount of time chasing down and repairing bugs.

A lot of bugfixing is about as fun as whack-a-mole: a bug pops up, you mash it with the mallet, you wait for the next one. There’s some satisfaction from whacking those little brown lumps… I guess… but the game never seems to end! I think this is what people fear when they think of maintenance programming: it’s like a game of whack-a-mole… forever.

From time to time, I get a bug that’s much more interesting. I start out, mallet ready, thinking it’s a game of whack-a-mole… but the moles seem to be playing chess instead, and it looks like I’m going to have to take the top off the cabinet to figure out what’s going on… and I must say, nothing improves a game of whack-a-mole like playing it with a soldering iron instead of a foam mallet!

Anecdotes! Clearly you demand them.

Here’s a pull request description from last month:

> fix a stupid type bug, which is confusing and prevents stop-shutdown logic from working by prematurely putting masters into the :final-shutdown state.

This was a fairly serious issue: it caused our scaling to lag, and it was operationally costly. But the diagnosis and fix were utterly mundane: a missing type coercion meant some code was handling a “true” instead of a true value – a string instead of a boolean. Unfortunately not in a type-checked namespace ;)
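To make the shape of that bug concrete, here's a minimal sketch — entirely made-up names, not the code from the PR — of how a stray string can masquerade as a boolean in Clojure and push a master into shutdown prematurely:

```clojure
;; Hypothetical sketch of the class of bug, not the actual code.
;; A flag that should be a boolean arrives from an external source as a string:
(def graceful? "true")            ; a string, not the boolean true

;; Code written against a real boolean misbehaves on the string:
(if (true? graceful?)
  :wait-for-builds                ; intended path
  :final-shutdown)                ; what actually runs: (true? "true") => false
;; => :final-shutdown, prematurely

;; The missing coercion: turn the string into a real boolean first.
(if (true? (Boolean/parseBoolean graceful?))
  :wait-for-builds
  :final-shutdown)
;; => :wait-for-builds
```

Once you've spotted the string, the fix is a one-line coercion — which is exactly why the hunt itself was so dull.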

On the other hand, this next PR addressed an issue with very low impact: a small percentage of one customer’s builds failing the wrong way – that is, they were doomed builds, but we mis-reported the failures:

> This fixes two little bugs, both of which conspired to create some circle-bugs over the last few days.
>
> Bug #1: the c.b.ssh/scp call to with-session wasn't guarded from trying to connect to :oom? true nodes. Bug #2: c.b.action/detect-oom returned nil instead of true when the build was already marked :oom? true.
>
> So the old behavior was:
>
> * rspec ooms
> * cucumber inference still tries to scp (via remote-spit) to the node
> * scp eventually throws an ssh exception
> * the exception is mishandled because detect-oom doesn't return true
>
> And the new behavior:
>
> * rspec ooms
> * cucumber inference calls remote-spit
> * scp throws instantly, without trying to connect to the node
> * the exception is suppressed because we know the node is oom
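Neither fix was complicated on its own. Here's a rough sketch of the shape of the two changes — the names detect-oom, scp, remote-spit, and with-session come from the PR, but the bodies and signatures below are guesses for illustration only:

```clojure
;; Hypothetical reconstruction; bodies, signatures, and helpers are guesses.

;; Bug #2: detect-oom returned nil for a node already marked :oom? true,
;; so callers couldn't tell "known OOM" from "no OOM at all".
(defn detect-oom [node]
  (or (:oom? node)                     ; explicitly report a node we already flagged
      (oom-in-kernel-log? node)))      ; hypothetical helper

;; Bug #1: c.b.ssh/scp (which remote-spit calls) opened an ssh session even
;; to a node we already knew was dead.  Guard the connection on :oom?.
(defn scp [node src dest]
  (when (:oom? node)
    (throw (ex-info "node is OOM; refusing to connect" {:node node})))
  (with-session node                   ; signature guessed
    (fn [session]
      (copy-file session src dest))))  ; hypothetical helper
```

The satisfaction came from seeing how the two halves conspired: the missing guard created the exception, and the missing true value made sure it was mishandled.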

But impact schmimpact: that second fix was at least 1000 times more satisfying to track down!

I have a theory: one bug is boring. The fun bug hunts are always the ones where there are two or more interacting bugs.

Why? My hunch is that singleton bugs are fundamentally different, in that they don’t actually require problem solving or mental modeling. The bug is just… there. It’s in the text of the code, waiting to be found by careful inspection. Whether the bug is a missing type coercion, or a piece of information that’s missing from the UI, or whatever, there’s not much nuance to explaining the problem. The solution might be interesting, but we’re talking about the hunt, here :)

On the other hand, as soon as you’ve got at least two bugs interacting, you have to mentally model the behavior of the system. You get to come up with crazy, off-the-wall hypotheses to try to explain the evidence (the bug report, your logs, and so on), and then experiment – either against the real thing, or against your mental model of the system – to see if any of your hypotheses hold up. It’s science!

This even seems to be true of customer support. Sometimes we’re just working through mundane issues, one at a time:

> > customer: "Hello, our build queue is growing quite large, and it looks like we're only running one build at a time. I believe we are paying for 3 concurrent builds. Can you look into this?" > > > > us: "Your plan supports 3 containers, and your main project takes 2. If there are two other builds in any other project, that main project build will be blocked. We're adding UI to expose this real soon (hopefully this week). Sorry for the trouble." > >

… but other times, they’re doing something weird, which causes us to do something weird, so we have some fun:

> > customer: "Hey guys, we're still seeing only 2 builds running occasionally even with 3 containers added to our plan for a good long while." > > > > us: "This is a really interesting case. It's a bug on our end, but a bit of an outlier. Let me explain: the way our caches work is that we take a readonly snapshot of the filesystem, and then kick off a background thread which creates the cache against that snapshot. After the build finishes, we join the thread. So: if that background thread takes longer than the rest of the build, it will block at that point. Specificially, it holds that build's container(s) for longer than the build time we show in the UI. This is pretty rare, and your case is actually really extreme: it takes around 5 minutes after your builds have "finished" for us to finish tarring up the cache!" > >

Those customers were both experiencing the same symptom: their builds weren’t running when expected. Both needed our support, because the platform wasn’t giving them enough information to understand what was going on. But that second case was enormously more interesting to dig into, because something they did with their build turned one of our features (“background cache saving! neat!”) into a bug. Now the moles are trying to whack me!

I should be clear: we provide support for mundane issues, and we fix mundane bugs. We’re professionals, and whether it’s interesting or not, we do the work that’s needed. Still, I like to reflect on why some of the work is so much more fun…

Even operational issues seem to follow this rule. Whenever we have an operational incident, we produce a brief incident report afterwards. Here are two of our recent ones. First:

> We had a queue backlog because:
>
> * our low-water scaling threshold was too low

Then, a few weeks later:

> We had a queue backlog because:
>
> * 14 hours earlier, there was a tiny networking hiccup within AWS.
> * a few of the machines affected by this hiccup got stuck in an AWS library network call... without a timeout.
> * one of those machines happened to be the elected cluster leader, responsible for scaling.
> * the stuck networking call happened to be run while this machine was holding the scaling lock.
> * the scaling lock also lacks a timeout.
> * as a result, scaling completely stopped for the past 14 hours
> * for the first 14 hours, this was fine, because it was a low-demand period
> * a week earlier, we changed our alert configuration
> * a bug in the alerting software caused all these alerts to be silently dropped
> * as a result, the alarms which should have warned us when scaling stopped were dropped
> * as a result, we weren't alerted about the queue backlog until customers complained

Same symptom and same customer impact, but you’ve got to love that second one – I mean, bugs in two third-party components (AWS and the Java AWS library) triggering a latent bug in our software, and then a bug in another third-party component (monitoring) causing us to miss the first bug’s impact. It was a nasty outage and a gnarly debugging session… but very, very satisfying to be able to diagnose exactly what went wrong, and why!

One thing can really sap the joy out of one of these bug hunts: the sinking realization that I’ve done this before. Which is why the only actual advice I’m going to leave you with is this: write integration tests. An integration test serves as proof of your discovery (you did watch your test fail, right?), proof of your solution, and, most importantly, a guarantee that you’ll never have to retrace the steps of this particular bug hunt again.
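For the OOM hunt above, such a test might look something like this – a sketch only, using clojure.test, with a made-up namespace and hypothetical fixtures:

```clojure
;; A sketch only: the namespace and helpers are made up for illustration.
(ns build.oom-regression-test
  (:require [clojure.test :refer [deftest is]]))

(deftest oom-failure-is-reported-as-oom
  ;; Recreate the conditions of the hunt: a doomed build on an OOMed node.
  (let [node   (make-oom-node)              ; hypothetical fixture
        result (run-build-against node)]    ; hypothetical fixture
    ;; Watch this fail against the unfixed code, then pass after the fix.
    (is (= :oom (:failure-reason result))
        "a doomed build on an OOM node must be reported as an OOM failure")
    (is (zero? (ssh-connection-attempts node))
        "we should never try to connect to a node we know is dead")))
```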

I certainly don’t condone deliberately complicating or obfuscating your software just to make the debugging more interesting. Well… maybe just a little ;)