Maurice Wilkes led the team that built EDSAC, one of the earliest stored-program computers, in the late 1940s. In his memoirs he described a realization that came to him in 1949 while working on the machine. The EDSAC was housed on the top floor of the building, with the tape-punching and editing equipment one floor below it:
“It was on one of my journeys between the EDSAC room and the punching equipment that ‘hesitating at the angles of stairs’ the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.” — Maurice Wilkes, Memoirs of a Computer Pioneer, 1985
So, debugging has always gone hand in hand with programming. Practices like continuous integration or effective telemetry can help us to surface issues, but once a problem is in front of us we need to roll up our sleeves and figure out exactly why it is manifesting now.
The problem can be in our own code, in a colleague’s, or in some library or infrastructure we depend upon. Often it’s a subtle and confusing mix, comprising multiple interactions running right through our software and hardware stack.
It can be difficult to get from symptoms (a problem report, a stack trace) to causes. We tend to learn this skill unevenly, on the job, rather than in any systematic way. But debugging quickly and effectively is learnable, just like everything else in technology.
We can work back from the symptoms in an iterative series of steps. Or we can start at the “top,” where the system began its work, and follow code down to form a model of the problem.
Whatever the approach, the method is more or less scientific:
- Look at the facts we have
- Try to reason about the state the system was in when it broke
- Form a hypothesis about what happened
- Test it, using existing telemetry or new code, changing one thing at a time
- Repeat the process until we find the error
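The cheapest way to "test it, using existing telemetry or new code" is often a single log line added at the point where we believe things go wrong, changing nothing else. A minimal sketch in Python; the in-memory broker, `publish` function, and key names are all hypothetical stand-ins, not a real messaging API:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("debug-session")

# Hypothetical in-memory stand-in for a broker queue (for illustration only).
queue = []

def publish(message, routing_key, binding_key):
    # Hypothesis under test: "the message is never sent." We log immediately
    # before the send and change nothing else.
    log.info("about to publish %r with routing key %r", message, routing_key)
    if routing_key == binding_key:  # delivered only when the keys match
        queue.append(message)

publish("hello", routing_key="jobs", binding_key="")
print(len(queue))  # 0 -- the log line fired, so the send was attempted;
                   # the fault must be downstream of this point
```

If the log line appears but the message still goes missing, the "not sent" hypothesis is refuted as fact, and we move on to the next one, exactly as in the worked example below.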
In a discussion of debugging across our engineering teams, Marc gave a great example illustrating this approach:
Last week I was debugging a strange RabbitMQ issue where messages were not being delivered as part of a migration between two clusters.
- Fact: The message was not delivered.
- Hypothesis: The message was not sent.
Looking at logs, I could see a line from immediately before the message was sent, and I could see no indications of an error or an exception sending the message.
- Fact: The message was sent to the queue.
- Hypothesis: The message was sent but not delivered.
Perhaps the exchange was not connected to the queue. So we opened the RabbitMQ UI and inspected the configuration. The queue binding on the new RabbitMQ cluster looked legit, yet it was behaving differently than the old cluster.
- Fact: The queue is bound to the exchange.
- Hypothesis: The topology on the two clusters is different in some way.
We looked at the old cluster on the web UI in one browser tab, and the new cluster in another tab, and flicked between the two tabs to see what was different. We saw a difference with routing keys.
- Facts: The old cluster had a routing key defined for the queue binding; the new cluster did not. The old binding had been set up manually, whereas the new one was set up via configuration management.
- Hypothesis: The configuration management code is not setting the bindings up correctly.
This was easy to confirm and fix.
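The tab-flicking comparison in the last step can be mechanized. The sketch below diffs two lists of bindings shaped like the objects RabbitMQ's management HTTP API returns (each with `source`, `destination`, and `routing_key` fields); the cluster data is illustrative, and fetching it over HTTP is left out:

```python
def binding_diff(old, new):
    """Return bindings present only in the old cluster, and only in the new."""
    key = lambda b: (b["source"], b["destination"], b["routing_key"])
    old_set = {key(b) for b in old}
    new_set = {key(b) for b in new}
    return old_set - new_set, new_set - old_set

# Illustrative data mirroring the shape of /api/bindings responses.
old_cluster = [{"source": "events", "destination": "jobs",
                "routing_key": "job.created"}]
new_cluster = [{"source": "events", "destination": "jobs",
                "routing_key": ""}]

missing, extra = binding_diff(old_cluster, new_cluster)
print(missing)  # {('events', 'jobs', 'job.created')} -- the routing key
                # present on the old cluster is absent on the new one
```

A diff like this turns "flick between two tabs and squint" into a fact we can state precisely in the next hypothesis.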
It takes practice to restrict ourselves to just the facts when forming hypotheses. A common error is to assume, without evidence, that if we saw X, then Y must be true too, and then to hypothesize from a position of “X and Y.”
For example, we might see a set of tasks failing to make progress, assume the problem is capacity, and immediately scale up. However, if the problem is contention on a queue shared by our tasks, we only make it worse. This is an assumption we could have checked first.
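That capacity assumption can often be checked with a back-of-the-envelope model before touching the autoscaler. Here is a toy sketch (all names and numbers are invented for illustration): if the shared queue's lock serializes every task hand-off, throughput is capped by that serial portion, and adding workers changes nothing.

```python
def completed_tasks(workers, lock_held_per_task, work_per_task, duration):
    """Crude Amdahl's-law-style model of workers sharing one queue.

    Each task needs `lock_held_per_task` seconds holding the queue lock
    (serialized across all workers) and `work_per_task` seconds of
    independent work. Returns tasks completed within `duration` seconds.
    """
    serial_rate = duration / lock_held_per_task        # cap set by the lock
    parallel_rate = workers * duration / work_per_task # cap set by CPU
    return int(min(serial_rate, parallel_rate))

print(completed_tasks(workers=4, lock_held_per_task=1.0,
                      work_per_task=2.0, duration=100))  # 100
print(completed_tasks(workers=8, lock_held_per_task=1.0,
                      work_per_task=2.0, duration=100))  # 100 -- doubling
# workers does not help: the queue lock, not capacity, is the bottleneck
```

Even a model this crude makes the assumption explicit: if throughput is pinned at the serial rate, scaling up is the wrong move.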
On the other hand, experience can allow us to shortcut some hypotheses and move faster towards an answer. It’s a balance: educated guesses are great, but it’s vital that we make our assumptions explicit and validate our guesses before moving on.
Sometimes, especially for issues that are hard to reproduce, we will realize that we don’t currently have the information to test our hypotheses. Then, the trick is to get new tooling or telemetry in place so we can get it next time.
Write it down
It’s valuable to take notes as we go, particularly over a long investigation:
- Looking back at early hypotheses once we’ve gathered more data and filled in our mental model of the system, or with fresh eyes the following day, can expose incorrect assumptions and be a powerful source of new ideas.
- When we try to write down a clear explanation of something difficult, it helps us dig beyond the surface details and highlights the parts we don’t really understand.
- Having notes to share makes it easier to bring others up to speed. Debugging, like programming, works best as a team sport!
Like many skills, debugging rewards close attention, a little introspection about method, and learning alongside others. Try out the method above, look out for the pitfalls, and share your work with colleagues and friends. Read the resources below for some more context and ideas.
Debugging is a first-order skill for software engineers. As a challenge we face every day, any effort we put into getting better at it pays us back again and again.
- The Discovery of Debugging by Brian Hayes is a wonderful essay on Maurice Wilkes and EDSAC.
- What does debugging a program look like? and So you want to be a wizard by Julia Evans are full of good ideas and resources.
- Effective Mental Models for Code and Systems by Cindy Sridharan is a philosophical exploration of the mental models we form as programmers, with lots of great references.
- Computers can be understood and the follow-up Systems that defy detailed understanding by Nelson Elhage are particularly valuable in pointing out pitfalls in modeling and understanding large and complex systems.