In February, CircleCI (and basically half of the Internet) was severely affected when AWS experienced an S3 outage. Although outages are never fun, they are a prime opportunity to check your incident plan, work as a team, and improve your systems and performance.
While we do everything we reasonably can to provide uninterrupted service, we will (very rarely) experience degraded service, or even an outage. Sometimes, this is because of issues introduced by our team, or mistakes we make.
In addition, we sit between–and are dependent on–many other services (GitHub, Bitbucket, AWS, Heroku, etc.). So when one or more of those experience outages, we will also feel the pain.
Here’s what we do–and in some cases, how we learned to do it the hard way–when things go sideways.
There are many different approaches to handling incidents (just Google “Incident Response”), but they all share two things: first, have a plan, no matter how simple; second, follow the plan. Both seem obvious now, but it is amazing how many people assume that they can just throw smart engineers at a problem and have that be their “response”.
Parameters for our Incident Response Plan
At CircleCI our incidents are managed by people across many teams, and we all use a number of rules:
- Anyone can declare an incident. Often it’s not the SRE team or an engineer who notices that things are “not quite right”.
- Once an incident is declared, the conversation moves to our #incident chat room. This helps consolidate all of the information about the incident.
- We then decide who is performing two core roles, commander and communications (I’ll go into more detail on these later in the post):
  - Incident Commander: the person responsible for the incident
  - Communications: the person responsible for status updates
- We identify who is helping from Engineering.
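To make the checklist above concrete, here is a minimal sketch of an incident record that tracks which roles are still unassigned. This is an illustration only, not CircleCI's actual tooling: the `Incident` class, its field names, and `missing_roles` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical record of a declared incident (names are assumptions)."""

    declared_by: str                      # anyone can declare an incident
    summary: str
    channel: str = "#incident"            # where the conversation moves to
    commander: Optional[str] = None       # responsible for the incident
    communications: Optional[str] = None  # responsible for status updates
    engineers: List[str] = field(default_factory=list)

    def missing_roles(self) -> List[str]:
        """Return the checklist items that have not been filled yet."""
        missing = []
        if self.commander is None:
            missing.append("commander")
        if self.communications is None:
            missing.append("communications")
        if not self.engineers:
            missing.append("engineering")
        return missing
```

A bot or a pinned checklist could call `missing_roles()` right after the incident is declared, so nobody starts debugging before the two core roles are filled.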
Guidelines for During-Incident Response
After we’ve fulfilled the criteria above, we start to debug and determine the root cause of the outage. We often also start determining what temporary changes are needed to handle the cascading “messy bits” that outages always seem to create.
The very start of the incident response is when you need to be clear about who is filling the roles above. The Incident Commander is not necessarily the person working on solving the incident, but they do have to be fully present: they are the one making sure that comments and small to-do items are not forgotten. Another primary task for them is to ensure that the communication channel is kept clear of extraneous chatter.
The Incident Commander also needs to lean heavily on the Communications person so that they are not bogged down in fielding all of the “hey, what’s the status” questions that often occur during incidents. The Communications person also makes sure that Status Pages and other customer-facing communications are handled cleanly.
We’ve adopted a status policy where the Communications person uses a 20-minute timer to remind the team that an update is needed. The Commander then recommends an update, and the Communications person word-smiths it to follow our amazing style and tone guide, which includes examples of status messages. These examples, written by our marketing team, spare us from fretting over word-smithing while hip-deep in issues.
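The 20-minute reminder can be as simple as a kitchen timer, but it is also easy to automate. The sketch below assumes a hypothetical `update_due` helper that a chat bot might poll. Only the 20-minute interval comes from our policy; everything else is illustrative.

```python
from datetime import datetime, timedelta

# The 20-minute cadence is from the policy; the names are assumptions.
UPDATE_INTERVAL = timedelta(minutes=20)


def update_due(last_update: datetime, now: datetime,
               interval: timedelta = UPDATE_INTERVAL) -> bool:
    """Return True once a full interval has passed since the last update."""
    return now - last_update >= interval
```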
Once the root cause of an incident has been determined, the Commander will indicate to the Communications person that we need to change the Incident Status to “Identified”. A similar change is made when we deploy a fix, and also when we’re confident the fix solved the problem. After the fix is verified, we maintain a final “monitoring” phase for 30 minutes to ensure that the incident is truly resolved.
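The status changes described above form a small state machine. Here is a minimal sketch of it. “Identified” and the 30-minute monitoring phase come from our process; the other state names (`INVESTIGATING`, `FIX_DEPLOYED`, `RESOLVED`) and the `IncidentStatus` class are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    INVESTIGATING = "Investigating"  # assumed name for the initial state
    IDENTIFIED = "Identified"        # root cause determined
    FIX_DEPLOYED = "Fix deployed"    # assumed name
    MONITORING = "Monitoring"        # entered once the fix is verified
    RESOLVED = "Resolved"            # assumed name for the final state


# Each status may only move forward to the next one.
NEXT_STATUS = {
    Status.INVESTIGATING: Status.IDENTIFIED,
    Status.IDENTIFIED: Status.FIX_DEPLOYED,
    Status.FIX_DEPLOYED: Status.MONITORING,
    Status.MONITORING: Status.RESOLVED,
}

MONITORING_WINDOW = timedelta(minutes=30)


class IncidentStatus:
    def __init__(self) -> None:
        self.status = Status.INVESTIGATING
        self.monitoring_since: Optional[datetime] = None

    def advance(self, now: datetime) -> Status:
        """Move to the next status, enforcing the monitoring window."""
        nxt = NEXT_STATUS.get(self.status)
        if nxt is None:
            raise ValueError("incident is already resolved")
        if nxt is Status.RESOLVED:
            elapsed = now - self.monitoring_since
            if elapsed < MONITORING_WINDOW:
                raise ValueError("monitoring window has not elapsed yet")
        if nxt is Status.MONITORING:
            self.monitoring_since = now
        self.status = nxt
        return self.status
```

Encoding the order this way means nobody can mark an incident resolved before the monitoring phase has run its full 30 minutes.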
A key part of our Incident Response Plan is that we also make sure that everyone who’s responding to the incident gets a chance for breaks, food and other self-care–nothing gets solved if the team is not getting enough sleep, or food, or a minute to think.
Another aspect of our focus on self-care is the understanding that an incident is owned by the whole team, which allows us to take advantage of having staff based in multiple time zones. Long-running incidents often move across time zones as they flow to that time zone’s work day. The ability to have incidents “follow-the-sun” is really only possible because we work to ensure that knowledge about our system is distributed across all of our teams.
Lessons Learned the Hard Way
Some other items in our Incident Response Plan we had to learn the hard way :)
- We don’t change the server counts during an incident, especially during AWS-related incidents. AWS often throttles API calls and then you end up competing with the thousands of other companies who are also trying to start instances.
- When possible, we shed load by turning off low-priority background processes to free up extra resources.
- We keep chatter and speculation in our #incident Slack channel to a minimum, redirecting side discussions into our normal #ops and #engineering channels. It’s not unusual for engineers who are not working on the incident to monitor #incident to keep abreast of the issues involved. It takes some practice and self-discipline to keep #incident ruthlessly on-topic, but it can be done.
- If appropriate, we also disable certain features to reduce the amount of churn we would get. An example of this is our auto-retry of all “circle-bugged” jobs, which if left on during an external outage could easily overwhelm our job queue with retries.
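Load shedding and feature disabling are easiest when they are a single switch rather than a scramble through config files. The sketch below shows one way that could look. The `Switches` class and both helper functions are hypothetical; only the “circle-bugged” outcome and the idea of pausing low-priority background work come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switches:
    """Hypothetical feature switches; both default to normal operation."""

    auto_retry_enabled: bool = True
    low_priority_jobs_enabled: bool = True


def enter_incident_mode() -> Switches:
    """Return switches with retries and low-priority work turned off."""
    return Switches(auto_retry_enabled=False, low_priority_jobs_enabled=False)


def should_auto_retry(job_outcome: str, switches: Switches) -> bool:
    """Retry "circle-bugged" jobs only while auto-retry is enabled."""
    return switches.auto_retry_enabled and job_outcome == "circle-bugged"
```

With auto-retry switched off, jobs that fail during an external outage stay failed instead of piling back onto the job queue.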
Post-Incident Action Items
After the incident is over, the real work begins. You should always do an incident post-mortem, even if it’s only the Incident Commander reviewing the log and updating docs and/or run-books. If the incident was external, you can still discover edge cases or single-points-of-failure within your infrastructure that can be improved. If they cannot be improved for various messy real-world reasons, then at least add monitoring around them so you can be alerted before the behavior becomes an incident.
After the post-mortem, the Incident Report is created and posted internally. Posting our Incident Report in a public location is something we only just started doing, after customers told us that they are very interested in hearing about the “why” and “how” of incidents, so we have added that step to our Incident Checklist.
I have two last recommendations about incident response plans. The first recommendation is that you should review your plan after every incident to see how it can be improved. Our plan has changed significantly over the years, and updating it is the best way to capture and preserve knowledge and best practices around incident responses.
The second recommendation is to make sure everyone knows your Incident Response Plan and knows how to use it. Like many things, the plan’s useless if your team doesn’t know it, doesn’t know where to find it, or doesn’t know how to implement it.