I was recently part of a panel at #SRECon, where I shared my thoughts on how to onboard Site Reliability Engineers (SREs). For anyone who missed it, here is what I discussed. It is aimed at people who are considering joining an SRE team or who are actively onboarding or managing SREs.
I graduated from a program called Digital Futures, where I studied game design, programming installations and wearable tech. After working for a bit in the design industry, I decided I wanted to be programming instead. When I started at CircleCI, I was a Junior DevOps engineer, handling scaling and daily server maintenance.
The job didn’t require in-depth technical knowledge or programming skill, but it did require learning quickly and adapting to a changing environment. In fact, it’s worth listing the non-technical qualities of a potential SRE:
What to Look For in a Potential SRE
- Curious: Pokes things and takes them apart. Researches topics and asks questions.
- Determined: Sees tasks through against all odds.
- Self-taught: Reads books/documentation and experiments with tools.
- Adaptable: Has a positive attitude towards change. This job constantly changes.
- Automates: Creates systems (not necessarily technical) to solve repetitive problems and increase efficiency.
- Calm: Works well under pressure and in emergency situations.
These qualities help an SRE thrive in their environment and get up to speed quickly.
Getting Up to Speed as an SRE
Read the docs and revise them. By editing them and finding errors, you’ll be filling in gaps in your own knowledge. Sometimes these docs won’t exist. In those cases, make them. Don’t worry if you don’t understand everything; that comes with time.
Attend meetings and listen to how people speak about systems and concepts. Get the lingo down so you’re not tripping over terminology. Take notes to reinforce your learning.
Poke around and ask questions. Don’t be afraid to ask what a script does or why it’s being run. Just avoid asking at bad times, such as during an incident, after everyone has gone home for the day, or while people are on holiday.
For Your Manager
Pair program with your new SRE and encourage them to shadow experienced engineers. A new engineer will learn much more quickly by working with a real human on real projects than by reading docs.
Make sure your team keeps the docs up to date. You won’t always be able to spare the engineers for a pairing session, so it’s important to have context and details available elsewhere.
Give manageable projects with clear goals. Open-ended projects can be intimidating to new engineers who don’t know the codebase. Structure is your friend.
Gradually introduce SREs to being on call. Have them shadow other SREs while they put out fires. Next, let them do it themselves with other SREs on standby. Finally, let them have the full responsibility, with mentorship as needed.
For Your Teammates
Don’t hand the new SRE the answers. Ask leading questions and allow them to arrive at conclusions on their own. Lessons learned this way last longer.
Show them where to look. If there are helpful docs that aren’t obvious, point them out. Showing them the process of finding information at your org is helpful as well.
Have your new SREs go through the docs. Fresh eyes means fresh finds, and they’ll learn about the product as they correct mistakes and fix holes.
Don’t assume they know what you’re talking about. Define concepts, terminology or jargon that are unique to your organization. Confirm their understanding by asking them if things make sense. If they’re not asking questions, the material is either too easy or too hard. Figure out which.
Don’t dump knowledge. Talk about why you’re sharing something and how you learned it. Don’t expect your new SRE to remember everything on the first try.
Explain your escalation policies. Keep them airtight by ensuring you have redundancy and multiple backup contacts.
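If your escalation policies live somewhere you can export them, a small check can flag the ones without enough backups. The sketch below is only an illustration: the `EscalationPolicy` structure, the service names, the contact names and the three-contact threshold are all made up for the example, and real paging tools model policies differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EscalationPolicy:
    # Hypothetical shape for illustration; not any real paging tool's schema.
    service: str
    # Ordered list: primary contact first, then backups.
    contacts: List[str] = field(default_factory=list)


def find_gaps(policies, min_contacts=3):
    """Return human-readable problems found in the given policies."""
    problems = []
    for policy in policies:
        unique = set(policy.contacts)
        # The same person listed twice is not real redundancy.
        if len(unique) < len(policy.contacts):
            problems.append(
                f"{policy.service}: same contact listed more than once"
            )
        # min_contacts is an assumed threshold; pick what suits your team.
        if len(unique) < min_contacts:
            problems.append(
                f"{policy.service}: only {len(unique)} distinct contact(s), "
                f"want at least {min_contacts}"
            )
    return problems


# Made-up example data.
policies = [
    EscalationPolicy("api", ["alice", "bob", "carol"]),
    EscalationPolicy("database", ["alice", "alice"]),
]

for problem in find_gaps(policies):
    print(problem)
```

Walking a new SRE through output like this is also a natural way to explain who gets paged, and in what order.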
Share war stories. Knowing the CTO took all of production down for more than 24 hours and still has a job can be extremely comforting to a new SRE.
More detail is preferable. It gives a new SRE the chance to say, “I already know that,” instead of “I don’t understand a single word you just said, can you start over?” It’s a small way to give them a confidence boost and keep them motivated.
Thinking About Failure
Since risk is always present, it helps to define how your organization thinks about it. We think that:
Post-mortems should be blameless. Most incidents involve many factors; pointing fingers doesn’t help find them.
Mistakes should be celebrated. They mean your new SREs are learning!
War stories are badges of honour and should be worn with pride. Share these stories freely. This will help new SREs understand that:
- Failure happens.
- Failure will happen on your watch.
- You need to be prepared to deal with failure when it happens.
- You will still have a job when you fail.
Onboarding an SRE is really not that different from onboarding a vanilla software engineer. There’s just more risk and thus more fear. Hopefully these best practices and info will help you, whether you’re looking for potential hires, starting an SRE role, or helping someone else start theirs.
Huge thanks to Ruth Wong for organizing the panel and thanks to @lizthegrey and @jennski for tweeting about what we discussed.