Observability and CI/CD: meaningful measurement with Charity Majors
Sep 17th, 2021 | 27 minutes
Charity is an ops engineer and CTO at honeycomb.io. Before this she worked at Parse, Facebook, Linden Lab, etc on operations and developer tools, and always seems to wind up running the databases. Co-author of O'Reilly's "Database Reliability Engineering". Loves free speech, free software, and single malt scotch.
Rob Zuber is a 20-year veteran of software startups, a four-time founder, and three-time CTO. Since joining CircleCI, Rob has seen the company through its Series F funding and delivered on product innovation at scale while leading a team of 300+ engineers who are distributed around the globe.
Rob Zuber: Hello, and welcome to The Confident Commit, the podcast for anyone who wants to join the conversation on how to deliver software faster and better.
Rob Zuber: We’re going to talk about observability, probably, and probably a lot of other fascinating things knowing Charity. I’m your host Rob Zuber, CTO of CircleCI, the industry leader for all things CI and CD. Charity, thanks so much for joining me here today.
Charity: Thanks so much for having me.
Rob Zuber: So awesome to have you. We’ve obviously traveled in similar circles for over 20 years I think now. For anyone who wouldn’t know this…
Charity: Jesus Christ.
Rob Zuber: … we both worked at Critical Path.
Charity: Oh my God. You were at Critical Path?
Rob Zuber: Yeah. At different times.
Charity: Oh my God. Crazy.
Rob Zuber: I joined just after you left.
Charity: I was 17 then.
Rob Zuber: I wish I could say I was 17 then, but I will let anyone do the math on where I’m at. But a lot of experience, a lot of change in the industry. And then at some point, I don’t even know when this was exactly, but I was building some mobile apps, I guess in 2013, and we were building them on top of Parse. I did my best to design good schemas and I apologize for whatever it was that we did, and then saw as you left there and went on to do Honeycomb after the Facebook acquisition. Well, now CircleCI is a Honeycomb customer. We love the product. I actually…
Charity: And we’re a Circle customer.
Rob Zuber: It is a little bit like that in this industry, but hopefully that’s because we’re all doing good things that we’re excited about. I’m excited to get into all of that, but one of the things that really made me want to have a chat with you is, I noticed that a lot of the time when you’re speaking publicly, you talk about observability for sure and reliability, but one of the things that keeps coming up, is CI and CD and let’s just say some fairly strong opinions about how that…
Charity: Just a little bit.
Rob Zuber: That’s probably true of a lot of what you’re talking about, but that’s cool. I want to hear the strong opinions. And I’m curious, well, let’s just start right in the middle there and then we can work back to some of these other pieces. But sitting where you’re sitting, looking at production systems where, as we said, you have a lot of experience and we’ll kind of dive into that, what is it that motivates you to be so interested and so focused on the delivery process?
Charity: Well, a kind of shitty engineer programmer. Well, the honest answer is, I was diagnosed last year with ADHD, which apparently explains a lot of it. I struggle with sitting there and planning what I’m going to be doing tomorrow, next week. It’s like a highway open, just stretching out as far as you can see it and it’s very tedious. I can’t motivate myself, that a virus, or who knows what’s going to happen, or every day is a different problem, I feed on that adrenaline. I’m never calmer than I am when the site is down, nobody can get it up and if I don’t figure it out, the company is fucked. I love that. As an industry, we’re always trying to build ourselves out of that sort of a job, which is good because there’s always more fires to fight. The unpredictability, the chaos-ity, just systems. I find systems more interesting than software, I guess.
Rob Zuber: Got it. Got it. And so that led you, I guess, into more of the operational side. One thing I mentioned being a Parse customer, and I think I remember you talking about Parse when you were working on that, and the fact that you were trying to manage a large scale database, but had no control over what people were choosing to do with it.
Charity: Any of the schemas or queries or anything that people were going to dump in. And not just a database, but by the time we got acquired by Facebook, we had 60,000 mobile apps, and Parse, there were over a million by the time I left, so that’s a million little agents of chaos. I’d wake up every morning and a different one would’ve hit the iTunes top 10. It’s just like, “Well, what are we going to do about this?”
Charity: Because often, the ones that were causing the problems were not ones that were causing the most traffic, or that were having the most users, or that were the most by volume or anything. So all of those top 10 lists that got generated by your New Relics and your APMs, basically, they were just noise. Everything would get slower at once because there’s this choke point of the API servers. We didn’t use threading. We had Ruby on Rails, so it’s a fixed number of workers. So everything would get slow whenever anything got slow and trying to find a prime mover there was insanely difficult. It was basically impossible.
Rob Zuber: And so did that experience, that sort of, the problem isn’t necessarily coming from the largest volume, did that really shape how you ended up thinking about observability?
Charity: Absolutely. I consider myself a good engineer. And it was really damaging to my self-respect, the way Parse kept going down and down and down, and every time, and it’s just like, “What is happening?” We were doing a lot of things before their time. We were doing microservices before microservices were really understood and people really understood the failure conditions of them. We were doing a lot of these things that were just kind of cutting edge. And we had this problem of whenever you’re a platform, it’s your job to be running your platform, but then to be running their apps too. And like you said, you have no control over what they bring to the table. You just have to take it and deal with it naively, because as a platform, anytime you have to think about a single customer, you’ve probably failed in some way.
Charity: But we just ended up firefighting, doing database DBA stuff for app, after app, after app, and it got exhausting. And I tried every tool out there, but anything that uses metrics it’s so focused around, you decide up front what metrics you want to collect and then [inaudible 00:06:20] for you. Well, I don’t know what metrics I’m going to need. It’s a different one every time, and you can only pre-generate so many and you can’t slice and dice on the ones that you have. It’s just a bad fit. And with logs, it’s like, well, if I know what to look for, it’s the same thing. If I know what to look for, I can find it, but I don’t. I’m lost.
Charity: And we started getting some data sets into [inaudible 00:06:46] after a year or two of Facebook. And just the amount of time, it used to be kind of, “How long will it take us to fix this? I don’t know. It’s different every time, days, hours.” It dropped like a rock to seconds, not even minutes. But every time, with no knowledge of what the problem is going to be, we could just follow the trail of breadcrumbs to the answer every time. We didn’t have to guess or jump to the end, and that blew my mind. We recovered so many developer hours and ops hours once we just were able to stop firefighting and find the solutions and that really, really made an impact on me. I was like, “I can’t go back to not having this.” I would be so, so much less powerful as an engineer. It’s just unthinkable.
Rob Zuber: One of the other things that I’ve been thinking about a lot lately is just the evolution of CircleCI over the time that we’ve existed, has been driven a lot by changes in just software development. There was no Docker when CircleCI was started, so the idea that I’m just going to break this down into a bunch of services and manage it, that was for a certain class of company. Now it’s democratized or whatever word you want to use, which kind of means, “Here, we gave you all this complexity that you don’t understand, please enjoy using it.” And so, do you think that’s driven the… When I think about what’s most interesting to me inside of Honeycomb, it’s the ability to see traces and spans and how things are flowing through the system, which is a complex problem. Has that been a driver of a need and adoption?
Charity: Yeah, no doubt. It absolutely, I think, was driven by two things, by the old microservices, like you said, the wide scale adoption of microservices and the inexpensiveness of hardware. You think back, why did we have metrics? Because that’s the smallest amount of space you could possibly use to get something interesting about your software. It’s just a number, just a number with a couple of tags, right? You discarded everything else about what’s happening. You just kept the number.
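To make the contrast concrete, here is a minimal sketch in Python of the difference between a pre-aggregated metric and a full request event. It is an illustration only, not how Parse or Honeycomb stored data: the field names (`app_id`, `endpoint`, `duration_ms`, `rows_scanned`) and the helper `total_by` are hypothetical.

```python
from collections import Counter

# Hypothetical request events; every field name here is an assumption
# made for illustration, not a schema from the conversation.
events = [
    {"app_id": "app-1", "endpoint": "/query", "duration_ms": 12, "rows_scanned": 40},
    {"app_id": "app-2", "endpoint": "/query", "duration_ms": 950, "rows_scanned": 2_000_000},
    {"app_id": "app-1", "endpoint": "/save", "duration_ms": 15, "rows_scanned": 1},
    {"app_id": "app-2", "endpoint": "/query", "duration_ms": 1100, "rows_scanned": 2_500_000},
    {"app_id": "app-3", "endpoint": "/query", "duration_ms": 9, "rows_scanned": 25},
    {"app_id": "app-1", "endpoint": "/query", "duration_ms": 11, "rows_scanned": 30},
]

# Metric-style: a number keyed by a tag chosen up front.
# Everything else about each request is discarded at write time.
requests_per_endpoint = Counter(e["endpoint"] for e in events)

# A "top 10 by volume" view points at the busiest app, not the costly one.
busiest_app = Counter(e["app_id"] for e in events).most_common(1)[0][0]


def total_by(records, key, value):
    """Sum `value` grouped by `key`; both are chosen at query time."""
    totals = {}
    for record in records:
        totals[record[key]] = totals.get(record[key], 0) + record[value]
    return totals


# Event-style: the full record is kept, so a new question can be asked later.
time_by_app = total_by(events, "app_id", "duration_ms")
slowest_app = max(time_by_app, key=time_by_app.get)

print(busiest_app, slowest_app)
```

In this toy data the app with the most requests is not the app consuming the most time, which is the situation described above: a ranking by volume would not surface the prime mover, while grouping the raw events by a different field does.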
Rob Zuber: The way that you’ve solved the problem is very cool, but you solved it and now everybody can use it as opposed to everybody saying, “Okay, now we need a team of engineers to go sort out how we’re going to deal with all this data and age it out because we can’t afford to keep it in a database.” Well, I love that story also because I think it’s very reflective of your personal, I’ll say obsession with databases, having been on the wrong end of some poorly designed databases. So taking that in really thinking through, how do I do this in a reliable way? Data stores for so many folks remain this single point of failure or complexity that we’re doing [crosstalk 00:09:27] we’re doing scheduled down times because we don’t have…
Charity: I’ve spent my entire career telling people, “Never write a database. Never write a database.” We were lucky. We were lucky to exist. We could very well have gone under a couple times. We found some of our investors who were a little bit more patient, but we very nearly got wiped out of existence because there’s such a steep, steep curve to doing anything novel in infrastructure because it takes so long to do something different. [crosstalk 00:09:59] There are the things that… Go ahead.
Rob Zuber: There’s such an interesting point in there about business building, which is certainly, I didn’t think we were going to talk about business building, but let’s do it.
Rob Zuber: What’s really important is understanding what your unique differentiation is. So that particular space, the space that you’re in, if you conclude, acknowledge, however you want to think about it, the thing that’s going to allow us, as you said, to be uniquely differentiated in this space is if we actually make the investment in data storage, because that’s the hard problem here, then that is actually creating a huge amount of value. It takes a little bit of a leap of faith.
Charity: It’s longer.
Rob Zuber: If you look at the, I don’t want to go too far down this path, but the history of tech companies in the valley or whatever, the amount of hard tech that we’re building now is as a percentage, way, way lower…
Charity: Quite small.
Rob Zuber: … than we used to. People are building self driving cars and so there’re interesting problems, but a lot of it is, I’m stringing together a few pieces of interesting technology that someone else has built in a new and novel way.
Charity: Which is the right thing to do. If you can do that, you absolutely should. Have you heard of the concept of innovation tokens? It’s the idea that as a startup, say you’ve got two innovation tokens and you should spend them wisely on the things that make you different as a business and then you should do the most boring thing possible for everything else. You were asking about microservices earlier. I’m like, “If you can get your job done with a LAMP stack, by God, please do so.” The problem is, the reason we’re embracing all this complexity is not because we want to, it’s because increasingly we have to in order to do what we need to do.
Rob Zuber: Yes, yes. I think we’ve taken on… This is a whole other area, but I think there are plenty of places where we’re taking on way more complexity than we need because that’s what we’re told that’s how we find…
Charity: Oh sure. But we don’t know until we find it with our face.
Rob Zuber: I think that’s exactly, exactly right. Well I think that all of that is fascinating. I think that again, the perspective of what’s uniquely interesting or novel that we get… I mean, we spend a lot of time, or I spend a lot of time telling people, “Of course, you should never build a CI and CD platform. We do, but that’s our job. That’s the one thing that we do. That’s the place where we’re going to innovate.”
Charity: It’s our one job.
Rob Zuber: “And so please, please do less of that and do more of your business.” And I’m sure you’re in the same space. I will solve this data storage and large scale data processing problem for you so you have the insights to manage your system, deliver great value to customers. So let’s talk about CI and CD a little bit. I know that’s not exactly your main focus, but at the same time, it’s very tangential, I think, in that we’re all, I believe, in service of trying to get more of that business value into the hands of customers, however you want to think about it.
Charity: We’re trying to do it better.
Rob Zuber: Trying to get folks…
Charity: We’re trying to help people do it better.
Rob Zuber: Right. I think that how we deliver and then how we understand our production environments again, tightly correlated. So, what is it that, when you’re thinking about CI and CD again, you’ve said a couple pretty interesting things like, “A, this is critical, have it,” but B, I think your time cap was 15 minutes. Anything over 15 minutes is a waste of everybody’s time or is a risk to your business. So tell me a little bit about just what drives that thinking and what experiences led to that perspective.
Charity: Sure. In two minutes or less, right?
Rob Zuber: Sure.
Charity: I can rant about this all day. It came out from the perspective of, of course, I want my company to succeed and I started realizing that one of the biggest obstacles to its succeeding is the long delay times. If it’s going to be a month until you see your software, then it doesn’t matter how well you can instrument it for real time responsiveness. It doesn’t matter. There’s this virtuous loop that we were born with as software engineers. We were born with the, we write some code and we look at it. We write some code and we check it out in production. The writing and the making sure the writing was done well, are the same thing. You don’t just write a novel and never read it, give it to someone else to read and edit. It’s the same thing. That was a bad way to ever break… We’re always looking for ways to be efficient and specialize in everything but that was a bad way to break things up. We should have never built that wall in the first place.
Charity: If you look at, for engineering, software engineers, software engineering team, they spend so much of their time waiting on each other. I think about the software engineering death spiral, where if it takes hours for your stuff to get out, the right way to do it, in my opinion is start there. To whenever an engineer merges their changes to main, it should automatically kick off a CI/CD pipeline run to produce an artifact which should get deployed to production in 15 minutes or less so that you can develop memories and engineer your instrument and your code as you’re writing it.
Charity: And you’re looking, how is my future self going to understand this merge? And then you go look. And you ask through the lens of your instrumentation, “Is it doing what I expected it to do? Does anything else look weird?” I think of that as observability driven development. CI/CD, but it encompasses the real world too. The big, the mammoth thing that I think is important here is, number one, how much time people spend doing the stuff that sucks the life out of you? It’s not the good stuff of engineering, it’s the terrible stuff. It’s the frustrating stuff. It’s the toil. All of that goes up and up the longer your build pipeline is.
Charity: And number two, just the sheer wastefulness of that. Ever looked at a company and like, “Oh, cool product. Oh my God, they have 800 engineers. What do they all do?” I guarantee you, they have very long and torturous CI/CD pipeline. So I was just thinking about all this and I was like, there’s all this advice out there for people about how to make [inaudible 00:16:21] better, how to do your [inaudible 00:16:22] better, how to do all this stuff. And they’re all just batting away at the symptoms, when if there’s one pressure point that you can start at that makes things flow correctly for everyone else, instead of fractal badness, it starts right there with that [inaudible 00:16:40] from when you’ve written the code and when the code is live and making it that virtuous feedback loop that you own.
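The 15-minute target from merge to production can be tracked with very little code. The following Python sketch assumes you can export a merge timestamp and a deploy timestamp for each pipeline run; the sample data and the names `runs`, `lead_time`, and `BUDGET` are hypothetical and do not come from any CircleCI or Honeycomb API.

```python
from datetime import datetime, timedelta

# The merge-to-production budget named in the conversation.
BUDGET = timedelta(minutes=15)

# Hypothetical (merged_at, deployed_at) pairs in ISO 8601 format.
# A real setup would pull these from the CI/CD system's own records.
runs = [
    ("2021-09-17T10:00:00", "2021-09-17T10:11:30"),
    ("2021-09-17T11:05:00", "2021-09-17T11:47:00"),
    ("2021-09-17T13:20:00", "2021-09-17T13:33:10"),
]


def lead_time(merged_at: str, deployed_at: str) -> timedelta:
    """Elapsed time between a merge to main and the deploy that shipped it."""
    return datetime.fromisoformat(deployed_at) - datetime.fromisoformat(merged_at)


lead_times = [lead_time(merged, deployed) for merged, deployed in runs]
over_budget = [lt for lt in lead_times if lt > BUDGET]

for lt in lead_times:
    status = "over budget" if lt > BUDGET else "ok"
    print(f"{lt} {status}")
```

Watching the share of runs that land over budget, rather than a single average, keeps the occasional very slow run visible.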
Rob Zuber: Well, I’m a fan of this theory, unsurprisingly, since I spend all of my days trying to solve this problem for folks. I think one of the things that I was thinking there as you were describing is, I think I just got a strong lesson in queuing theory as well. It’s not just the 15 minutes, but how big your org is and how many things are backing up. And when they start to back up, how long does it take to clear that out? And so development flow and how it associates with that is absolutely huge.
Charity: You can see how it flows out and it magnifies it every step. It gets bigger and harder and more complicated and more confusing and more waiting at every step.
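The queuing effect mentioned here can be illustrated with the simplest textbook model. The sketch below treats a pipeline as a single-server M/M/1 queue, which is an assumption made purely for illustration (real pipelines run jobs in parallel and arrivals are not random in this way). Under that model the mean time a change spends waiting plus building is 1 / (service rate - arrival rate). The function name `mean_time_in_system` is hypothetical.

```python
def mean_time_in_system(arrivals_per_hour: float, service_minutes: float) -> float:
    """Mean minutes from enqueue to completion for an M/M/1 queue.

    Assumes one worker, random (Poisson) arrivals, and exponentially
    distributed build times: a simplification, not a model of any real CI system.
    """
    service_rate = 60.0 / service_minutes  # builds completed per hour
    if arrivals_per_hour >= service_rate:
        # Work arrives faster than it can be cleared; the queue never drains.
        return float("inf")
    return 60.0 / (service_rate - arrivals_per_hour)


# A 15-minute pipeline under increasing merge rates.
for merges_per_hour in (1, 2, 3, 3.5):
    minutes = mean_time_in_system(merges_per_hour, service_minutes=15)
    print(f"{merges_per_hour} merges/hour: {minutes:.0f} min from merge to done")
```

Under these assumptions, going from 2 to 3.5 merges per hour takes the average from 30 minutes to 2 hours even though the pipeline itself still runs in 15 minutes, which is the "backing up" effect described above.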
Rob Zuber: I would imagine there are organizations, I can’t think of any of our customers, which is probably a good thing, but I imagine there are organizations that are waiting for the next morning and hoping that the queue is finally cleared before people get back.
Charity: For sure.
Rob Zuber: A lot of organizations are not all working in the same office anymore, because well, a lot’s changed recently. But hoping that it clears before they start piling more into there and I can’t even imagine what that would do for productivity. So I think you also tend to identify other issues. When you say, “Wow. Why can’t we get our CI pipeline or CD pipeline down below 15 minutes?” Well…
Charity: Good question.
Rob Zuber: Well, this piece of testing is way too hard to get done in 15 minutes. And you stop and say, “Is that okay? Are we comfortable with something that’s so complex to test, or do we need to find another way to build this? How can we parallelize it?”
Rob Zuber: Well, this test has to happen before that test, which has to happen before that test. Well, that’s a problem. Let’s stop and think about that. I think it pushes design, in the same way as test driven development kind of gives you, “If I have to think about how this thing gets tested, let me think about how it’s designed in the first place.” And I think that goes all the way out to observability driven.
Charity: Exactly. Totally.
Rob Zuber: And I guess, I would say the same thing about observability. If it’s difficult to understand what’s happening in your system and you can’t instrument it, [crosstalk 00:18:45] that should be a little alarm bell should be going off saying…
Charity: That’s a huge problem. Totally.
Rob Zuber: Let’s dig in and see if we can make that.
Charity: I feel like the fact that this… We’ve known this for how long, 15 years? And we’ve had the tools to do this well for at least five. I feel like there’s a real failure of leadership here. What is an engineering manager or director or senior engineer? What is their one job? Their one job is tending to the sociotechnical systems of people and tools and production and you can’t solve any one of those on their own because it’s a feedback loop. Everything you do has an impact on everything else. We are the ones who are supposed to understand that and make sure that time is carved out to work on these problems. That it’s not just feature, feature, feature, feature, feature, but that we’re loudly advocating for and making a strong case for, “No, this is in the business’s best interest. No, this is in everyone’s best interest.” This is why I’m trying to calculate the cost in dollars. And how wasteful is this of our engineering? The scarcest resource in your universe is probably engineering cycles and you’re wasting it on this garbage?
Rob Zuber: Right. How would you sort of build a community of understanding around how to model and think about these things in an effective way that’s dollar oriented, business return oriented?
Charity: So how can you measure an engineer’s productivity? I’ll agree that’s literally impossible and you shouldn’t try, except in the broadest of terms and not pegging it to any numbers or statistics, making it a combination of 24/7 feedback, knowledge over time. This is why we need highly technical engineering managers, so they can’t get snow jobbed. Even then, it’s going to be imperfect and we should proceed with humility and caution. I think it’s even more true of organizations made up of those people.
Rob Zuber: Yeah. Well, one thing that you said in there that I think is really interesting is trying to measure an engineer’s productivity. And I think much like the systems we’re building, it’s not usually one, it’s the system of engineers. And what can I learn about the system of engineers?
Charity: That is an essential point. I do think that you can measure teams better than you can measure individuals. I think that every manager should have a graph or a dashboard with the Accelerate metrics for their team, as well as a fourth one, which is, how often are we getting paged outside of hours? And you should know the stats for your team, for your product. You shouldn’t examine them every day and be too reactive about it, but it’s good to know if the numbers are increasing or decreasing. It’s good to know what your SLO is and if you’re a match or not. It’s good to know. You can tell how a team is doing pretty well from those metrics, but it tells you nothing about the individuals and it’s really toxic if you try to translate it.
Rob Zuber: Yes. A hundred percent. I think, well, one thing it might tell you is that you want to check in with some of those individuals, because you’re like, “Wow, you’ve been paged out of hours every day for the last seven weeks. How are you doing?”
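As a rough sketch of what such a team-level dashboard could compute, the Python below derives the key metrics described in the book Accelerate (deployment frequency, lead time for changes, change failure rate, time to restore service) plus a count of out-of-hours pages. The `Deploy` record, the sample data, and the definition of "out of hours" (weekends, or before 09:00 or from 18:00 onward) are all assumptions for illustration. Nothing here is per-engineer, in keeping with the point that these numbers describe teams, not individuals.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """Hypothetical record of one production deploy for a team."""
    merged_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only when a failed deploy was fixed


def is_out_of_hours(ts: datetime) -> bool:
    # Assumed working hours: Monday to Friday, 09:00 to 17:59 local time.
    return ts.weekday() >= 5 or ts.hour < 9 or ts.hour >= 18


def team_metrics(deploys, pages, days):
    lead = [(d.deployed_at - d.merged_at).total_seconds() / 60 for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / days,
        "median_lead_time_min": median(lead),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore_min": median(restore) if restore else None,
        "out_of_hours_pages": sum(1 for p in pages if is_out_of_hours(p)),
    }


deploys = [
    Deploy(datetime(2021, 9, 13, 10, 0), datetime(2021, 9, 13, 10, 12)),
    Deploy(datetime(2021, 9, 13, 14, 0), datetime(2021, 9, 13, 14, 10),
           failed=True, restored_at=datetime(2021, 9, 13, 14, 40)),
    Deploy(datetime(2021, 9, 14, 9, 30), datetime(2021, 9, 14, 9, 44)),
    Deploy(datetime(2021, 9, 14, 16, 0), datetime(2021, 9, 14, 16, 20)),
]
pages = [
    datetime(2021, 9, 13, 14, 15),  # Monday afternoon
    datetime(2021, 9, 14, 3, 0),    # Tuesday, middle of the night
    datetime(2021, 9, 18, 12, 0),   # Saturday
]

metrics = team_metrics(deploys, pages, days=2)
print(metrics)
```

The useful signal is the trend of these values over weeks, not any single reading.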
Charity: Dude. That’s not okay. I believe that there’s a compact between. When we blew up the monolith to microservices, and now everybody has a little more ops in their job title, I do believe that any engineer who works for a highly available 24/7 system should be on call for it, should be available for it. But I also believe that it’s management’s duty to meet that with, it’s not going to suck and we will dedicate the resources to fixing these systems so that it doesn’t suck, so you don’t have to plan your life around it. I think it’s reasonable to ask any of these engineers to wake up, to fix their services once or twice a year, maybe three times, as long as they don’t have a small child also waking up in the night. Nobody should have to carry two pagers. I think that’s reasonable, but I don’t think it’s reasonable to have someone getting woken up every week, let alone every night and your manager should fucking know that if it’s happening.
Rob Zuber: I actually love, there’s always concrete numbers in conversation with you. I will admit that I was trying to be overly emphatic on the nightly, but absolutely. What should that look like? What is our contract or understanding as an organization about what’s acceptable and what’s not? I think is a really important thing for everyone to be evaluating.
Charity: And I say that because I’m trying to get… I’m fine with throwing out bombs. You can’t gen… Honeycomb gets paged way more than that. We’re a startup. It’s up and down, there’s chaos, but we try to couple that with empathy about take the next Friday off. If you’re on call take the next Friday off, this is a matter of course. And we do really care. We take it as seriously as a heart attack when somebody gets paged out of hours. Everyone will jump on, “We will fix it. We’ll take time out of product development. I can’t promise it’ll be once or twice a year, but I can promise we’ll take it very seriously and dedicate real engineering hours to fixing it.” So you do what you can.
Rob Zuber: Well, I hear you. I’ve been primarily a startup person for most of my life and I’ve certainly been that person carrying the pager and doing the customer support and going out and selling and building product, but in those cases I signed up for it.
Rob Zuber: That was me building a company. So it changes.
Rob Zuber: The contract changes as you grow and as you build out teams and…
Charity: It can be so fun if you chose to do it, if you know it’s your choice, you know that you’re rewarded and you’re valued for it, if all these things, it can be a great period of your life, you just throw yourself on this, but it’s not for everyone and you shouldn’t force it.
Rob Zuber: Cool. So as you continue to evolve and continue to build out the product, as software and how we build it continues to evolve, what are you excited about? What’s coming up that you’re really looking forward to?
Charity: I’m pretty excited about OpenTelemetry. History’s littered with the gravestones of prior attempts at this. Whether it’s OpenTracing or all of the other… I didn’t have high hopes for it, but I am seeing a lot of adoption. It’s everywhere. It’s not just in Silicon Valley. We’re talking to a lot of teams who are going all in on it. And the thing is, this is really good for users because I think it’s overdue. We should have had more of a telemetry standard a long time ago, but we have it now. Great. Cool.
Charity: It means that if you instrument your systems with OpenTelemetry, you should be able to move from vendor to vendor or try different vendors out without having to re-instrument anything. It should be as simple as a config line change, which means so many people are using their vendors because of lock-in, or because of the incredible friction and pain of redoing all that work again for marginal benefit, which lets vendors be really lazy. I like this because it will let people try out other tools without having to deal with all the sunk costs. And this will make vendors compete on better experience for users, better tooling instead of just treating them like a given, so it’s good. It’s genuinely catching on. It’s like fire and that’s really exciting to me.
Rob Zuber: Right on. I like it. Well, thanks again for joining. It’s awesome to get your perspective on so many different things. Congratulations on all the success in continuing to grow at Honeycomb.
Charity: Thank you.
Rob Zuber: I think you’re doing awesome stuff. We continue to be excited customers…
Rob Zuber: … and use it for all kinds of things. And thanks everyone for tuning in and listening to this episode. If you enjoyed this podcast, share with your friends, subscribe at your local podcast provider of choice. And if there’s someone you want us to talk to or something you want us to talk about, hit us up on Twitter at CircleCI. Thanks again, Charity.
Charity: Thanks so much for having me.