Releasing software with full control
May 14th, 2021 | 38 minutes
Olaf has over 20 years of experience in the internet industry in various technical, architectural, and managerial roles. As co-founder and CTO of the cloud-native release orchestration platform Vamp.io, Olaf has developed an encompassing vision of where the DevOps space is moving. With CircleCI’s acquisition of Vamp, Olaf is excited to expand the already award-winning and proven features of CircleCI, so that releasing features to users with confidence becomes a business-as-usual event.
Rob Zuber is a 20-year veteran of software startups, a four-time founder, and three-time CTO. Since joining CircleCI, Rob has seen the company through its Series F funding and delivered on product innovation at scale while leading a team of 300+ engineers who are distributed around the globe.
Rob: Hello, and welcome to The Confident Commit, the podcast for anyone who wants to join the conversation about software delivery. This is episode two, and today we’re talking to Olaf Molenveld, the founder and CTO of Vamp. And most recently, as announced earlier this week, a member of the CircleCI organization. I’m really excited to be here with Olaf and talk about everything that they’ve done, building out release orchestration, the journey to this point, and then joining CircleCI and what the future holds. So welcome, Olaf. Thanks for joining me today.
Olaf: Thanks, Rob. It’s a pleasure to be here, and yeah, looking forward to having a nice chat and then seeing what we can work on together.
Rob: Yeah. I’m sure this will be the beginning of many or the first of many, many chats and excited to be doing it as part of this podcast. So let’s start at the beginning or maybe, I don’t know how far back we should go, but certainly, tell me a little bit about Vamp and about starting Vamp and in particular, the problem that you saw, when that was, what was happening in the market and sort of what led you down this path?
Olaf: Yeah, that’s a great question. So actually, the initial version of Vamp was an e-commerce platform. We kind of pivoted from a microservices, headless kind of e-commerce platform, which was called Magnetic, when we started seven years ago. And I think it’s kind of akin to what Moltin and commercetools are doing these days. And we had to build an internal engine to handle the lifecycle management of all these microservices. So basically, how do you release them? How do you scale them? How do you make sure you’ve got zero-downtime upgrades? At that time, we basically built an internal engine, which was called Vamp, the fairly awesome microservices platform, as we jokingly called it. And it kind of stuck, it seems. At that time we did those things with Docker containers, which wrapped around the microservices, running on Mesos’ Marathon, an orchestration and cluster manager.
And yeah, basically handling the SLA-based scaling, handling the release and the automation of the release. And this engine seemed to really strike a nerve with people, because we started demoing it and people were like, “Wow, this is something. I’m not really interested in the e-commerce platform, because we are a bank, or we are an airline, but this thing where you have control over the management of these microservices…” And we had these little sliders in there where people could actually set their percentage. And remember, this was seven years ago. Kubernetes was not on the radar. Istio was not on the radar. And people were like, “Wow. So I can grab the slider and move it from left to right.” So that was kind of how we began.
Rob: I love it. I love that the slider is the core to that whole story. But I think that what I take away from that, that’s so interesting, is this level of complexity that we were all starting to introduce around that time, right? Moving to microservices, containerizing those services, and starting to lose sight of what was really happening. And to me in that story, the slider is this representation of simplicity, right? Oh, now I have this very clear representation of what’s happening here. It’s not YAML. It’s not a Helm chart. It’s not some really complicated thing. It’s just, I want a little more traffic to go over to this version, or whatever. And so you talk a lot about release. Can you give me, from your perspective, a description of what release means to you, and how that relates to deploy, getting stuff into the environment, and then how it’s accessed by customers?
Olaf: Yeah, yeah. Yeah, that’s very interesting, because we really make a distinction between deploying, which we see as a technical thing, getting something on infrastructure and making it run, and releasing, which is more of a process, more of how you expose something to the consumer, which can be another service, or can be an end-user, or a UI or an app that consumes something. And I guess it comes from the fact that we have a background with e-commerce platforms and content management systems, where you work with more business-oriented people that want to control stuff themselves, and where the UI and the experience are crucial to abstract away the technical things, the configuration YAML or JSON, toward the essential question: what am I actually doing?
Rob: So you talked a little bit in there about some evolution, right? Early days and what was interesting to you, and then the environment that you were operating in, right? A lot has changed in the last seven years. You talked about Mesos as the kind of initial platform that you were working on. And I feel like 2017, because I remember saying it a few times, was the year that basically Kubernetes obliterated all of the competition from a container orchestration perspective. And so what else changed in that environment? I mean, you had a core thesis, right? I guess, around release and managing that as a separate piece of the overall process, but a lot of things were changing in this space. What else impacted how you thought about doing release and what you needed to do to meet the needs of customers?
Olaf: Yeah, I think… Yeah. I mean, load balancers and proxies were, in the beginning, layer-four focused, used to technically load balance things. And then layer-seven load balancing became a little bit more apparent, where you can do protocol-based segmentation on HTTP, or request headers, or agent types. And then all of a sudden it started to become more of what we are already used to from analytics and A/B testing, where you could say, I create a segment or a bucket of users, maybe mobile Chrome users in Germany, and I want to expose 5% of those users to my new API version or my new UI version, and I’m going to observe what the effect is of that exposure, so I can technically, or business-wise, compare.
And I think that realization, that the tooling was already there, but basically how you apply it to your process, and how you focus it a little bit more on validating whether the thing actually works as it should, and that you can have these capabilities, really became apparent. Of course, then the big challenge was how to technically make this usable, how to do these things in a friendly way. I guess that’s what happened around that time.
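The segmentation Olaf describes, exposing 5% of mobile Chrome users in Germany to a new version, can be sketched in a few lines. This is an illustrative sketch, not Vamp’s actual implementation; the user-agent check, the SHA-256 bucketing, and the 5% default are all assumptions:

```python
import hashlib

def in_canary_segment(user_agent: str, country: str, user_id: str,
                      target_country: str = "DE", percent: int = 5) -> bool:
    """Decide whether one request is routed to the canary version."""
    # Only mobile Chrome users in the target country are candidates.
    if "Chrome" not in user_agent or "Mobile" not in user_agent:
        return False
    if country != target_country:
        return False
    # Hash the user id into a stable 0-99 bucket, so the same user
    # always lands on the same version across requests.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Hashing the user id, rather than rolling a random number per request, keeps each user pinned to one version, which matters when you want to observe the effect of the exposure over time.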
Rob: Yeah. I mean, I think that you’re absolutely right. The tools were there, but probably a bunch of pieces and part of what you ended up bringing to the table was the ability to manage those in a sort of simpler, cohesive way. I talked about simplicity, but you talked about sort of demoing this, like we’re building an e-commerce platform, which by the way, is the story of so many developer tools, I think, right? We had to build a really interesting piece of technology in order to solve what we thought was a business problem, but the technology ended up being more of a standout. And so that ended up being what we focused on.
And so I think that part’s really interesting. But as you showed it to folks, right? You mentioned banks, airlines, whoever you were talking to, what was sort of the problem that they were facing that it felt like this was really a great solution for? And you talked about progressive rollout or progressive delivery. I mean, were they introducing too much risk in the environment, or what was it that they really saw? Where did the light bulb go off?
Olaf: Yeah. When we, as an industry, started introducing this concept of microservices, effectively creating distributed applications, the complexity really increased. So it became, even more than before, a technical play, also with Kubernetes and all these abstraction layers. So basically, the people responsible for delivering functionality to end users were kind of pushed out of the loop, maybe not on purpose, but they really felt like, okay, what’s happening here? They were asking, “Is it ready yet? Can I call my client and let them… Or can they click around and see?” So yeah, the complexity really increased, and I guess that really forced everybody to ask: how can we make this work together again and make it cohesive, or coherent? Can we take those separate pieces and make them work together as a whole, as something that’s actually useful?
And also a more experimental mindset, where you can try out things without having a full-on production issue, which I think is also crucial. Because the idea of slicing up a monolith is that you can move faster and do smaller and smarter iterations. But what we were seeing was that people were still doing monolithic deploys, basically. They were packaging all these little services, with versions that depend on each other, and then making one big release again. And to my mind, that’s the big challenge. That’s where the big value lies: if you can really slice this out, and then push little incremental updates on little pieces, and validate whether the effect is actually what you expect it to be.
Rob: Right. Right. I think that that makes a ton of sense to me in that doing a monolithic deploy as you described it, or a monolithic release, let’s say of, we know all of these versions to be good together. Now, we are going to upgrade all of our production environment to be at these versions is an attempt to solve a problem of complexity. Right? It’s difficult to reason about the state of my system. If I think about what the application is, it’s no longer a monolithic code base, right? It’s comprised of all these services and therefore getting my head around as a consumer of that, maybe not the end user, but a stakeholder who’s trying to understand. As you said, can I tell my customer now that this thing is available?
I don’t know, because of all these different moving pieces. So now we’re going to say, “Okay, we shipped this, here’s the release notes for this stamped version of a hundred services or a thousand services or whatever that might be.” Again, an attempt to address complexity, but addressing it by trying to almost put it back in the box, in just a strange way, versus accepting that our environment is now fluid, with changes being introduced in all these different ways. And how can I manage for that? How can I give a view that’s simple for people to understand, right? And to know that they’re releasing things in a safe way, but do it in a way that is safe to make those changes without coupling everything back together. Right? I mean, I’m pretty sure a big reason that a lot of us have gone down the services path is to break apart coupling between teams and parts of the organization and allow more rapid delivery, but to lock-step that all at the end and then put it all out at once [crosstalk 00:13:27] really impacts that.
Olaf: Yeah, I guess it also has a little bit to do with the fact that there are still silos. There are the operational responsibilities, which are kind of like, don’t change too much, let’s protect uptime and security. And then there are the developers or engineers that need to develop new functionality. And you saw the shift toward self-service for software developers, feature developers, with Kubernetes and those layers, but still, it needs to end up in some operational environment. And there’s a ton of technology there that can be leveraged to make this releasing process more risk-free and do it smarter. But yeah, if these departments don’t collaborate, don’t work together, or have conflicting targets, then there’s a huge missed opportunity there. And I think that’s kind of where we are in this space, where we say, okay, there are load balancers and proxies and all kinds of things that we can leverage to stretch this deployment process into releasing and start doing it more risk-free, without having to touch flags in the code, or combine those things.
Rob: Right. Well, interesting that you said risk-free in there. I mean, I’m a big proponent. I think we’ve spoken enough that I know where you stand on this, that you’ll never eliminate risk. Right? So you have to kind of embrace it and then think about how do I mitigate it, right? How do I make the risk level appropriate for my business, for this particular piece of the business, whatever that might be? And I think that managed release or release orchestration is a big part of that, right? I am going to ultimately discover some kind of issue in production. That’s the thing that’s going to happen, right? Which we all should just accept. That’s a thing that’s going to happen. I think we have all accepted it. We just haven’t quite figured out what to do next. Right? And so, I don’t know if I remember your example correctly, but 5% of German Chrome users, right? Are going to get… Are going to be directed here. Whatever it is I’m trying to account for, let’s make that a small slice.
And then when I see that something’s going wrong, I can back that out, right? Effectively, instantly, and I’ve shut it off. So I’ve reduced the total number of users that are impacted. I’ve reduced the duration of the impact. And I have new information that maybe I was just not able to gather in a testing environment. Right? Obviously at CircleCI, we’re big fans of validating everything you can before you put it in production, but we all know there’s going to be an issue out there. So I think that model of risk management, I’ll call it, and accepting that it’s never without risk, but it’s sort of the tolerable amount of risk for your organization and for that particular application, right? I mean, you mentioned airlines. I’m assuming you have some history with them. And I would guess in an airline there are areas where I’m comfortable with risk and areas where I’m not comfortable with risk, right? Or the kind of level of risk would adjust to that particular application within that business even, if that makes sense.
Olaf: Yeah. Yeah, yeah, yeah. I mean… Yeah. Testing in production is more of an engineering kind of thing, a development kind of thing. But if you talk to more business roles, especially in finance and airlines and the like, testing in production is not an option. Failing is not an option. We need to test everything pre-deployment. And yeah, we as engineers know, like you say, we cannot test everything. And especially with risk and compliance and privacy concerns, the data from production is always an issue. You cannot copy it or move it around even if you want to. So how do you address this kind of reality? To be honest, we started talking about not failing fast, but verifying fast, or validating fast, which is more of a positive approach. But yeah, you need to have the safety nets in place.
That’s the thing. I think there’s this quote, I think Joe Itow said it: “If you want to increase experimentation, you have to reduce the cost of failure.” So that’s what you need to do. It’s like if you drive a fast car, you wear a safety belt. You put these things in place, and then you’re not going to risk everything. And yeah, the moment these components are in place, the mindset also changes, and people can go with smaller iterations. And like you say, you will never not test before you release or deploy, but maybe you can move a little bit faster and accept the fact that the production environment is the final, real test, where the real production data and the real user patterns exist, things that you cannot predict. It’s the real world. People do strange things.
Rob: Yeah, absolutely. I think the key thing that I always take away from this conversation, and you mentioned, obviously we’re talking about risk and cost, is looking at it from the other side, right? Even with your financial models and private data, I can probably do a lot of work and create something that looks very much like the production environment, or closer and closer, to some asymptotic point, but every level of investment costs me more. And at some point, the risk or cost that I’m avoiding is lower than the risk or cost that I’m creating. Because not only is it costing me time and energy, but there’s real opportunity cost, right? I’m not moving as quickly. I’m not innovating. I’m not experimenting on the next thing, because I’m trying to get the last thing perfect.
And so it’s very business specific, right? I mean, I can’t speak to financial markets, for example, although I’ve certainly run into some friends who have made very expensive… shipped very expensive bugs, I will say, at a scale that the markets themselves would not allow the trades to occur, that sort of thing. But I mean, we know there are the classic stories of funds that have shut down in minutes because of bugs in software. And so it’s about understanding that and then comparing it, right? How quickly can I ship new algorithms? How quickly can I adapt to the changing state of the markets with the code that I’m shipping, if everything has to be perfect? And what’s the trade-off on the other side? And if I’m SpaceX, probably my risk threshold is different than if I’m building a blogging site, right? And really understanding that as your business, I think, is important. And then being able to use tools like this to manage to your risk and comfort level.
Olaf: Yeah. I think you hit on a few interesting things there, like risk appetite. Like you said, some services or features are less risky than others. And yeah, there’s the business case behind it: does it actually make sense? I think we’ve all been there, where there’s so much effort and time spent on certain functionality that it basically becomes a self-fulfilling prophecy. Nobody’s asking anymore, does this thing actually work? What’s the ROI on this thing? We already spent months testing and debugging this, so it’s got to be a success. If you can move faster, that makes total sense. And also, having the option of what we call the big red button: if you see something failing, you can move the slider slightly back, press the red button, and fall back to what’s already there. That is super powerful, because yeah, you’re in control.
And then the system also tells you, okay, for X amount of seconds, this segment of users, this amount of requests, got maybe a dodgy version, and you have clear visibility on the impact. And I think people are not so much wary about things not being successful or failing; it’s more that if things are unclear, if there’s a lack of visibility and a lack of control, then it becomes scary. But if you really have the big red button, and can see what the effect is, and you can segment or compare and know who’s being exposed, then that’s a different ball game. And then you can still decide whether you want to do it for a certain feature or service or not. Yeah.
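The slider and the big red button that Olaf and Rob keep coming back to can be thought of as a tiny traffic-weight controller. A minimal, purely hypothetical sketch, not Vamp’s actual control plane:

```python
class ReleaseControl:
    """Toy traffic-split controller: the 'slider' sets the canary
    weight, and the 'big red button' instantly routes all traffic
    back to the known-good stable version."""

    def __init__(self) -> None:
        self.weights = {"stable": 100, "canary": 0}

    def set_canary_weight(self, percent: int) -> None:
        percent = max(0, min(100, percent))  # clamp the slider to 0-100
        self.weights = {"stable": 100 - percent, "canary": percent}

    def big_red_button(self) -> None:
        # Rolling back is just rerouting traffic, so it is near-instant:
        # no rebuild, no redeploy.
        self.set_canary_weight(0)
```

The design point Rob makes holds here: because falling back is a routing change rather than a deploy, it is effectively instant no matter how fast your pipeline is.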
Rob: So we keep talking about moving the sliders and the big red button and stuff. So that’s all… I mean, that level of quick adjustment, right? To be able to say, dial this down to zero, because it turns out it’s not working the way we expected versus, oh my goodness. We need to push another build, which… I’m still a fan of being able to get a quick deploy out. But if you can just reroute traffic to a known good situation, that’s much faster, right? It’s going to be hard to beat no matter how good your deploy times are. But a lot of that is still talking about manual intervention and watching graphs. But I guess, how do you see the ability of folks to effectively automate that and say, under these conditions continue to progress the rollout, under these conditions or release and under these conditions slow it down or take it back?
Olaf: Yeah. Yeah. I think this is perfectly doable. I mean, the more data that is accessible to you, the more data that you can observe, the more valuable automation becomes. And of course, there’s anomaly detection, where you can compare to a well… a good running situation. There are all kinds of technical metrics that you can already deduce and use to see if something is running. And of course, because there’s this layer four, layer seven, there’s network information that you can also add to the equation. So yeah, I think basically it will end up as a multilayered kind of automated situation, where you as a human tell the system what you think good looks like, which is more like an SLO kind of thing, or SLAs, where you say, okay, we need to always be this fast, or within this amount of time. But in the end, what does that mean from a technical perspective?
What kind of data sits below that high-level layer? And I think, yeah, the system can basically figure it out by itself. It’s much better suited for it. And to be honest, I mean, SRE kind of thinking, with error budgets and all kinds of different types of metrics, from histograms to counters. It’s hard, even if you grasp what these things do. What do you need to enter? What time windows? Wow. You need to analyze the historical data anyway, so why not let the system figure that out?
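The automated, metrics-driven decision Olaf describes, where the system compares the canary against a known-good baseline and against an SLO, might look something like the sketch below. The metric names and thresholds are assumptions for illustration, not anything Vamp or CircleCI actually ships:

```python
def rollout_decision(canary_error_rate: float, baseline_error_rate: float,
                     canary_p95_ms: float, slo_p95_ms: float,
                     error_tolerance: float = 0.01) -> str:
    """Return 'promote' or 'rollback' for one evaluation window."""
    # Roll back if the canary errors noticeably more than the
    # known-good baseline it runs alongside (anomaly-style comparison).
    if canary_error_rate > baseline_error_rate + error_tolerance:
        return "rollback"
    # Roll back if the canary violates the latency SLO the human set.
    if canary_p95_ms > slo_p95_ms:
        return "rollback"
    # Otherwise keep progressing the rollout.
    return "promote"
```

A real system would evaluate this over sliding time windows and many metrics at once, which is exactly the tedious part Olaf suggests letting the machine derive from historical data.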
Rob: Yeah. Well, so as you were talking about that, I was thinking, I mean, I know first of all, Vamp has a lot of that data integrated, but at some point I think about it a little bit like I’ll compare it to self-driving cars. And my comparison is interesting for me because I still drive a manual. That’s probably more common in Europe. No one in the United States drives a manual car anymore. And so clearly, we’ll just call it control freak, or very interested in having control. And the notion… I like the idea of self-driving vehicles, but then I ask the question, at what point would I get in the back seat? At what point would I just completely leave it in the hands of automation to be hurtling me down the highway or whatever? I’m not sure. I’m not sure.
And so, it’s a little easier to reason about the idea of your car parking itself in the garage or something like that, which lower kind of impact. And so I wonder, at what point, as you said, you could let the system do the work, analyze historical data, say these are the kinds of things that we want to be looking for or even everything looks consistent with history, except for this thing right here. That seems weird. So we’re going to dial that back, alert someone and they can come investigate. Whatever that might be. But is there a point at which you think people stop watching? Do you know what I mean? Where it’s just like, we ship, stuff goes out, we know that we have good controls and if anything goes wrong, it’ll be taken care of. And so I can just get on with my next thing? Or are we still, I feel really good about this. I have the tools to make a change, but I’m just going to keep an eye on it while it goes out?
Olaf: Yeah. I think there’s… That’s a very interesting one. And I think I’ve got the same kind of thing with control. Trust is good; control is even better. And I think you need to build trust in the system, of course. At some point things work out and then you’re like, okay, this seems to work. But I think there’s a clear distinction between the more experimental kind of, let’s see how this will work, let’s try it out. And then you obviously want to sit behind the wheel. You want to control what goes on and observe very closely how these things work. Maybe not from a technical perspective, but more from a business or impact perspective, or a business case that you think might be improving. But there’s always, I think, this technical undercurrent that will also be at work.
And yeah, I feel that this is, we call it maybe the water or electricity thing, where at some point you accept or expect that this will just be working. And nobody wants this overload of information, that wave of noise and all kinds of things. So I think we can filter it down to a few very crucial things that you’re interested in, and basically observe those, and then only let the system alert you when these sorts of things become anomalies, when they become maybe a trend that goes in the wrong direction, and say, maybe you need to take a look. And for a lot of it, we don’t have any opinion, of course. We don’t know if it’s good or bad, but we can at least inform you, like, okay, something is starting to move. Hope that kind of makes sense.
Rob: Yeah, absolutely. So I want to take a few minutes as we’re winding down, just to talk about how all this fits together now. I mean, I’ve been excited about… Well, I was going to say I’ve been excited about risk for a long time. I think thinking about risk, I’ve been thinking about risk and thinking about cost and how that all plays into how we make engineering decisions about how we deliver software. And that’s fundamental to what we do at CircleCI, fundamental to what you’re building at Vamp. And obviously, we made the decision to put these pieces together. I think for all the reasons we’ve talked about, more inevitability or more understanding of the inevitability of issues in production and that being the final stage of validation, as you release things out to your customers, there’s a really clear connection between all of these pieces to me, ultimately at CircleCI where we think about the ability to quickly make change, right?
You’re trying to get things out into the hands of customers. And when we talk about software delivery, we talk about diffs, we talk about change sets. We’re thinking about stacking change on top of change. So how do I manage the flow of that, and any risk around that, and make sure it’s good? In my mind, you could have a green build from a CI and test perspective, but if it goes out into a production environment and starts to get used by users, and it doesn’t pass muster, then it’s not really green, is it? Right? There’s something wrong with that. And being able to tie that together is really fascinating. So with that very leading intro, tell me a little bit about how you think about tying those pieces together and what excites you about connecting this whole picture?
Olaf: Yeah. I think, and I agree, the move to distributed applications, with all the promise around it of scalability and more control, also brings a challenge of how you manage these dependencies. And typically, like you say, my build is green and my dashboard looks good, I’m done, I can continue with my next epic or whatever. But obviously, there’s this holistic thing, which is your application landscape, which is what your end users are experiencing. And I think it’s this combination of the technical focus and what my end user is experiencing. I spoke to a person once, and he made a really interesting remark. He said, “End users don’t care about five nines. If you want to order something at a certain website, you just want to order. It needs to be there and it needs to be snappy. And how do you know that you’re in that window of the five nines?”
So I think the holistic view of the entire application, and how we approach it, also from a technical perspective: my update, even though it technically looks good, how did it impact, in a positive or a negative way, the experience of the entire application for our end users? That kind of visibility? That’s really interesting to me, because it moves it from a technical or operational perspective into: we as an organization are serving our clients, and what is important for our clients, and what is the impact of what we do as a team? Maybe our update negatively impacted another service and hurt the user experience. I cannot ignore that fact and say, “Yeah, my dashboard is green. I did this.”
And again, it’s not on purpose that people work this way. It’s often because of a lack of tooling, of control, of observability. And I’m super excited about allowing teams to work more in this way. And I think that kind of holistic view of the entire application landscape and how it’s experienced by end users, I think that’s super interesting, because in the end, we’re all trying to build something for end users to experience, and that should really be the focus of what we’re doing as an organization.
Rob: Yeah. I love that perspective. I think that we’re so good in many ways at breaking down what is fundamentally this holistic user problem into tight little domains and sub domains, so we can work on our piece and have autonomy and all of those things that allow us to move quickly. But if we’re not, if we get disconnected from how the user is experiencing the system, to your point, the five nines, doesn’t matter if I’m in that tiny little window where I just need something right now and the system is not available, right? So I think that’s a great way of thinking about it. I think that a huge thing that we’ve leaned on as developers over the last decade at least is feedback, right? As we’ve broken down our units of work and ship things more quickly, ship them in smaller increments and get feedback faster. All of that has been great.
Part of that feedback is what happens out in production, right? It’s not just the green check that says, “Hey, my build passed. So I’m going to move on with my day. Everything’s great here.” Right? It’s like, how did customers experience this? Not just from a release perspective and was it fast? Did it… Was it error free or limited number of errors? Whatever. But are the business metrics good? Right? Do we understand that all of the things we broken down are ultimately proxies for the success of our users and can we make sure that we really see that success? And so I think… I know I’m super excited to continue to work on this. I’m excited to work on this together. I think it’s a natural extension of everything that we’ve been doing as we try to make our customers successful with moving quickly, moving with confidence and shipping value to their users.
I believe we have a lot of common perspective on that, and I think we’re going to make something pretty cool together. So thanks for joining me. It’s been great having you, both on this podcast and in our organization. And if you’re listening to this, if you’re a CircleCI customer, I will tell you that there are exciting things coming. Hopefully, you’re excited about what we’re doing here. So watch this space, or probably more likely our site, and I’m excited for what we’re going to package up and put out there together. Thanks again, Olaf.
Olaf: Yeah, you’re welcome. And I can only repeat that message, really excited to work together on this, and let’s see what kind of cool, valuable things we can pull off together.
Rob: Right on. Thanks everyone for tuning in. If you like this, or you want to give us some thoughts on who we should be talking to or what we should be talking about, find us on Twitter at CircleCI, and subscribe in all of your favorite podcast services. Talk to you soon.