Episode #118: Deploying on Fridays with Charity Majors

November 8, 2021 • 48 minutes

On this episode, Rebecca and Jeremy chat with Charity Majors about the role of ops in a serverless world, why deploying on Fridays shouldn't be a source of anxiety, the importance of single merge deploys for fast feedback loops, her new book on Observability Engineering, and so much more.

Charity Majors is the co-founder and CTO of Honeycomb. Before that she worked at Facebook, Parse and Linden Lab on infrastructure and developer tools, and she always seems to wind up running databases. She is the co-author of "Database Reliability Engineering" and the upcoming "Observability Engineering: Achieving Production Excellence" book published by O'Reilly.


Jeremy: Hi everyone. I'm Jeremy Daly.

Rebecca: And I'm Rebecca Marshburn.

Jeremy: You're listening to Serverless Chats. Hey, Rebecca.

Rebecca: Hey. I am so excited that I'm going to say no banter for us today. Tell us who we're listening to.

Jeremy: Yeah, I agree. We have an amazing guest today. Our guest is the CTO and co-founder at Honeycomb. Co-author of Database Reliability Engineering, and the upcoming Observability Engineering: Achieving Production Excellence book. She's a regular speaker, a writer of some absolutely amazing articles. Charity Majors is with us today. Hey, Charity. Thanks for joining us.

Charity: Hey, thanks for having me.

Jeremy: We noticed, we were talking quickly before the show here. We noticed you were doing some painting. Besides painting, what else have you been up to lately?

Charity: What else have I been up to? Gradually starting to get back to the world. I went out for drinks with Norah Jones, Wednesday night. Then I was supposed to go out last night, but I realized that going out once on Wednesday blew my entire social budget for the rest of the week. I just, it's too hard. I couldn't do it. I'm going to have to work my way back up slowly. The introverts.

Jeremy: Right, well I'm getting on a plane to go to re:Invent in a couple of weeks-

Charity: Holy shit.

Jeremy: I'm like, "I haven't been on a plane since February of 2020, I think," when I came back from Nashville. Somebody else posted the other day, they're like, "Traveling for the first time since the beginning of the pandemic. Don't think I remember how to do any of this stuff." I'm-

Charity: Basically.

Jeremy: I don't know. It's going to be an adjustment, getting back into all this stuff.

Charity: Yeah.

Jeremy: Anyway, we have a whole bunch of things that we would love to talk to you about today. I wanted to start actually, with an article that you wrote quite a while back, which was called The Future of Ops Careers. In that article, you actually called out the serverless musical that I did. The Lambda serverless musical.

Charity: I did, didn't I?

Jeremy: You said that, although you liked the song, you thought it would be better, rather than saying that serverless reduces ops, to say it's really more about improving your ops. I totally agree with you too, and I think we had a little bit of banter back and forth on Twitter about that.

Charity: I also completely concede that the rhyme was better the way you did it. It was snappier. I get it.

Jeremy: Well I think that was my argument. My argument was-

Rebecca: Style points.

Jeremy: There was a lot of rhymes going on-

Charity: It was snappier. I get it.

Jeremy: ... in that, so it did work out well.

Charity: Yeah, yeah.

Jeremy: But actually, I think you bring up a good point. Because I totally agree with you. One of the things I never tried to do, or I certainly don't try to do, and I know some people do, is to equate this idea of serverless with no ops. I don't think in any world, that serverless is going to allow you to completely ignore ops.

Jeremy: I'm just curious, from your standpoint. In that serverless world, I think you reduce certain types of ops. But you can also do other ops better. What's your take on that? What do you think serverless enables ops teams to do better?

Charity: Well fundamentally, operations is just the practice of delivering software to users, and doing it well. You can't, if you have users and you have code, and you want to get one to the other, you can't have no ops. What was your question again?

Jeremy: In the serverless world, if you are reducing certain-

Charity: Oh, yeah. Yeah yeah yeah. Totally.

Jeremy: I think this goes back to maybe this idea of toil. I think that, unfairly sometimes, operations gets put into this bucket of toil, and I think-

Charity: For sure, because they're the defenders of last resort at the castle.

Jeremy: Right.

Charity: Right? If the dragons have gotten through all the other teams, and all the other soldiers, ops is the ones who are going to make sure that there's still a company tomorrow after the outage. The shit lands on us if it hasn't been fixed further upstream, indefinitely. I do think there's a real place for SRE teams, or ops teams, or whatever you want to call them in a serverless world, while also acknowledging that the lines are blurred. Most ops people are now software engineers as well, because it's all about automating. It's all about moving up the stack.

Charity: But there are a couple of different models that I've seen companies be really successful with. One of them is using ops as internal consultants, basically. They're expert in systems. You own your code, but ops, they know how to get the alerting, and the SLOs. They know what good software systems smell like and look like, and they know how to make sure that you're integrating the security, and all this stuff. Which is a very real basket of expertise, that most software engineers don't have, and never will, because they're not being exposed to quite as many fires on the front end.

Charity: Fender is one of our customers, who's heavy into serverless. They have a great, world class SRE team. The SRE team is focused on enabling and empowering the software engineers to own their code in production. Making it so that the observability tooling uses reusable libraries. That there's a naming schema for the fields that you're using, and making sure that, when reliability is going down, that you've got the coach there to help you figure out how to get it back.

Charity: Yeah, I think that we're getting better. We're such pessimists in this industry. But we're legitimately getting better at doing software. We're able to do more with less, at an accelerating rate, and more and more teams are leaning heavily into the things that we've been talking about, ever since Jez and them published the old CI/CD book. These technologies and these practices are finally coming of age. The speed of processors, and disks, and everything just keeps getting cheaper, and so we're able to collect more, and do more with it. Blah blah blah. But things are getting better, and I think this speaks to a lot of the movement that SREs and ops have been going through in the last few years too.

Rebecca: You're talking a little bit about this, almost what I would say, you said blurry lines, right? But maybe it's a permeable membrane between ops folks, and, I don't even want to say ops folks. It's two separate buckets. But let's say ops people, as you said, are even turning into software engineers, and those things are related-

Charity: I've always liked the term systems engineer.

Rebecca: Yeah, systems engineer. I do too.

Charity: I feel like it's almost like we're the code pessimists. Yes, we can write code, and we will do so as a last resort. After exhausting all other possibilities, which is a really good thing to have in your system, is people who approach it from that perspective, instead of just, "More code. More code. More code." Ops people see more code, and they're just like, "Ugh."

Rebecca: Yeah, so we think about that idea of systems engineer. I'm wondering if you have any advice for ops minded people, who maybe need to also help reconfigure their brain, and also move up the stack, and be a systems engineer? Perhaps less focused on what used to be, "Here is the ops bucket," and then, "Here is the engineering bucket." Do you have any advice for them, to be like, "You might be looking at the wrong skills now. Here's how to perhaps think about it, in this world we're in"?

Charity: Yeah. Well that's why I wrote that piece that Jeremy referred to upfront, where I quoted his wonderful rap. Which, if you all haven't listened to it or watched it, you really owe it to yourself to look it up. It's fantastic.

Charity: But yeah, I think that there's a divergence. There's a fork in the road that's coming up for ops people. This has been a long time coming. I remember when I got started, when I was 17 and in college, I was the sys admin for the math stat department. This involved everything from going down to the colo every time the database went down, to flip a switch, or to manually jiggle the RAM. There were no remote switches, or anything like that. You were responsible for everything. If you were doing databases, you had to figure out the RPMs of the disks, and how to position the indexes at the right ... You were doing everything.

Charity: I for one never want to have to go to a colo again. I'm delighted that AWS does that part of my ops for me. They're better at it than I am, and that's all great.

Charity: As systems engineers are moving up the stack too, it used to be that any technology company had to employ just an army of folks to do everything from, hell, I used to be the mail admin. I was so good at Postfix, and ClamAV, and antivirus software and all this shit. I'd run IMAP, and debug it for the CEO. We had to run mail. We had to run DNS. We had to run, just everything. That was really costly, and it was really expensive. It was a lot of territory to cover.

Charity: Now, the reason that we're able to move so fast, and have so much innovation, and so much speed, and startups that are spinning up every day, is because so much of that has become commodity. So much of it, we don't have to deal with anymore.

Charity: Instead of being this full stack systems engineer, to coin a phrase, I think that the only companies that really need that experience, are the closed systems. The Facebooks, the Googles, etc. If you're in infrastructure, you've really got two choices. If you love your infrastructure, you want to eat, sleep, and ... If you love being a mail admin, you don't want to give it up, you have to go work for a company that does that as a service. You have to go work for an infrastructure company, so that you're not a cost center, that just is trying to be minimized.

Charity: You need to go where the stuff that you do is best in class, for the entire world, providing it as a service. Whether that's, I used to run mailing list software. Now we've got all of these companies that do it for us. You've got to go somewhere, where what you want to do is the specialty. It is what you are doing.

Charity: Or you can go the direction of being a high level ... Somebody who understands systems well enough. But your job is not to build software, it's to empower software engineers. You're almost more like an in house coach. You're the one with all of the industry experience about how to do SLOs, and when is the right time? Is your backup strategy sane, and redundant to the right numbers of reliability?

Charity: Your role is almost more of, one, being an educator than anything, a lot of times. It's about helping your company figure out when to lean into trends, and invest in them, and when it's a waste of time. Your job is to make sure that every fresh young engineer that comes in the door gets indoctrinated in the ways of the system, and the on call, and the pager, and learns how to pull their weight.

Charity: Sometimes you may jump in, and you may be the release engineer. You may be the CI/CD. All of that socio-technical stuff that goes into running systems. But it will be constantly changing. It will be a thin ... It's never going to be a large cohort of people. It'll be a ratio, maybe one SRE to every 10 or 20 software engineers. It's a very senior role, I think. It's really hard for me to see where the new junior systems people are going to come from, which is a whole different topic. I'll stop now for breath.

Jeremy: Well I was going to say, if you haven't administered or run your own mail server, then you just haven't lived. Those were the days-

Charity: Amen.

Jeremy: I remember that as well. I remember the first-

Charity: God, I miss grepping my mail spool, don't you?

Jeremy: Right, yeah. Exactly. Or the first time I got a remote KVM for my co-location facility, it was magic. I could control machines remotely-

Charity: Happy birthday.

Jeremy: Oh, it was so-

Charity: I know.

Jeremy: I think you make a really interesting point though about, there was this class of engineer or operations person, that was a systems engineer. It was managing those actual systems. Then we move to a point, where even the configuration of some of these systems, even some of the failover or the tolerance, some of these things are being built by the developers who are coding these things with infrastructure as code, and handling some of that.

Jeremy: But if we go back to the toil thing, I always looked at my maintenance of my infrastructure, when I actually had to go to a co-location facility at 2:00 AM to swap a drive. That was just things that were, it provided value, in that I kept the system up and running. But it really added no value to the work that I did. I didn't feel valuable going and swapping a drive. But-

Charity: No, there was a lot of wasted time and effort there.

Jeremy: A lot of wasted time and effort. Now you move to where you've got these massive data centers, and the cloud's running them for you. You've got managed services, and you move to this next level. You said something like a CI/CD engineer, or something like that. If I'm a developer, and I'm trying to move fast, I can learn Lambda, and I can learn Fargate, and I can learn Kubernetes maybe in terms of how I might do some configuration there. Maybe I learn DynamoDB, or some other specialty services.

Jeremy: But do I want to go deep on IAM? Do I want to go deep on the CI/CD release process? Do I want to go deep on making sure all my KMS keys are rotated on a regular basis?

Charity: God no.

Jeremy: I do think that there is a lot of work that can be done, that still, I hate to call it toil. But the things that are specialty, that someone needs to learn, that still-

Charity: It's platform engineering.

Jeremy: ... your developer doesn't want to do.

Charity: Is how I think of it. It's platform engineering. I don't like to think of it in terms of what the developers want to do or not. Because they can be prissy little fucks sometimes. But the point is-

Jeremy: We'll put the explicit thing on this episode. That's fine.

Charity: Sorry. But they're not wrong sometimes when it's, this is a thing that, we don't want every developer to go off and solve it in a different way. We want there to be a golden path. We want it to be well researched. We want it to be flexible, but not wiggly. We want to make sure that there's standardization. We want to make sure that the choices are well understood.

Charity: If it's something that is being built for the platform, then you'll get into the situation where people are, it's like musical chairs. The first one to hit it has to take a detour for two or three weeks, to build this authentication thing or something. That's just, it's a cost that's being borne unfairly by some developer.

Charity: Now, there is still definitely a role for building platform stuff. That is a pretty software engineering-y role, but it's a hybrid systems and software. It does assume a lot of good judgment. So yeah, but I do think that platform engineering teams are, I think slowly taking over a lot of the territory that used to be SRE ops, and absorbing a lot of those people.

Charity: We actually, at Honeycomb, we don't have any ops team. We have a platform engineering team, which has one self described SRE on it. We're getting pretty large. But the only reason that we've been able to do that, of course, is because we had ops expertise from day one. So we never got into that hole that most companies get into, where they're like, "Oh fuck. We have a bunch of code. Now we have some users. Shit." Then somebody has to come in, and undo everything they've done, and do it correctly, and all that stuff.

Jeremy: Right.

Rebecca: Charity, what I love about your answers, is that there's a lot of philosophy baked into them. Whether or not we couch it as philosophically, "Let's talk about systems engineering," or, "Let's talk about where ops is going." There's another philosophical umbrella that you hold really dear, which is philosophies around deploying on Fridays. I know that's one of your favorite topics. Honestly, if anyone is interested in reading more of your thoughts on the matter, we will both recommend, Jeremy and I, that you just Google, "Charity Majors Friday deploy." That's it. It'll be a good surprise for you.

Rebecca: But Charity, I'd love maybe if you could start with your general thoughts on Friday deploys. I'd love to set the baseline for the Charity Majors philosophy on this.

Charity: Sure. This is some place where I think people like to paint me as a radical, and just a fire breather or something. I'm not. I am from ops. I am a pragmatist at heart. I am not trying to draw a line in the sand and be like, "If you don't do this, you are terrible." My point is just that, if you are unable to deploy for a solid 20% of your week, that's not great. That's not something to aspire to.

Charity: Some people will be like, "Well this is a sign of how much I value my people, and their weekends and everything." I'm like, "No it's not. If you really value their time that much, you wouldn't want them to only be protected from this on Fridays. People get paged overnight all week long. If you really care about your people, you will fix this problem, so that deploys are safe, reliable, easy, easy to flip back and revert, easy for engineers to deploy their own shit. That should be a nothing-burger, right? It should be, it's the heartbeat of your system. Heartbeats should be regular, predictable, trivial, like nothing. Small diffs, one engineer at a time. As you go out constantly, all day long. If you've done these things, it will make absolutely no sense to you to block off Fridays."

Charity: There are some places where some people have extreme situations. Some people, blah blah blah. I don't know everyone's specific situations, and there are some situations that I'm like, "Okay, that's valid." But for the most part, I hear a lot of excuses. For the most part, I hear a lot of shit that makes me see that they just haven't done the work. If it takes them an hour or two to get their code out, I'm pretty sure they're batching together a whole bunch of engineers' code at once. Which, there are five things wrong with this, you know? If it takes you that long to get shit out.

Charity: The amount of time it takes you to get a single line of code out into production is extremely telling. Can you do that in 15 minutes or less? If so, you can probably ship on Fridays, no problems. If you're doing auto deploys, if you're ... It all goes together. I just don't like seeing it used as an excuse to not improve. If you can't get there, I get that. Sometimes people are dealing with a lot of legacy shit, or they've got external customer requirements. I'm just saying, you can probably get a lot closer than you are now, and it will be nothing but benefit to your people. Every step that you can get towards being able to auto deploy regularly, will help people, will save time, will make your engineers' lives more joyful. It's win win win win win, and so I don't understand why people continue to just radically under-invest in this area.

Jeremy: Right, and that's one of the things that, if people go and read a lot of what you write, you take the stance of, "There's no reason why you shouldn't be able to deploy on Friday. It shouldn't be a hard and fast rule." A lot of what you talk about, and I think a lot of what we see, is the idea that people just find deployments to be very risky. Which is really scary, when you think about it. That people are like, "If I deploy this, I could bring the whole system down."

Jeremy: You mention this idea of single or small deployments. Single individual developer deployments. You talk about all this stuff. Again, I suggest people go and read your blogs, and the things that you've written, because there's a lot of-

Charity: The thing is that, we have this instinct as humans. It's an instinct that is just ingrained in us. When things are scary, we freeze. When we're scared, if we're walking in the woods and we don't know what we're going to step on, we slow the fuck down. The problem is, with software, speed is safety. This is just something that we have to get our heads around. That speed is safety. The more quickly you can do it, the smaller the diffs you can do it, that's how you make ...

Charity: If you freeze up, and you try to enact more control and more barriers and more gates and more processes, you're only hurting yourself, a lot.

Jeremy: Right, and that's what you talk about. When you're saying small diffs, if people aren't sure what you mean by that, you're talking about doing small deployments, or merging code. Single feature, or even smaller than that, I guess. Merging that code as quickly as possible. Getting that into production. Then you have another thing that you talk about, observability driven development. Where you're building in, basically build in the tools you need to that code, so that you can observe it, and you know if it works.

Jeremy: If you can do that quickly, and you can put that code into production, if something breaks, right after you wrote that code is the best time for you to go back and fix it.

Charity: Absolutely.

Jeremy: Not for you to wait a month later, and then be like, "I don't even remember writing this code, let alone how it's supposed to work." Maybe dive into that. Why are these single merge deploys so important?

Charity: It's more about having one engineer's changes at a time go out, in large part. Because if you know that your code is going to be live in 10 minutes, you're very likely to go look at it. If you're pretty sure that your code, and up to a dozen other people's code, is going to go live at some point in the next day, you are absolutely never going to go look at it. That point-

Jeremy: You're going to blame it on somebody else probably, too. Like, "Oh, it's probably their fault."

Charity: Right, sometimes you're ... Then all of those dozen people have to interrupt their day, to try and figure out whose diff broke the thing. It's just a whole mess. But if you can get that feedback loop down, if you can write instrumentation as you're writing your code, with an eye to your future self, "How am I going to understand this?" Then you merge it. If it auto deploys, I love auto deployment.

Charity: This is the best thing I think we did at the beginning of Honeycomb, and it was unwittingly. It was just, we were lazy, so we made our CI/CD pipeline automatically deploy to production every 10 minutes. Just having that loop meant that we grew up with it, right? It was never scary. It was never hard, because it was just what we did from the beginning, and so you just knew, you could merge it, you could look at it. It becomes muscle memory. You get that dopamine hit. You don't feel like you're done until you've gotten the dopamine hit of going and looking for it.

Charity: That moment, when you have just deployed it, you have all the context in your head. You know exactly why you did it. You know what you tried, what trade-offs there were, what you didn't do, what the point of it was. What the variables are named. What the functions are named. What failure will look like. If you go look at it, you will never again be as primed and ready to understand the code, and see what's working or not, ever. Nobody could ever follow in your tracks and reconstruct what was in your head. That moment, it's only going to decay from there, and it's precious. You should look at it as quickly as possible.
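The "write instrumentation as you're writing your code" idea Charity describes can be sketched in a few lines. This is a minimal, illustrative example only; the `emit` helper and the field names are hypothetical, not any particular vendor's API. The pattern is one wide, structured event per unit of work, annotated while the author still has all the context in their head.

```python
import json
import time


def emit(event: dict) -> None:
    # Illustrative sink: a real system would send this structured event
    # to an observability backend instead of printing it.
    print(json.dumps(event))


def handle_checkout(user_id: str, cart_total: float) -> dict:
    # Build one wide event for this unit of work, adding the fields the
    # author knows will matter when they look at it in production.
    event = {"name": "checkout", "user_id": user_id, "cart_total": cart_total}
    start = time.monotonic()
    try:
        result = {"status": "ok"}  # stand-in for the real business logic
        event["outcome"] = "success"
        return result
    except Exception as exc:
        event["outcome"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        # Emitted on both success and failure, with timing attached.
        event["duration_ms"] = (time.monotonic() - start) * 1000
        emit(event)
```

Because the event carries the names and values the author chose at write time, looking at production ten minutes after deploy means querying fields they already understand.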

Charity: Another thing that I like to talk about with the deploys, is just the idea of judgment. Having good engineering judgment. I think part of the way, one of the only ways that you really develop good senior engineering level judgment, is by immersing yourself in production. Not just looking at it when it's broken. Also looking at it when it's good. Because that's how you know what normal looks like. You ship your change, and you go look at it. Nine times out of 10 or more, it'll be fine. But that trains you over time, "This is what normal looks like. This is what weird looks like."

Charity: When I say, "Don't deploy on Fridays," what I actually am saying is, "Don't merge your code, if you don't have the time to stick around, and make sure it's okay." Because you should assume that once you've merged your code, it's going to go out. That's an atomic operation, as far as you're concerned. You merge, it goes.

Charity: If it's 5:00 PM on a Tuesday, and you're heading for the door, don't merge your fucking code. If it's on Friday, and you're feeling icky about it, it's a big change or whatever, first of all, feature flags are your friend here.

Jeremy: Yeah, I was just going to say that. Right.

Charity: But don't ship it. Don't merge. I'm not saying remove the human element of deliberation. I am saying, absolutely build that sense within yourself. But just make it about whether or not you're going to merge, instead of whether you're going to deploy. Because once you've severed that connection, that tight feedback loop between when you've written it, and when you're looking at it, it's just all downhill from there.

Jeremy: Right. I was going to bring up the point on feature flags, because that's one of the things too, where I think some people get confused. When you're like, "Well if I'm writing some big new feature, I want to wait until that feature's done, before I merge that into the codebase." I always look at it, and I say, "Well no, break the feature down to say, 'I have to deploy some new resource.'" Well, deploy the new resource, and guess what? None of your code touches it. But make sure that the resource deployed. Make sure it's there. Have any tests you need against it. Then do another release, where you add the ability to connect to that resource. Or whatever it is.

Jeremy: You can do that incrementally. Again, you can turn the whole thing off, so your code never gets run for 99.9% of your users. Except for maybe you, or a couple others, that you can turn that feature flag on.

Charity: Code is like an iceberg. There's so much more underneath it than there is on the surface area. There's still a lot of value in getting it out into production, even if you're not, "Using it." It's still being used. It still has the potential to break, or deploy cleanly. Small diffs still matter, even if you're like, "Well, but it's not ..."

Charity: Really, the thing about feature flags that's so brilliant, is that it decouples deploys from releases. It makes releases something that the product people think about, or the marketing people think about. Deploys can be something that engineers think about. You don't have to have all this waiting on each other, and timing, and big bang deploys.
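The decoupling of deploys from releases that Charity describes comes down to a guard in the code. Here's a minimal sketch, assuming an in-process flag store; the flag name, the `allow_users` allow-list, and the two code paths are all illustrative, not a specific feature-flag product's API.

```python
# Deployed code carries both paths; the flag decides which one runs.
FLAGS = {
    "new_checkout_flow": {"enabled": False, "allow_users": {"internal-qa"}},
}


def flag_enabled(flag_name: str, user_id: str) -> bool:
    flag = FLAGS.get(flag_name)
    if flag is None:
        # Unknown flags default to off: freshly deployed code stays dark.
        return False
    return flag["enabled"] or user_id in flag["allow_users"]


def checkout(user_id: str) -> str:
    if flag_enabled("new_checkout_flow", user_id):
        return "new flow"  # deployed weeks ago, live only for allowed users
    return "old flow"
```

Flipping `enabled` to `True` is the release: it's a data change anyone can make, with no deploy involved, which is what makes the big button for the marketing team possible.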

Charity: The worst deploy in the world is always the one after people come back from Christmas break. The worst outages of my life have 100,000% been in January. The first two weeks of January.

Jeremy: That's the thing too. I love that idea of saying, deployments are what engineers think about. Releases are what marketing thinks about. Maybe the best thing that you can do to relieve the stress, is give your marketing team a big button they could press, to flip the feature flag for you. Then the engineers don't have to do anything. It's just-

Charity: Yeah, exactly.

Jeremy: You've enabled it. It's been running for weeks, and it's been working fine. So you want to release it to everybody, hit that button and then you're good to go, and-

Charity: Yeah, and it will allow you to-

Jeremy: You can be out having beers on a Friday afternoon, or whatever it is you want to do.

Charity: It will allow you to first release it to maybe internal users only. Or only the marketing team. Or only your beta users. You should never think of deploys as an on-off switch. You should think of it as a process, sometimes quite lengthy process, of increasing your confidence in new code. Progressive deployments are your friends. Canaries are your friends. Having all these little wiggly things around deploys, it's a process. It's like it's baking in the oven. You can't trust it until it's been through its paces.

Rebecca: Yeah, I like how your phrase was, "Speed is safety." I think what you're saying is, do the work to eliminate the things that feel scary. Or not even that feel scary. That could be scary. Do the work so that those things are not scary. The way to do that is through speed. I think you were going to say something. Before I quote you, do you want to follow up?

Charity: Go for it.

Rebecca: Okay. One of the quotes that I love from you is, you say, "Anxiety related to deploys is the single largest source of technical debt in many, many organizations. Technical debt, lest we forget, is not the same as, 'Bad code.' Tech debt hurts your people." I think there's something here between bad code, and let's talk about that, versus code smell. There's maybe some Charity philosophy here that I'd want to dig into a little bit more.

Charity: I just think that the anxiety that people have around deployments, I get it. I've been in ops. This is not about making everyone a masochist, like we have been traditionally. It's about, that this is really the way that it gets better. Resiliency, reliability is not about making your systems never fail. It's about making it resilient to lots of failures.

Jeremy: One of the things that you've also talked about in the past, is just the idea of fast feedback loops, being able to invest time in ops. I'm curious, because I saw a tweet, I think it was from Liz Fong-Jones. Maybe it was just yesterday. Where she retweeted something from another ops team, who said, "The number of these ..." what do they call them? I guess developer influencers, "... who talk about how CI/CD is supposed to be done, and how all this ..." That those practices aren't even followed in the organizations that they're working in.

Jeremy: Because it's hard. There's so much legacy stuff. There's so many things that make it super hard to do.

Charity: It is hard.

Jeremy: I'm just curious. When it comes to driving these types of changes, do the developers have to do it? Is it the existing ops people? However we want to classify them. Is it management? Where does this change come from? Where does that confidence, how do you build that confidence within your organization? Who has to drive it?

Charity: Well, I'll take anyone I can get. But ideally, I believe this is a responsibility of engineering managers. I think that it's their job to translate between engineering needs and business needs. It's their job to make the case upward, for why we need to miss these feature release dates, in order to deal with this tech debt. It's their job to make the case that it's worth investing in internal tooling.

Charity: People do not do nearly enough work on internal tools. There was this great essay by Coda Hale, about work, and how you scale a team's productivity. Best case, if you're just hiring people, you can achieve linear productivity growth. Nobody actually achieves that. The only way that you can actually achieve the kind of productivity growth that you need, in order to keep doing more work with fewer people, is by investing in those internal tooling systems. By doing migrations. By keeping up your tool chain.

Charity: So many developer resources are just wasted these days. Have you ever looked at a company and, "Oh, cool app." Then you figure out they have 1500 developers? You're like-

Jeremy: Right? "What are they doing?"

Charity: "What the fuck are they all doing?" I guarantee you, they have a nightmare for a release process. It magnifies. It takes over from there. It all starts at that one little interval, between when you write it, and when it's live. If you don't get that right, you get into this death spiral, where it takes longer. You've got bigger diffs, and you're deploying more people's code at once. It takes longer to figure it out and recover.

Charity: I've got a number that I've pulled out of my ass, but I'm pretty sure it's true. Which is that, if it takes X number of engineers to build and support your system with a 15-minute time to live, then if it takes them an hour or two, it takes twice as many engineers. If it takes almost a day, it takes twice as many engineers again. If it takes a week, it takes twice as many engineers again.
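Charity's back-of-the-envelope rule can be sketched as a tiny calculation. The deploy-time tiers come from her remarks above; the baseline headcount of 10 is purely an illustrative assumption:

```python
# Sketch of the heuristic: each jump in deploy time
# (15 min -> 1-2 hours -> a day -> a week) roughly doubles
# the engineers needed to build and support the same system.
DEPLOY_TIERS = ["15 minutes", "1-2 hours", "a day", "a week"]

def engineers_needed(base_engineers: int, tier_index: int) -> int:
    """Double the headcount for every tier past the 15-minute baseline."""
    return base_engineers * (2 ** tier_index)

for i, tier in enumerate(DEPLOY_TIERS):
    print(f"deploy time ~{tier}: ~{engineers_needed(10, i)} engineers")
```

With a hypothetical baseline of 10 engineers at a 15-minute deploy, the tiers work out to 10, 20, 40, and 80 engineers.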

Charity: If anything, I'm being conservative. Because Facebook showed that the amount of time that elapses between when you write the bug, and when you find it, the cost of finding and fixing that bug goes up exponentially. Engineering managers need to be making this fucking argument. They're all asking for headcount, when in fact, you want to double your engineering cycles? Fix your fucking build pipeline. Engineering managers are too, I don't know what the word is. They're trying to fit in with all the business managers or something. Or they're not confident enough.

Charity: This is a winning argument. This is an argument that will make your teams ecstatic. This is an argument that will get you promoted. This is an argument that works. I've never heard of anyone who invested time making their deploys better, and was afterwards like, "Oh man, we shouldn't have done that." It fucking works. We need to be more aggressive about this, because somebody's got to start it. I don't care. Sometimes it's a CTO. Sometimes it's the VP of engineering. Sometimes it's a director or a manager, a senior engineer.

Charity: If that's you, even if you're a junior engineer, start looking for ways to improve it. There's a lot of low-hanging fruit, I guarantee you. Making your build pipeline 15 minutes or less, it's just engineering. Instrument it as a trace. Use Honeycomb's free tier, build events, instrument it as a trace. See where the time's going, fix it. Every question that you have about that build pipeline, and where the time's going, is a great question to have. Do you still need that 15-minute thing that builds, packs, and ships artifacts? Can you parallelize this? Is this still relevant?
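The "instrument your build pipeline as a trace" idea can be sketched without any particular vendor's SDK: wrap each stage in a timed span and emit one structured event per span. The stage names, sleeps, and the in-memory `EVENTS` list below are illustrative stand-ins; in practice you would ship these events to a tracing backend such as Honeycomb:

```python
# Minimal sketch: each pipeline stage becomes a span; every span emits
# one structured event with its duration, so you can see where time goes.
import time
import uuid
from contextlib import contextmanager

TRACE_ID = uuid.uuid4().hex
EVENTS = []  # stand-in for an exporter to a tracing backend

@contextmanager
def span(name, parent_id=None):
    span_id = uuid.uuid4().hex
    start = time.monotonic()
    try:
        yield span_id
    finally:
        EVENTS.append({
            "trace_id": TRACE_ID,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
        })

# Illustrative pipeline: one root span with two child stages.
with span("build-pipeline") as root:
    with span("compile", parent_id=root):
        time.sleep(0.01)  # stand-in for real work
    with span("package-artifacts", parent_id=root):
        time.sleep(0.01)

for event in EVENTS:
    print(event)
```

Once every stage is a span, "where is my 15 minutes going?" becomes a query over durations rather than a guess.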

Charity: These are questions. It just accumulates so much fucking crap, you know? Nobody's having the discipline to keep that number down. I think that, if I was a VP at a random engineering team, I would set an SLO: "Code has to go out in 15 minutes or less. Engineers, engineering managers, your reviews will factor in how successful you are at the DORA metrics. How often your teams get paged after hours, and how well you're able to stick to this time."
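The 15-minute SLO Charity describes is easy to track. Here is an illustrative sketch, where the deploy durations are made-up sample data and the threshold is the 15-minute target from her remarks:

```python
# Report what fraction of deploys shipped within the 15-minute SLO window.
SLO_MINUTES = 15

def slo_compliance(durations_minutes):
    """Fraction of deploys (merge to live) that met the SLO."""
    if not durations_minutes:
        return 1.0  # no deploys, nothing violated
    met = sum(1 for d in durations_minutes if d <= SLO_MINUTES)
    return met / len(durations_minutes)

deploys = [12, 9, 22, 14, 45, 11]  # hypothetical minutes from merge to live
print(f"{slo_compliance(deploys):.0%} of deploys met the {SLO_MINUTES}-minute SLO")
```

A number like this, reviewed alongside after-hours page counts, is exactly the kind of thing a manager can put in front of leadership.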

Charity: If they're like, "No no no, legacy. Corporate, and lawyers," and everything, at least start this for new services. At least auto deploy new shit that you bring up, so you aren't just adding to the pile.

Jeremy: Right, and the book, Accelerate, does point to the correlation between developer happiness and how fast their code gets into production.

Charity: Absolutely. This is, developers leave because of frustration with their tools. Nobody ever got burned out for shipping too much code. Developers get burned out from doing all the shit that's preventing them from shipping code. I'm not someone who is a fan of overworking people. I believe, as an engineer, you've got about three, maybe four hours a day of real, focused, hard engineering work in you. If you can get that in, moving the company forward three or four hours a day, meaningfully, is phenomenal. It is remarkable. That is rapid progress.

Charity: The problem is, people are working 15 hour days, and getting maybe 20 or 30 minutes of actually moving the business forward out of all that.

Rebecca: Yeah, so you've always been a big proponent of providing environments that respect and encourage this balance between work and life, and actually having productive time. Even if it's a shorter window of time. But actually use that time for being productive. I'm wondering if you have any advice for folks, who maybe think that they're doing that. How might they be able to smell out if they're not actually doing that? Is it simply a measure of, "Okay, how much time do my engineers actually have to work? What are they getting done in that?" Or what can they ask themselves, to actually reflect on whether or not they're-

Charity: Yeah, just ask yourself, "Am I moving the business forward?" Toil is not moving the business forward. Toil is clearing the road, so that you can try and move the business forward. I want to empower engineers to be able to spend most of their time on feature work. It just has to be done correctly, and that means investments. That means not leaving your technical debt until it threatens to swallow you.

Charity: If you're moving the company forward, it's a happiness index as much as anything. It makes people happy when they're able to move things forward. I think you could probably just take the temperature of every engineering team. I've done this before in the past. Just ask them, "On a scale of one to 10, what percentage of your time do you think you spend moving the business materially forward? 50% 60%?" People get pretty fucking close. They know how much of their time is worth spending on what they're doing.

Jeremy: Especially when you have a bunch of developers stuck in meetings all day. That's super productive. We're starting to run out of time, and I just want to quickly ask a question that we had from one of our listeners, who knew you were going to be on the show. They just wanted to know if you've seen a significant difference in dev ops practices, between people who are building serverless infrastructures, versus other infrastructures? Or is there a way to see that breakdown? Or is it pretty much the same?

Charity: That's very interesting. The lack of real durability of the data is certainly an interesting thing with serverless. The fact that, I don't know what state of the art is now. Certainly when I was really looking at it, there wasn't much of an idea of staging environments, which I kind of dig. I'm a fan of testing in production, as you might know.

Charity: I think that it forces you to treat your code as something that you're constantly, you're in conversation with it in production every day. You have to go check it in production. I think that the way that serverless approaches instrumentation is the way I see the entire industry should be approaching it. Which is, approach it as though you're asking your code itself to understand, from its perspective, "Where is my time going?", etc. Not paying so much attention to hardware metrics, or the shit under /proc, or all of these meaningless dashboards that you really shouldn't have to look at as a software engineer.

Charity: Just having a habit of capturing every HTTP call, and every database query. The time elapsed, and the raw query, and the normalized query. Just gathering all of this stuff, aggregating it around the request, and shipping it off as an arbitrarily wide structured data blob. I think there are differences there. But I don't really feel like I'm completely up on serverless stuff enough to really comment.
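The "arbitrarily wide structured data blob" pattern Charity describes can be sketched in a few lines: accumulate everything you learn about one request into a single event, then ship it at the end. The field names, the sample query, and the in-memory `SHIPPED` list are illustrative assumptions, not any particular vendor's schema:

```python
# One wide event per request: keep adding fields (HTTP calls, DB queries,
# timings) as the request runs, then ship the whole blob once at the end.
import time

SHIPPED = []  # stand-in for an exporter to an observability backend

def ship(event):
    SHIPPED.append(event)  # in practice: send to your backend of choice
    print(event)

def handle_request(path):
    event = {"request.path": path, "db.queries": []}
    start = time.monotonic()

    # Record each database query: raw query, normalized query, time elapsed.
    q_start = time.monotonic()
    # ... the real query would run here ...
    event["db.queries"].append({
        "raw": "SELECT * FROM users WHERE id = 42",       # hypothetical query
        "normalized": "SELECT * FROM users WHERE id = ?",
        "duration_ms": (time.monotonic() - q_start) * 1000,
    })

    event["request.duration_ms"] = (time.monotonic() - start) * 1000
    ship(event)  # the arbitrarily wide structured blob, aggregated per request

handle_request("/users/42")
```

Because everything is aggregated around the request, a single event answers "where did this request's time go?" without stitching together separate metrics.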

Rebecca: As you said, January is one of your favorite times to come back and have a release, because it's always the best, right after Christmas. And something really exciting is happening in January of 2022: your new book, Observability Engineering: Achieving Production Excellence, is coming out. You co-authored it with Liz Fong-Jones and George Miranda. Super exciting. Love to be able to say their names on this show as well.

Rebecca: For those with an O'Reilly account, you can actually go and read the early release version. But Charity, will you tell us a little bit about who this book is for, and what people should expect to get out of it, come January, 2022?

Charity: Yeah. We've tried to make it very approachable and friendly, and written in English. It's just for people who, whether you're new to observability, or whether you're fairly experienced in the field, we tried to make it very practical. Where do you start? What is it? How is it different? Separate some of the marketing hype from the reality. What trends, both technical and architectural, are driving it? Who needs it? Who doesn't need it? Should you buy, or should you build? How do you instrument? How do you do traces? What's some of the philosophy behind it, and also some of the ...

Charity: We've got a guest chapter from Frank Chen at Slack, on CI/CD plus observability. We've got chapters on what tools you need to use, in order to achieve high cardinality, high dimensionality. We've got robust explanations for why it's not observability, if you don't have those things. It's less philosophy, and that sort of definition, and more, to the point that in some places, it's a little bit of a grab bag of stuff. It's just like, "This was useful. Put it in there. This was useful, put it in there. It's not really a chapter, but put it in there."

Rebecca: I wouldn't mind having a book that's just titled, This Is Useful. That's great.

Jeremy: Well I've read a couple of chapters, because I do have an O'Reilly account, which is super useful by the way. Getting to read some of these books as they're being written is really helpful. But yeah, practical advice. Ops minded people, I think this is definitely the book for them to pick up and look at, so-

Charity: And for software engineers. And for software engineers-

Jeremy: And software engineers, right.

Charity: ... who take their responsibility seriously. Yeah.

Jeremy: Right, absolutely. Totally agree. We are out of time. Charity, thank you so much for being here, and sharing all this with us.

Rebecca: Thank you, Charity.

Charity: Yeah, my pleasure.

Jeremy: If people want to find out more about you, and Honeycomb, and all that stuff, what's the best way to do that?

Charity: My personal site is charity.wtf, and I'm on Twitter, @mipsytipsy. There's also the Honeycomb.io/blog.

Jeremy: Awesome. Well, we will get all that in the show notes. Thanks again, Charity.

Charity: Sweet, thank you.