Episode #77: Serverless for Operations with Ryan Coleman

November 30, 2020 • 67 minutes

On this episode, Jeremy chats with Ryan Coleman about how his work as a sysadmin and stint at Puppet helped fuel his passion for Ops teams, why serverless allows Ops to apply their creativity, what operations looks like in a serverless world, and so much more.

About Ryan Coleman

Ryan Coleman is Vice President of Engineering and Product at Stackery, a serverless platform to design, develop, and deliver modern applications. Ryan is an accomplished product manager and ex-sysadmin who spent the last decade working with enterprise operations teams in the Fortune 100 to automate global infrastructure with Puppet.

Stackery: www.stackery.io
Twitter: @ryanycoleman

Watch this episode on YouTube: https://youtu.be/tEa2eJLwjZA

Transcript

Jeremy: Hi everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm chatting with Ryan Coleman. Hey, Ryan, thanks for joining me.

Ryan: Hey, Jeremy, good to see you. I've been looking forward to this chat all week.

Jeremy: So you are the Vice President of Engineering at Stackery so why don’t you take a minute, tell listeners a little bit about your background and what Stackery does.

Ryan: Yeah, so I'm mostly a sysadmin by trade. I've kind of been tinkering with computers most of my life, and sort of to pay for my college I started doing IT support. That led to more advanced operations roles, which led me to some VM automation software called Puppet, which led me out here to Portland, Oregon from Pennsylvania to help Puppet grow and work in product management, professional services, and sales. I got to wear a bunch of different hats and, more importantly, explore enterprise operations at some of the largest organizations in the world. And yeah, then I moved on to Stackery this year to help them with their infrastructure as code platform, which focuses on AWS serverless. It’s trying to help people sort of design serverless architectures, express that in AWS SAM infrastructure as code as well as some other languages, and then just provide sort of a workflow for delivering that: bringing environments to deliver different changesets over the AWS infrastructure, all that kind of stuff.

Jeremy: Awesome. All right, so there's always debate in the serverless sort of ecosystem or peripheral ecosystems to serverless that talks a lot about this idea of no ops or dramatically reducing your Ops. So I tend to believe that serverless dramatically reduces your ops because there's less things you have to worry about but I don't think that reduces the amount of operations work that can be done. And I think you bring a really interesting perspective because Stackery is a hundred percent focused on building serverless architectures, which is great, but it is for operations teams, right? It's not really for your front end developer. I mean, your front end developer can use it or a developer can use it, but it's very much so focused on the idea of bringing a cohesive operations, I guess, I don't know, sort of like Mantra to a serverless infrastructure.

And with all your experience, especially with Puppet, which again was also like, you know, automating pieces of the infrastructure and like turning ... saying to ops people, “Okay, we don’t need you to install patches anymore. We don't need you to do this because this can be automated.” I think you're going to bring a really interesting perspective. And I'd love to talk to you pretty much about operations for … you know, that serverless is really for operations right, in a sense. I don't know, maybe that makes sense, maybe that doesn't. But maybe we could start by just going deeper into your background at Puppet. Like what were you finding when you were bringing in essentially an automation software to take some of the operational load off of the operations teams?

Ryan: Yeah, that's wonderful. I think there's a couple ... there's like two big cultural trends there that I think are worth talking about. I joined Puppet in late 2011, so think early configuration management movement, early DevOps movement. Everyone was kind of chasing this idea of, we can automate configurations on VMs, like it’s very ops-focused, but we can automate everything about the VM provisioning, configuration, and maintenance process, and we're going to reform how we think about these teams. I like to think about how traditional development has this sort of waterfall effect where the business is coming up with why we're doing software, why we're doing IT. Development is getting to decide, well, what are we going to do to solve this business need, and operations is that tail of like, well, how do we actually get this in front of customers?

And it usually flowed in that direction, and DevOps was a lot of saying, well, let's kind of get together into some form of a circle. But in classic IT operations, ops was always chasing everything even if they were in the circle. They still had to maintain all the infrastructure over time, they had patch cycles, they had upgrades to do, so they were never really participating in that full loop as much as the business and the developers were. And so ... I came into Puppet as a Professional Services engineer during those two big kind of cultural movements, and I got to go to both public and private trainings. In the public trainings I would be teaching 30 people hands-on for three days, right, eight hours a day, and it's a mix of lecture and it's a mix of hands-on labs.

And these people generally were operations folks who were told by their business to come and attend. Some were leaning in and interested, others were doing it because they were told to, and they didn't have a whole lot of exposure to infrastructure as code. They oftentimes were learning version control systems like Git for the first time, right. They were pretty behind development trends on that, and they were also really new to this concept of automation, although at the time really everybody was for VM automation. So in the public trainings we kind of had this mix of characters, and in the private trainings I was there to also deploy the software and you would get this sort of room of characters. That was my favorite time because you had the people who were going to learn Puppet and you had the people around them: the developers, the product owners, like the other representatives of that triad.

And what I found in that was that so often people were hostile towards me, towards the company, towards this idea of automation. You'd get this sort of persona who I would see at every one of these trainings: you can find the person who's just sort of back in their seat, arms crossed, really just not thrilled to talk to you. And you would start to try to open them up a little bit and they would just be like, “I don't understand why we do this thing. My job is working just fine. This automation is just going to remove my job. Why should I even be here?”

And then they would hear about what Puppet does, and they would see me use it. They would go through a lab of their own and by lunchtime, they were asking questions and by the end of that first day arms weren’t crossed anymore. They're leaning ahead in their chair and they're having conversations with me at the end of the day about like, “Wait a minute. So, all this stuff that I hate about my job, this thing just cranks through it and I'm still the decision-maker about what's going on and I get to control this process through?” Like they thought this thing was just going to be some magical AI that was going to totally eliminate their roles. But in the end, I think what is kind of relevant to your question here, it's about freeing someone up to do the creative work in their job, to make decisions that help the business, that do the work that helps the human brain be its most effective, and all this repetitive work where humans make the most mistakes, where it’s most stressful to cause outages like that stuff is what automation on Puppet was solving for and what I think serverless solves for operations as well.

Jeremy: No, and I think there's two conversations there as well. I mean, you have this idea of automating away certain things that are prone to error, right? So anytime you have to set up a new server and you have to install different, you know, different libraries, and you have to make sure the configurations are correct, and it's got to connect to these load balancers and things like that, you're using Chef or using Puppet. And using those services to do that and build those out reliably every single time is just something where someone still has to configure it, someone still has to sort of monitor it, someone still has to think about it. But wouldn't it be amazing if you could say, is there a better way for us now to maybe, you know, scale? Maybe we can work on auto-scaling as opposed to just making sure we launch a new server, or maybe we can work on, you know, some other optimization there.

So there's that one piece of it, but then there's the other piece of it, which is I think where serverless brings us down further, which is this idea of taking tasks that still require humans, that still require maybe a bit of creativity, that management piece of it, and taking that burden away as well, right? So even like patching and some of those things, not all of that can be automated, right? There's still some manual work that might need to be done, but essentially outsourcing that to a managed service provider like an AWS, for example, again reduces other parts of the job. I find that type of work, that stuff that absolutely has to be done, that can't necessarily be automated, always gets in the way when you're working on something bigger. Right, like, so let's say I'm working on my CI/CD pipeline and suddenly I realize that there's some vulnerability and we have to go ahead and patch all these servers, we've got to do all these kinds of things. That distracts you from working on things that are probably much more important. So I look at this and I say the more that your ops team, you know, in quotes, can get rid of the things that aren't adding value to the business, the more time that gives them to go and start working on the things that actually do matter.

Ryan: Yeah. Yeah, and I think that's so critical. In so many of those conversations I had, the point when that sort of hostile individual started becoming curious and started to become engaged was them talking through all of their responsibilities. They're feeling that pressure from the business to meet a developer need, right? We need this service; maybe it's a database cluster, maybe it's, you know, a load-balanced web farm. They then are responsible for the customer experience of that. Do we have enough capacity for the load we're expecting? Are we spending more money than the business should really be spending? How reliable is this service? How do we monitor, trace, observe what's going on when things fail so that the team that’s responsible for that customer experience can respond not blindly to outages, right? Do they understand the architecture? Do they understand how to debug it?

And when we started having conversations about those key roles, “I’m responsible for reliable infrastructure, the customers’ experience that meets the business need and doesn't overwhelm the business in terms of cost.” Those kinds of things aren't covered by automation software, aren't really covered by, like, the core managed service that you’d be consuming in serverless. There are things that those people still need to bring to the business. That's the creative decision making, that's kind of identifying the right tools, connecting these things into a pipeline.

We talked about CI/CD as a stepping stone to saying we're going to give developers a consistent way to deliver software as quickly as they want, with the right sort of automated controls to say code has to meet certain criteria before it goes out, we can validate code with automated test suites anytime, and then I don't have to be involved in the software delivery process. I can codify what I know to be successful about that process and give everyone in the team tools to improve it. That's all the same conversation, isn't it? How do we provide software that's going to cover that need that isn't core to the business and free up the humans to spend their time on that core business need? Puppet was doing that like crazy for VM automation, and serverless I think is that next wave of saying, how do you bring that abstraction even further up? Instead of paying a vendor for software that automates the VM, what if you just pay the vendor to make the VM go away, right? And that's not applicable for every workload, not applicable for every business necessarily, but it's applicable for so many commodity services like, say, a database cluster, where now you don't really need to be managing those anymore. You just need a MySQL interface.

Jeremy: Yeah, and you know, in some of these managed services that you can use ... I mean, one of the biggest things I think that you hand off to a managed service provider is the idea of reliability, right, that uptime, right, you know, the redundancy that is built in with some of their applications. I mean, Lambda runs across multiple availability zones, right? So you never have to worry about a server going down and your Lambda function not being able to spin up. DynamoDB, same idea. A lot of these services do that.

You mentioned billing, right, which I think is a hugely important thing when you start building in a public cloud, because everything is metered, or a lot of things are metered, so you need to understand that billing piece of it. And that's an interesting place, too, where I think developers and ops people can work together, on the billing side of things. Because if a developer goes and looks and says, “Okay. Well, this is how I've architected something. This is how I want it to run, you know, using all the serverless infrastructure, whatever, and here's what the costs are going to be,” then, you know, part of the operations team's job could be to work with them on what that billing is. Look at ways to consolidate things or, you know, optimize them or whatever. But somebody's got to do a tremendous amount of research in order to find out the best way to use an individual service. And I don't know if that entirely falls on the developer or if a lot of that should fall on the operations team.

Ryan: I think it falls on that triad, right, if you have, say, a product owner. I've spent most of my background in either product management or systems administration. And if you're operating a SaaS, the product manager is that representative of the business need, right? We're offering the software service, we have a margin that we want to care about. They're making decisions about price points. They're making decisions about feature packaging. They should be owning, relatively speaking, here's how much we want to spend to operate this service. Here's how much we want to spend to build the service, right?

A lot of product management is deciding how engineers should spend their time. We want to invest X number of weeks on this feature, and all the sort of agile methodology or any other kind of scoring system is really to help say, “Hey, do developers have a good sense of how long something will take, and does the business want to invest that long for the anticipated return of that feature?” The ops infrastructure should be considered as part of that, and if you're consuming managed services, that is part of it. And so I think that is less on the developers. I think it's kind of on the operations teams and that business owner, whoever ... whatever role is being played there, to decide whether that cost is enough. But I think one of the things that I'm curious for your thoughts on here is the metered billing and how transparent that billing is. I've seen people have sort of sticker shock from that, and then I start having conversations about, well, how much time did you spend kind of building up the service on your own with, say, EC2 instances or VMs in your own data center, and the math starts getting real fuzzy real fast …

Jeremy: Right.

Ryan: And the more you poke into it the more you start to think, well, are you even considering how much energy you spent? Your wage, like, good dollars you're spending to construct the service, let alone the upkeep, are far outweighing the upfront cost of provisioning a service. And what I learned through working with these large enterprises at Puppet is that it's a real high bar. Like, you have to be in a pretty high volume production service before that trade-off starts to become so consequential that you really do want to have a build-versus-buy conversation versus really optimizing things. And that's not to say that cloud is cheap. It's just time isn't cheap either, I think is the point I'm trying to make. I’m curious if that's come up in your experience.

Jeremy: Yeah. No, I mean I think total cost of ownership is a hugely important thing that people need to pay attention to. Right, so it's not just about how long does it take you to build something or, I should take a step back, not just how much does it cost you to host something. It's how much does it cost you to host it, but also how much did it cost to build it? How much does it cost to maintain it? And then what’s that long-term maintainability look like, especially if you have turnover? You know, I mean you need to bring new people in to learn something, learn something that is already there that you wrote custom. It's so much easier to say to somebody, “Oh, hey, we're using XYZ product for this” and be able to find people who are doing that than to say, “Oh, we're using this internal product called, you know, Apollo or something like that, something we made up. You know, this is Apollo X and so this is the service that we built and now you gotta come in, you gotta learn that.” And I think that is incredibly expensive.

But going back to the idea about the … also with the sticker shock, this is where I think there's a disconnect maybe between what I think about, and I say “I” but I'm sure there are more people who think this way, what I think about in terms of the reduction of operation costs because you're using a managed service. Now, first of all, you're hiring a world-class team to maintain your DynamoDB database as opposed to hosting your own MongoDB or something like that, if you hand that off now does it cost you more? Yes. It costs more but you don't need people to actually do that. You don't need people to be managing that service for you. So that saves you money from people needing to manage that service.
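The total-cost-of-ownership point (hosting plus the cost to build plus the cost to maintain) can be made concrete with a little arithmetic. The following is a minimal sketch in Python, not anything from the episode: the `CostModel` and `first_year_tco` names and every number in it are hypothetical, chosen only to show how people time can outweigh a larger hosting bill.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Cost inputs for one way of running a service (all values hypothetical)."""
    hosting_per_month: float       # cloud bill or hardware/hosting spend
    build_hours: float             # one-time engineering hours to stand it up
    upkeep_hours_per_month: float  # patching, upgrades, on-call toil
    hourly_wage: float             # loaded cost of one engineer hour


def first_year_tco(model: CostModel) -> float:
    """Hosting plus the people time spent building and maintaining the service."""
    hosting = model.hosting_per_month * 12
    build = model.build_hours * model.hourly_wage
    upkeep = model.upkeep_hours_per_month * 12 * model.hourly_wage
    return hosting + build + upkeep


# Illustrative numbers only; plug in your own.
self_hosted = CostModel(hosting_per_month=400, build_hours=240,
                        upkeep_hours_per_month=20, hourly_wage=75)
managed = CostModel(hosting_per_month=1200, build_hours=16,
                    upkeep_hours_per_month=2, hourly_wage=75)

print("self-hosted:", first_year_tco(self_hosted))
print("managed:    ", first_year_tco(managed))
```

With these made-up inputs the managed option has three times the monthly bill but the lower first-year total, because build and upkeep hours dominate. Different inputs can flip the result, which is the high-volume case Ryan describes.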

But the way that I look at it is, just because we don't need someone to manage this doesn't mean we don't need operations people. You want to take those people who would normally be managing that database or, you know, those server clusters or whatever, and use those people to work on some of the other things that matter, like we talked about earlier. And I want to get back to something, because there are other things where I think we need to figure out what falls on the developer and what falls on the sort of ops person, how much responsibility you want to give to a developer, and where there are opportunities for them to work together. So one of these things I think has to do with observability. We've seen traditionally that monitoring solutions were about: how much CPU is this VM using? What is the memory? How many operations are running at a particular time, or whatever?

And the problem with those metrics was that those were just to keep the servers up and running. When you don't have to pay attention to those metrics anymore, then what metrics become important? A lot of that comes down to application metrics. But I think that developers have less experience understanding metrics than operations people do. So you can have the shift to say, what are we looking for, not only from an operational standpoint? Because there are still operational metrics, right? We still need to know how many invocations there were, what the latency was, you know, what the error rate is, things like that. But then there are other metrics, like how many new signups did we get today, and some of these other things. Now a lot of that can be built into sort of one, you know, observability system that not only lets you track these business metrics and these application metrics, but also then gives you a window into errors and ways in which you can either debug your applications or, you know, speed up finding where a bug is or something like that. So just what are your thoughts on that, where, you know, ops and dev kind of meet now with this new idea of observability?
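As a rough illustration of operational and business metrics living in one place, here is a small Python sketch. The `Invocation` record shape and the `summarize` function are assumptions made up for this example; they are not an AWS log format or any vendor's API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Invocation:
    """One function invocation record (hypothetical shape, not an AWS format)."""
    duration_ms: float
    error: bool
    event_type: str  # e.g. "signup" or "login"


def summarize(invocations: Iterable[Invocation]) -> dict:
    """Compute operational and business metrics from the same records."""
    records = list(invocations)
    total = len(records)
    if total == 0:
        return {"invocations": 0, "error_rate": 0.0,
                "avg_latency_ms": 0.0, "signups": 0}
    errors = sum(1 for r in records if r.error)
    return {
        # Operational metrics: invocations, error rate, latency.
        "invocations": total,
        "error_rate": errors / total,
        "avg_latency_ms": sum(r.duration_ms for r in records) / total,
        # Business metric: signups that completed without an error.
        "signups": sum(1 for r in records
                       if r.event_type == "signup" and not r.error),
    }


sample = [
    Invocation(100, False, "signup"),
    Invocation(300, True, "signup"),
    Invocation(200, False, "login"),
    Invocation(200, False, "signup"),
]
print(summarize(sample))
```

The design choice worth noting is that both kinds of metric come from one stream of records, so an error spike and a dip in signups can be looked at side by side.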

Ryan: Yeah, that's such a brilliant point and it is something to me that I found so exciting about the DevOps movement was how these people were by the business being sort of encouraged to work closer together. And I think I found in a lot of those conversations, a lot of enterprises, that everyone got a little confused about what DevOps meant. A lot of people miss the point entirely that it's about culture and about teams collaborating and about how that's all meant to align to a business need and how everyone's playing a role. A lot of people kind of got lost in well, DevOps equals certain tools and Puppet might be one of those tools, you know, my CI system’s one of those tools and that's irrelevant. Similarly, I think the task is kind of irrelevant.

People get focused on, well, is monitoring now the purview of the developer because there's fewer CPU cycles to monitor, as you kind of alluded to. I think of this more as: that human who kind of fits in that role, whether developer or operations, what are they really specializing in? What is their sort of core gift to the business? Generally speaking, I think that a developer is really skilled at taking gnarly logic problems, thinking in terms of data structures, thinking in terms of how that's going to play out in terms of providing some outcome. Whether it's backend or frontend, I think it's really about taking hard problems of, “I need to make something from whole cloth and I'm going to think about the logic and the data necessary to do that.” Whereas an operations person is thinking more in terms of systems and long-term trends. They're thinking about consequences between, you know ... it's almost like a mental Rube Goldberg machine where they're in their head visualizing all the little steps that happen. Oh, well, if this ball goes down this ramp, that's going to hit this domino and that's going to cause this other spinwheel to fly, and we don't want that to happen too quickly or else the whole chain will break down, right?

There's like this difference in sort of mental models that I've seen so clearly in all of these teams, and of course individuals differ, but that to me is the general trend. And that's where I go with this: the sort of observability and monitoring trend is an ops function. It's not solely theirs, right. As you kind of mentioned, the CPU cycle monitoring was really the ops purview because a developer didn't need to care about that, really. They just needed … they should have been having more conversations about what expectations they had, like, “Hey, this particular part of the application is going to be way more CPU heavy than memory heavy.” And then the ops team ideally is having a conversation about, well, that may change what sort of EC2 instance profile we're applying to this particular part of the application, or maybe we're going to split the application between a memory-focused VM and a compute-focused VM. That is a conversation that should come out of monitoring CPU metrics. But if that no longer exists, the ops team is still thinking about: what is the overall portfolio of customer traffic that we expect? How much are we willing to spend to kind of have overcapacity versus scale and burst on demand? How does bursting behave on this application architecture? How do I get all the different AWS monitoring options to come to bear on that problem? That should be a collaborative discussion still, but I think because of that sort of trend in system view that operations people generally bring, it's a great role for them in serverless.

Jeremy: No, I definitely agree, and it's funny. I've always looked at developers, sort of the role of the developers, as again being problem solvers, right? You're solving some sort of problem with data, with, you know, with code, with logic, whatever you're doing. And I've always looked at operations teams as the people who could then sort of implement and scale the solution to that problem. Right? Like they provide the, you know, the infrastructure for you to do that. Now that line, you know, with DevOps ... DevOps really wasn't about changing roles, it was more about just better communication, and you're right, I think a lot of people still don't understand exactly what we mean by DevOps. But that line is becoming very, very blurry as you get into serverless, right? People start writing CloudFormation maybe, but then they start using things like Stackery or they start using SAM or serverless, and it's easier and easier for the developers to now create infrastructure, and part of creating infrastructure is architecture, right? So now where's that line? How much of that architecture design is on the developer? How much of it is on the operations people? And where are there now opportunities for them to collaborate?

Ryan: Yeah. I ... this may be a frustrating response, but I don't think the fundamentals changed. Like, let's take an example of serverless. Let's say you're building a sort of modified web application that has sort of a front end here and has some back end infrastructure, maybe a few APIs and a data layer. A developer and an operations person on the same team building that stack should be having a conversation about what that architecture looks like. The developer’s going to have some constraints. Maybe they prefer this sort of NoSQL approach to data and they're going to ask for like a Dynamo key-value store, or maybe they need a relational data set, so they want more of a MySQL or Postgres. Now, the ops team is going to be talking about, well, maybe if it's, you know, let's say MySQL, should we be building that whole infrastructure from scratch? Well, we're not sure how much business we're going to get on this application. So what if we rent it; what if we're not doing that.

Now the ops team should be thinking about, well, what needs to talk to that database service? What are the permissions of that? And this is where, like, if you're an operations person who came up through, say, Unix infrastructure, or I guess any infrastructure (Windows or Unix is just changing the sort of flavor of the commands), you are the one thinking about firewall rules. You're the one thinking about file ACLs. You're the one thinking about how, you know, certain operations can happen between devices on the network, right? That's your purview. You have a mind that's primed to think about that, and all serverless does is change the shape of the boxes to check, right? Now you're thinking of AWS IAM, you're thinking about the security groups that you're applying, you're thinking about how fine-grained you can be about the database transactions a given API server can make against that database. The developer may be thinking about those things, but no, not really, right? They’re thinking more about the database transaction that they need to make on the API. And I think the beauty of serverless is saying, because the ops team doesn't have to go away for three months and come up with a new MySQL cluster that meets the business need, they can just rent one. It can be an Aurora database cluster from AWS where it's just, you know, how many compute units do I need to reserve for this workload: done.

Now, they're spending their time offering the team a really secure-by-default infrastructure. And just to be a little Stackery-biased for a moment, that's been one of our most successful feature sets: when you go in and take an Aurora database cluster and connect it to Secrets Manager, it's auto-generating the rotating credentials that get stored away in AWS Secrets Manager. You then connect that database service to a Lambda function or a gateway and it's automatically generating the IAM role that says only this specific ARN can talk to this specific ARN, and then the ops person can come in and go even further and say only these types of transactions. That is not something developers are commonly thinking about. They just want to wire it up and start writing their app, and I think that's okay. That's where ops can play a big role, especially once they're freed from the sort of big upfront project cost and then the long tail of patching that MySQL cluster.
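To show what "only this specific resource, only these types of transactions" can look like, here is a minimal Python sketch that builds an IAM policy document in the standard JSON shape (`Version`, `Statement`, `Effect`, `Action`, `Resource`). It is not what Stackery generates; the `scoped_policy` helper, the account ID, and the secret name are hypothetical. `secretsmanager:GetSecretValue` is a real IAM action.

```python
import json
from typing import List


def scoped_policy(resource_arn: str, actions: List[str]) -> dict:
    """Build an IAM policy document allowing only `actions` on one resource.

    Hypothetical helper for illustration; wildcards are rejected so the
    policy stays scoped to a single resource and an explicit action list.
    """
    if "*" in resource_arn or any("*" in action for action in actions):
        raise ValueError("wildcards defeat the point of a least-privilege policy")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": resource_arn,
            }
        ],
    }


# Made-up ARN: the account ID and secret name are placeholders.
SECRET_ARN = ("arn:aws:secretsmanager:us-east-1:123456789012:"
              "secret:example-db-credentials")

policy = scoped_policy(SECRET_ARN, ["secretsmanager:GetSecretValue"])
print(json.dumps(policy, indent=2))
```

A document like this would be attached to the function's execution role, so the role can read that one secret and nothing else. Narrowing the action list further is the "go even further" step an ops person might take.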

Jeremy: Right, yeah, and actually I just talked to Matt Coulter who created the CDK Patterns site, works at Liberty IT, and his approach to building a lot of these CDK Patterns was to encapsulate a lot of that operational stuff that a developer might not want to think about so, you know, a developer can launch an API gateway with all of the security and all the endpoints secured and you don't have to worry about setting all that up. But someone still has to know what that is and manage those constructs and do some of that stuff which I think is interesting.

So I agree that the developers don't necessarily want to think about some of this stuff. I think a lot of them do, right, and I think that also is dependent upon the size of the organization. I think if you think of any small startup, any good small startup, you know, has that one person, right, who knows how to code, they know how to set up servers, they know how to do all these other things, right, and they can come in and they can do that full thing, that full stack piece of it. I think you see that with serverless as well. But if we go back to the enterprise for a second, because this is something where I love the idea of serverless in the sense that it frees up time to think about one other thing that goes well beyond observability, it goes beyond reliability, it goes beyond billing, it goes beyond all these other things, and that's this idea of resiliency.

Right, and I think that is one of those things where, when you were building single stack monolithic applications, it was like the server's down, the system's down, right? Everything runs together, you know, if something fails everything fails. And we've seen, as we went to service-oriented architecture, as we moved into microservices and things like that, this idea of resiliency within distributed systems has become hugely important, and I'm sure you're familiar with chaos engineering and all that other stuff that's going on there. So what are your thoughts on that? Where are the opportunities for developers? Because again, you can't just say, oh, we're going to just flip a couple of switches and now we're resilient. I mean, you have to build things into the code, right? There need to be parts of your application that can react, that can reroute, that understand things like circuit breakers and some of those other things. So where are some of those opportunities for ops and devs to work together to build more resilient systems?
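For readers who have not met the circuit breaker pattern mentioned here, the following is a minimal sketch in Python. The class and parameter names (`CircuitBreaker`, `max_failures`, `reset_after_s`) are made up for this example, and a production version would also need thread safety and shared state across function instances, which this sketch does not attempt.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when calls are being short-circuited instead of attempted."""


class CircuitBreaker:
    """Fail fast after repeated failures, then retry after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the cool-down can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("downstream marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure now re-opens the circuit immediately.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit again
        return result
```

A caller wraps each downstream request, for example `breaker.call(fetch_orders, customer_id)` where `fetch_orders` is whatever function talks to the dependency, and treats `CircuitOpenError` as the signal to reroute or serve a fallback.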

Ryan: I think that's … to your point, though, first on the sort of unicorn human who does exist in every one of these startups, and they do exist in those enterprises. I do think of those people as special, but they are generalists, really, right? They know enough of that full stack to get going, but there's only so much time in a day. So unless they're bringing so much experience and they've kind of iterated over time and become really specialized in every one of those categories, there's parts of it where they’re like, I've done enough to get it working and I haven't thought through maybe that security angle. Maybe it's like the cost-benefit angle. And then you're talking about resiliency. I think that's another space where, sure, you're going to get somebody who's written a front end, they’ve written the API, they've set up the infrastructure layer, but did they also go and write the sort of scale testing suite that runs as part of the CI/CD pipeline to check against regressions, when someone's code path changes and suddenly a thousand requests per second breaks the app whereas before the app was responding well?

That ... whether that is the developer or ops, I'm a little less opinionated on which side of that role, ‘cause I think we’re starting to talk about: are we exercising the code path? Are we exercising infrastructure and how it’s handling that? And serverless I think adds a new wrinkle to that where probably both roles need to get a little more collaborative. You know … let's say you've got a bunch of the patterns that you've got up on your site. One of them's a DLQ pattern, you've got then sort of the data layer pattern and you're causing these requests to kind of flow through this distributed architecture, 10 different Lambdas, you know, a bunch of different APIs. A developer might be really good at helping any of those individual code paths get exercised. They know like yeah, I can exercise this Lambda a whole lot. Everything's good. But what if it's the reaction between six different pieces that causes the whole thing to come down and the DLQ gets filled up? Like, that is where an operations person I believe brings some perspective like, again, that Rube Goldberg analogy: they’re thinking through more of a distributed architecture consequence and they're not necessarily thinking through, “Oh, hey, this library call that you're making in this Node.js function, that's going to cause, you know, a rate limit,” right?

And that's where I think both of these teams having a shared conversation about resiliency gets you a fuller picture. And again, you have to codify that, right, you have to put it into either some, you know, one-off test that you're running when you make big changes or ideally part of your delivery pipeline where, as changes go out, you're stress testing things to see whether they meet your expectations or not. And you're doing that in a safe space. And then I would just give you one more pitch for serverless … in the VM world, so many enterprises using Puppet, they're making such trade-offs for their scale test environment because they only have so many VMs to go around, only so much hardware to go around. Or the orchestration of those sort of scale tests on VMs is really complicated. In serverless that's not a problem. It's just dollars. So how much do you care about scale testing? Okay. I'm going to spend those dollars for five minutes every time I open up a pull request. If I don't care about it that much, maybe I do it less frequently. All right, but you’re no longer bottlenecked by where are those VMs for the scale tests going to come from.
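
To make the idea of codifying a scale test in the delivery pipeline concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `fake_handler` is a hypothetical stand-in for a call to a deployed endpoint, and the request count, concurrency, and latency budget are made-up numbers, not values from the conversation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_handler(request_id: int) -> int:
    """Hypothetical stand-in for calling a deployed endpoint."""
    time.sleep(0.001)  # simulate a little work
    return 200


def timed_call(handler, request_id):
    """Invoke the handler once and return (status, seconds elapsed)."""
    start = time.perf_counter()
    status = handler(request_id)
    return status, time.perf_counter() - start


def run_scale_test(handler, requests=200, concurrency=20):
    """Fire `requests` calls across `concurrency` threads and summarize."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda i: timed_call(handler, i), range(requests)))
    latencies = sorted(elapsed for _, elapsed in results)
    errors = sum(1 for status, _ in results if status != 200)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {"requests": requests, "errors": errors, "p95_seconds": p95}


def passes_gate(summary, max_p95_seconds=0.5, max_errors=0):
    """Pipeline gate: report failure when the error or latency budget is exceeded."""
    return summary["errors"] <= max_errors and summary["p95_seconds"] <= max_p95_seconds
```

A pipeline step could call `run_scale_test` against an ephemeral environment and fail the build when `passes_gate` returns `False`; how often it runs is the cost trade-off Ryan describes.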

Jeremy: Right. No, that's definitely true. So, and I think you're right, by the way, that sort of developers being more responsible for executing the code paths with operations or potentially another hat which could be architects, right? I mean that because again, I think if you think of what happens when that DLQ backs up, there are business rules in place, right? So that partially has to do with the business owner in terms of what do we want to do with backed up DLQ? How much can we do load shedding and some of these other things, what are the ways to do it? There's a lot of roles that need to collaborate, I think, in a modern infrastructure where you need to think through all those things. Definitely take your point on the generalist though. I don't even know sometimes if I started recording these podcasts because I'm too busy doing a million other things. I did record this one though. So that's good.

All right, another thing just quickly because you did touch on security. Where does that fall, right, because, again, a lot of this with serverless, because of the shared responsibility model and some of these things, a lot of that security falls now to application security. So, you of course have IAM roles and you have permissions and all other things like that that have to happen within the infrastructure and securing the infrastructure, and that's definitely I think on the ops side of things, but what about application security and how much does DevOps come into play there, or DevSecOps, I guess? You know, where does that responsibility lie?

Ryan: I think that's one of the more interesting questions of this movement, ‘cause the thing that excites me the most about serverless, and this is a little biased because one of the problems Puppet was trying to help solve for its customers as I left was this sort of vulnerability management on VM infrastructure. And that's probably in my mind one of the last miles for sort of operations like VM based operations. If you totally embrace configuration management you no longer have sort of a provisioning configuration problem, no longer an orchestration problem, like you have new projects to apply those sort of techniques to, but it's no longer this sort of thing that's always in your way. And sort of taking care of patch cycles and fixing vulnerabilities on the infrastructure side is still a huge problem in the VM space, right? The vulnerabilities are outpacing patch cycles. The IT sec team is always putting pressure on the IT ops team to get faster. They're staging spreadsheets between each other to say like, well this patch needs to happen on that machine and you ... I don't know if you'd be surprised but I'm still shocked at how much time teams spend exchanging spreadsheets to talk about work that had already been done by an automated process. Right, like that time-spend alone let alone how much more there is to do.

Now, I think one of the things that gets me excited about serverless is saying, well, now we're outsourcing that responsibility and sort of that work to the managed service provider like AWS. Now the ops team clearly has a ton of work they could be doing; should they be taking more of an active role in the sort of application security space? Developers have a ton to do too. So that's where I'm not quite sure where that line goes, but I think an ops team who has been thinking about prioritizing which vulnerabilities to tackle first, they're the ones generally exercising the exploits in conjunction with the IT sec team. The mindset they bring I think is interesting. So maybe that looks more like, in the serverless world, building up the tooling to help reinforce these things. Maybe they're the ones, you know, installing Snyk or they're kind of building up their own tools and they're making those part of the CI/CD pipeline, and the application team is the one thinking about their libraries, their dependencies on npm, making sure that those are constantly cleaned up, or maybe the ops team is fitting into that because they no longer have the patch cycles. I don't think I have sort of a certain answer for you there. But I think it's such an interesting space that, as someone with a lot of data on the internet, I feel like is a really important question to get answered, and certainly both roles have a lot to do there, right?

Jeremy: Right. Yeah, no, it is a challenging space and I think you've got some tools that have been developed as well. I mean, even just putting a WAF in front of API gateways or CloudFront or some of these things to protect against sort of basic application-level attacks. And then also again just codifying security into something like Cognito or Lambda Authorizers and using those sort of things as a way to lock down endpoints and again, you know, containing the blast radius, all these best practices that you have, I think a lot of that falls on the operations people and on the architects, and then as that is given out to those developers, it gives them a little more freedom to make some mistakes, not as many as maybe you could in some other places with better, you know, network tools.

But anyways, all right, let's take this a little bit further into operations in the serverless world because I kind of said at the beginning, you know, I feel like serverless is more of an operational thing than it is like a development style or anything like that, because really what you're doing is you're automating a lot of those operations for you. So what does, you know, what does operations look like in a serverless world, I guess?

Ryan: So this is certainly going to show some of my biases, but I think it does come from people who are, you know, coming from a VM automation world and have become familiar with infrastructure as code and the power that has to codify the need an operations team has on the infrastructure that meets the development and business needs, and then are familiar with sort of the automation tool chain around that, whether it's orchestrating, you know, changes across devices that require a rollback strategy and require sort of sequencing of, you know, updates and traffic on the load balancer, or it's something more mundane of just provisioning infrastructure. Those things still apply in serverless and now you're just changing what is the infrastructure as code format, right, instead of Puppet or Chef or Ansible, you're talking AWS SAM or CDK or Serverless Framework or whatever. And, like, HashiCorp's Terraform is big in that space, right? That one's bridging the VM side of the world and the managed side of the world. That role then is still fundamentally, what are the services my development team needs to solve that business problem. Right?

That is, writ large, number one, the operations role in serverless. I think there's a little bit of a red herring here that we should address that I think is very much the same red herring as we saw in the VM movement, which is developers don't need ops anymore because they can go and self-service stuff, right. As we've been talking about through this conversation, there's so many parts of the responsibility that a developer could go and learn and could spend their time in, but at what trade-off, right. What are they not doing in their development life? So that was the same thing that happened in the VM movement where I would walk into a company that was considering Puppet and they were responding to developers getting purchase cards and going to AWS and renting EC2 instances, doing whatever they wanted to those instances, a vulnerability would be exploited and the business would panic, right? Classic case, happened at every business I walked into.

Same thing’s happening with serverless. People can just get an AWS account, provision a CloudFormation template and they're off to the races. But do they think about the IAM roles? Do they think about the scalability and reliability of that infrastructure? Are they maintaining changes and orchestrating those in reliable ways so the customer traffic isn't dropped every time you redeploy that CloudFormation template? Those are the things that I think ops is still responsible for, and there's still opportunities for businesses to say we don't need that role anymore, we’re just going to buy it from Amazon. It's not that simple, right. You're paying Amazon for a lot of that responsibility. But those ops professionals still need to come in and think about the reliability of the service, the cost of that service, and how to secure it, and it's just the knobs have changed colors and changed sizes. Still the same work.

Jeremy: And I totally agree. I think that idea of, you know, am I automating away my own job, that's the kind of thing where it's like if you are spending time doing the same thing over and over again ... I forget the calculation of this, this is outside the scope of technology, but essentially it's like when you're training somebody, like if it takes you five times as long to train somebody one time as it does for you to do that one job, you should invest that time training them if that's going to be a repeatable thing. It's the same thing with automation: if you have to do the same thing over and over and over again, even if it takes you five times longer to automate it the one time, you know, that's for that one time, but then every time after that it's going to be taken care of for you. And time is the one thing for any human being that you cannot get more of, right, unless you want to work 24 hours a day, which I don't think anybody does.
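
The break-even arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is a simplified model, not a calculation from the episode: it assumes each automated run costs no human time, and the function names and minute values are illustrative.

```python
import math


def break_even_runs(manual_minutes: float, automation_minutes: float) -> int:
    """Number of repetitions after which the automation has paid for itself."""
    if manual_minutes <= 0:
        raise ValueError("manual_minutes must be positive")
    return math.ceil(automation_minutes / manual_minutes)


def minutes_saved(manual_minutes: float, automation_minutes: float, runs: int) -> float:
    """Net time saved after `runs` repetitions (negative means not yet paid off)."""
    return manual_minutes * runs - automation_minutes
```

With a 30-minute task that takes 150 minutes to automate (five times as long), `break_even_runs(30, 150)` is 5, and after 20 runs `minutes_saved(30, 150, 20)` is 450 minutes.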

All right. So then I think that makes a ton of sense and I'm totally with you on that stuff. So what about that extra time, like, where can you then put that toward? Like, so, an ops person in the serverless space, what should you be spending your time on?

Ryan: Well, if you don't mind, I'll take this all the way back to my early career and give you sort of a story to illustrate how I chose to spend that time differently once I started automating. So one of my first sort of roles was at Penn State, I was an IT administrator for the central IT unit. So, it was a pretty small staff and Penn State's infrastructure is wholly, or at least at the time I worked there, wholly owned by the university, so they did facilities all the way up to the software layer, right? So there's a loading dock where Dell boxes came through. I was responsible for racking those boxes, giving them power. Now, there's people who are responsible for the actual power and cooling of the facility and there's people responsible for the network of the facility, but I had to rack those machines, connect them, and then provision them, configure them, maintain them over time, right, that whole full-stack operations role, and the sort of data center was below the office floor, right?

So we had a hallway and one of those little spiral staircases that went down into the data center and if infrastructure went down for some physical problem, there's people running down the hall, going down the spiral staircase into the data center where they're going to resolve the problem. So that was me showing up in this world, really small team responsible for a statewide infrastructure. So think tens of thousands of faculty, students, and staff, their email, their web infrastructure, their shared file storage, their ability to authenticate across the university, right, so their entire identity and their sort of permission systems. All of that was managed by a really small crew who divided into operating system lines. So there was the AIX crew, there was the Red Hat Linux crew of which I was a part, and there was a Windows crew, right? It was really a few individuals responsible for these platforms for the whole university.

None of it was automated. And this is a time when universities are starting to shift from thinking about IT spend as being this cost center they had to control to being the thing that was giving them a competitive advantage as a university, right? So universities with a better, you know, global learning system or just a better sort of infrastructure were attracting better candidates. So there's a lot of pressure on this team as I'm coming in to update our services, to do more for different, you know, different colleges within the university who needed certain things from the central IT staff, and then there was this pressure of public cloud, right? All of these sort of distributed colleges are starting to ask for permission to use public cloud so they could serve their business needs, so they can compete nationally. They were dependent on the central IT crew to provide an alternative to going out on their own. So I was asked to kind of maintain a Samba infrastructure. We had the privilege of using IBM's GPFS file system. If you've ever heard of that, it's like this Fibre Channel distributed storage network that goes across the whole university.

So it's like, think of … I’m trying to illustrate here that there's all this plumbing to be responsible for, and most of the IT staff, really skilled people, really deeply committed to what they were doing for the University, all day was firefighting. Service was down, changes needed to happen, patch cycles were behind, every single day, every single week: firefighting. So I started to bring in Puppet to just get out of my mindset of firefighting, right? Now, I'm doing the same thing as everybody else, just in my little corner, and the more I started automating there, the less I'm firefighting, the more I'm getting ahead on some of the service requests. Oh, we need to upgrade the Samba cluster from five, whatever to five this. Okay, I can tackle that now because all of this stuff over here is running stably and if I need to add capacity, I just put another blade into the BladeCenter, Puppet provisions it, connects it to the load balancer, all good. Right? Like I'm applying those sort of considerations, but now automation is taking care of repeating those considerations.

So we had this person who was responsible for one of the more interesting technologies that I've worked with called Shibboleth, and it’s like a federated identity system that essentially allows someone from University of Michigan to rent a library book from Penn State, right, like that kind of stuff, federated identity across universities. And the network engineer was responsible for this; the person who's kind of responsible, head of the networking crew, and his process was to kind of … and think of like a really old school sysadmin who does most things by hand, had one of those just gigantic clickety-clack keyboards, really sort of old school, I think it was like one of the early HP operating systems was his platform of choice. He had to take this XML document, ‘cause Shibboleth is shaped in XML, take it from the shared file system and kind of fiddle with it. And so you would see him and he would kind of peck type too, right, so he spent his entire morning peck typing XML structures to fill in this new identity they needed to add and then would get it wrong because XML is hard to do by hand and then just rinse and repeat the whole day. And all he wanted to do is just add this, you know, sort of ARN essentially, that was like identifying this other federated identity that allowed them to go and rent out a library book. And so I spent one week on a project with Puppet where Puppet would take the XML document, put it onto his computer, he would edit it, save it back, Puppet would then make sure it was valid XML, ship it if it was to a new Shibboleth cluster that was nonproduction and give him back a prompt that said like go ahead and do your test validation, where he would go and basically try to simulate this federated request. And it changed his life.

Like, my life was freed up by this automation. I spent a little bit of time giving him a transformational life where he wasn't spending his whole day editing XML. He was just filling in a thing, saving it, he got to validate it, and then he would get to say go ahead and ship it and I would send that XML document from the nonproduction cluster to the production cluster. And so that's a bit of a convoluted story for you. But to me, it's like this ripple effect of how, with everyone firefighting, I managed to free up my space through automation and I could start giving those gains to other people, and that to me is the empowerment of automation. Whether that's serverless or VM based doesn't matter to me. I was applying operations knowledge to make people's lives better.
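
The validate-then-stage step in that story can be sketched roughly as follows. This is not Ryan's Puppet code: it is a hypothetical Python stand-in that only checks that the XML is well-formed (not that it conforms to any schema), and `deploy_to_nonprod` is an assumed callback representing whatever ships the file to the nonproduction cluster.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def stage_metadata(xml_text: str, deploy_to_nonprod) -> str:
    """Ship an edited document to nonproduction only if it parses."""
    if not is_well_formed(xml_text):
        return "rejected: invalid XML, fix it and save again"
    deploy_to_nonprod(xml_text)  # hypothetical deployment hook
    return "staged: run your test validation, then approve for production"
```

The point of the design is that a hand-editing mistake is caught at save time, before anything reaches a cluster, and promotion to production stays a separate, explicit approval.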

Jeremy: Right, and I love that story because it's like deja vu listening to that. I can think of like thirty other times in my life that that has happened to me or something similar, and one of the things that you have when you are so busy firefighting like you said is that you get to the end of the week and you might have a development team that’s like, well, what did you do this week, guy? I don't know. I think we fought fires. I think we fixed this cluster, whatever ... and if you're trying to stay on schedule, and you're trying to release new features, you're trying to, you know, go back and refactor things and add functionality or get things stable again, the more time you spend, you know, just fixing things that are broken without automating them so that they’re long-term fixes is just a complete and utter waste of time. So I think that is brilliant.

All right, before I let you go, I do want to talk to you about the jam stack because I think of, you know, web applications, especially in the public cloud … like, I mean, I don't know too many applications now that don't somehow touch the internet in a way, right? Even if it's a private SaaS or whatever, like there is information flying around the internet. That's how we're building applications now. The jam stack, for people who aren't familiar, that's, you know, static site hosting using JavaScript with APIs and Markup, or you know, there's a million different ways to do it. We had an episode where we talked with Guillermo Rauch about Vercel and what they were doing there, and again that was very much tailored to the front end of it. It's like a front end developer being able to quickly launch something and then have a little API if they need to. But there is a significant amount of depth behind the jam stack that an operations team can take advantage of so, I'd love to get your perspective. I know Stackery has done some things with the jam stack lately. So what are your thoughts on the jam stack and how that can help with operations?

Ryan: Well, thanks for bringing that up. I'm pretty excited about serverless jam stack in particular. And so maybe I'll bridge from that Penn State experience just for a moment. One of the other things I was responsible for there was web hosting for any student, faculty, or staff. Sometimes it would be work, you know, for assignments in their class, sometimes their own personal portfolios. And that was sort of just basic web infrastructure sitting behind a cache system that was running on an NFS mount for the shared IBM file system that was running University-wide, and from the ops perspective, we spent so much time, especially because it wasn't yet automated, keeping that running and keeping it security patched and just keeping it maintained that we totally missed the boat on that early CMS wave, like the Drupals of the world, the early WordPress movement, and all of these faculty, students, and staff just started leaving this infrastructure to run their own clusters of this stuff, and then they would have these scaling challenges and it would all start kind of coming in this sort of tornado of fire through the University where people were saying like why everyone left your infrastructure for this new stuff. It's not meeting their needs anymore. Will you run it?

And that to me is an example of how operations teams who aren't freed up to have that time miss a business need and it ends up causing way more work. And so I've been thinking about that experience a lot with what Stackery has been doing in the jam stack, because when you kind of look into AWS as just one managed service example, it's phenomenally easy to ship the sort of frontend characteristics, right? They have the CloudFront CDN, which is global, at the edge, in cities, where people are bringing traffic. You just point it at an origin server. Could be your own. Well, let's say it's an S3 bucket. You just kind of deliver your hosted content to this thing and it's everywhere. Right? As someone requests it, it's brought down to the cache, subsequent requests are super fast. There's no need for it.

Now, okay, if you're only serving a static site, maybe it is that sort of quintessential jam stack that has client interactivity through JavaScript, but mostly it's just the static pages. You're great. What happens then in the enterprise, and that's where kind of Stackery has found this jam stack sweet spot, there's then, okay, what if I want to run my own data layer? I was listening to a talk, you'll be able to find it, maybe I'll drop it for you in the show notes, from InfoWorld, where this person at PayPal was talking about their jam stack experience where they were implementing a jam stack architecture for their sort of peer-to-peer payment system. They're not using Stackery, I'm just bringing this up as sort of a public example of this. They are delivering the static site, which is essentially the shell for this mobile payment system, right, and then dynamically on the client side, they're pulling in the user’s avatar. They're pulling in their balance. They're pulling in, you know, transaction logs, and they went from running sort of their own cluster to running this in a more serverless architecture.

And especially because most of the content is those static assets, think of the whole HTML shell plus all the CSS plus all the sort of supporting JavaScript, all of that being delivered statically through the CDN means that, like, it pretty much hits instantly on the client, then JavaScript is making backend calls to fill in that transaction history. AWS serverless takes care of that infrastructure to make those static assets instantly available, but then you have sort of a build chain problem, which I'll come back to in a moment, and then you also have, well, if you're sort of a payment platform like PayPal, you're running a pretty robust data system. You're running many, many APIs. You want to be able to express those in some way, right? And then you also need to orchestrate the change of those APIs with the change in the frontend for that JavaScript to leverage those new routes or take advantage of new data. And that's where I think Stackery’s been applying this approach, where so many of our customers were running the backend and then they were going out to the Vercels and Netlifys of the world to ship the frontend and those pieces were disconnected. And we recently, earlier this year, put out a delivery platform so that any sort of serverless change can go through CI/CD. Stackery’s aware of those stacks, so you open up a pull request, we'll spin up an ephemeral version of that stack that you can run your load balancing scale test against, you can verify in sort of a preview URL to see everything's working out right, and then you could, of course, deliver that, promote that to production where you're, you know, updating your CloudFront distribution.

So that merger is pretty interesting as your backends and frontends get more complicated, but I want to take that one step further and then I'll shut up. That CMS thing, right, that pull from the early University, right. If you're building sort of the PayPal eCommerce app, it's one thing to just kind of ship all of that in your build chain, deliver it to the CDN, and everything's taken care of, your backend’s covered too. But what if you wanted to offer a CMS to your teammates? Then you're back to managing VMs again. One of the things I've experimented with recently, and then we have one of our healthcare customers picking up for their own needs, is the Ghost CMS platform. It's the sort of open publishing platform that provides sort of a Medium-like experience for editing, but it runs either on your own VMs or runs as a Docker container. Well, how do we make that serverless? The architecture I put together is using ECS’s Fargate system to run serverless containers, and in Stackery’s AWS SAM templates you can specify the definition of that task that gets populated from data about the sort of whole build system around it.

So for instance, the secrets get pulled in, the Aurora database cluster is expressed. So this sort of thick CMS client can be provisioned as part of the same stack that is a Gatsby-driven frontend, which is more of what you would think of as a jam stack, and every time you publish new content in Ghost or you make a change to your frontend it triggers this build loop where the Gatsby side runs in CodeBuild, another serverless environment, that pulls data from that Ghost infrastructure, that then, when it's completely built successfully, ships off to the S3 bucket, which updates your CloudFront CDN. Right, that whole sort of chain of activities happens and then you can spin down this Ghost cluster to the size that you need for your marketing team or other folks who want to work with the CMS, and your customer infrastructure is super cheap, super fast, and super secure, right. So to me, it's like it touches on so many things that make serverless powerful in a way that doesn't sacrifice a complicated backend, whether it's something as simple as a Ghost CMS or something as advanced as what kind of PayPal has been doing for their payments.
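
The first link in that build loop, a publish event kicking off a CodeBuild run, might look something like the Python sketch below. It is not Stackery's implementation. The webhook payload shape, the `site-build` project name, and the `FakeCodeBuild` stand-in are all assumptions for illustration; the only real API relied on is the CodeBuild `start_build(projectName=...)` call, which a `boto3.client("codebuild")` object provides.

```python
class FakeCodeBuild:
    """Local stand-in that mimics the one client call used below."""

    def __init__(self):
        self.started = []

    def start_build(self, projectName):
        self.started.append(projectName)
        return {"build": {"id": f"{projectName}:{len(self.started)}"}}


def handle_publish_event(event, codebuild_client, project_name="site-build"):
    """Start a site rebuild when the CMS reports newly published content.

    `event` is a hypothetical webhook payload; `codebuild_client` is expected
    to behave like boto3.client("codebuild").
    """
    if event.get("type") != "post.published":
        return None  # ignore events that do not change published content
    response = codebuild_client.start_build(projectName=project_name)
    return response["build"]["id"]
```

In a deployed version the build itself would generate the static site and sync it to the S3 bucket behind CloudFront; passing the client in as a parameter keeps the handler testable without an AWS account.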

Jeremy: Yeah, no, and I'm with you there. I mean, I love the idea of using serverless obviously as the backend for a jam stack site because it gives you so much more flexibility and scalability. Right? But the more you can push to the static side of things obviously the better. I'm still waiting, and I know that Vercel was doing this, I think Amplify Console is working on this now, my biggest complaint with generating a static site, especially if you're pulling it off like a Ghost CMS or something like that, is the fact that you have to rebuild the entire site, and if you have a very, very large site, so think if you're running an eCommerce site with thousands and thousands of products and you have to rebuild every single page every time you, you know, make a change to one word on the, you know, on the customer, whatever, customer appreciation page or something like that, and everything has to rebuild. So I know some of these systems are working on only rebuilding parts of it, which would be really, really interesting, and detecting those changes. I think jam stack’s got a long way to go. I think it's just like right at the beginning of the sort of where we are. But I love this idea of static first. You know what I mean. Static first, serverless second maybe, but I think that's a really, really interesting thing.

So, I think we've covered most of it. Is there anything else that we missed on the jam stack or the operation side of things?

Ryan: I think we mostly covered it. I do want to touch on one misnomer about that static app. I think ... to me the important bit there is to say, ship as much statically as you can, and especially if you’re wrapping this whole architecture in your build chain, and you're using serverless infrastructure, it's very simple to just say anytime I make a change go and deliver it. You're right that there is one outstanding problem this whole space needs to solve, which is that sort of incremental builds instead of rebuilding everything every time. Tying takes care of that to a bit but it is annoying, and the larger you grow the more your lead time for change expands. But people get scared away from the static thing. So I just want to kind of encourage listeners to consider it's saying most of the things that would just be generated anyway on demand by a server farm somewhere are instead shipped statically to the CDN, which means that first touch your customer gets on your web application is really, really fast and that matters so much.

There was a talk at Netlify jamstack conf recently, a couple of them, both from eCommerce vendors and from other folks, who were talking about how that first contentful paint on a website is really the decider on sales and like people are going to leave you at such a high rate. It really matters to your business. Whether that's a service you're offering or an eCommerce site where you're selling goods, that matters a lot. And so the static app isn't saying, well, you can't have any interactivity on this site. It's just saying take all the bits that aren't interactive and make sure those are as quick as possible and then fill in the interactive gaps, and you have this choice in serverless that I just want to cap on here, which is either obviously JavaScript running on the client browser or, say, a Lambda function you interact with on that static HTML shell that really quickly interacts with backend infrastructure that returns data back to that client real, real fast, right, as an alternative to that client-side JavaScript. Those kinds of possibilities are really exciting and make it more of an interactive app that happens to be built during the development cycle. And then a lot of it is statically delivered to a CDN.

Jeremy: And you're totally right about that first paint. The first paint is so important. And again, just to bring this all back to resiliency: if you have an eCommerce site and a page loads that shows you the products, it shows you the pictures of the product, it shows you the description, maybe even some of the reviews are statically cached, and those can be delivered immediately. If for some reason the price doesn't load, or maybe the availability doesn't work because there's a subsystem that's down and that information isn't loading, you've still been able to provide some bit of information to your clients. Maybe you give a message: "Oh, we can't load the price right now. We can't load the inventory right now." But if they like the product enough because they're able to see it, then there's a chance they might come back and buy it, save it to a cart, something like that. Those are opportunities that you miss if you don't have that resiliency built in.
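The graceful degradation Jeremy describes can be sketched roughly as follows. This is an illustration under stated assumptions, not anyone's production code: the `StaticProduct` shape, the `Settled` type, and the `renderProduct` and `settle` helpers are hypothetical names introduced here. The static fields always render, and each dynamic field falls back to a message if its lookup failed.

```typescript
// Content baked into the page at build time (hypothetical shape).
interface StaticProduct {
  name: string;
  description: string;
}

// Outcome of a dynamic lookup: either a value or a failure.
type Settled<T> = { ok: true; value: T } | { ok: false };

// Wraps a promise so a rejected lookup becomes data instead of an exception.
async function settle<T>(p: Promise<T>): Promise<Settled<T>> {
  try {
    return { ok: true, value: await p };
  } catch {
    return { ok: false };
  }
}

// Builds the lines shown to the customer. Static content is always present;
// price and inventory degrade to a message when their subsystem is down.
function renderProduct(
  product: StaticProduct,
  price: Settled<number>,
  stock: Settled<boolean>
): string[] {
  const lines = [product.name, product.description];
  lines.push(
    price.ok ? `Price: $${price.value.toFixed(2)}` : "We can't load the price right now."
  );
  if (!stock.ok) {
    lines.push("We can't load the inventory right now.");
  } else {
    lines.push(stock.value ? "In stock" : "Out of stock");
  }
  return lines;
}
```

In a page script, one might call `settle(...)` around each fetch to the pricing and inventory endpoints and pass the results to `renderProduct`, so one failing subsystem never blanks the whole page.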

Ryan: Yes, spot-on. Awesome.

Jeremy: All right, Ryan. Thank you so much for spending the time with me and for all the work that you've been doing over at Stackery. I love what your team is doing over there. I work with Farrah quite a bit for a number of these ServerlessDays things and whatever, so absolutely awesome work that you are all doing there. So, if people want to get in contact with you, find out more about the stuff you're working on or more about Stackery, how do they do that?

Ryan: You can find me on Twitter. It's @ryanycoleman on Twitter and then of course, we've got stackery.io if you want to see what we're doing for work.

Jeremy: Awesome. All right, and then there's the blog for Stackery, which is called Stacks on Stacks. I love that name for the blog. And then you have a sample website for serverless Jamstack called jamstackery.website, right? That's sort of just a demo site.

Ryan: Yeah. That's right. We put that up after Jamstack Conf a few weeks ago as sort of a recap. We enjoyed a lot of these talks, and it's a way for me to further exercise this Ghost and Gatsby CMS architecture I was talking about. So we are able to edit and write the content in Ghost, but then the site you interact with is a Gatsby-generated site that's delivered on Amazon CloudFront, of course delivered by Stackery. But it's a cool architecture that anybody can run in their own AWS accounts, and that website just kind of shows it off with some recaps of some cool talks I hope people check out.

Jeremy: Awesome. All right. Well, we will get all that into the show notes. Thanks again, Ryan.

Ryan: Thank you. It's been a treat. Take care.

This episode is sponsored by New Relic and Epsagon.