Episode #74: Measure and Increase Developer Productivity using Serverless with Vadym Kazulkin and Christian Bannes
November 9, 2020 • 67 minutes
On this episode, Jeremy chats with Vadym Kazulkin and Christian Bannes about how cognitive load affects productivity, why writing more code increases technical debt, and how building evolutionary architectures with serverless lets you focus on business value.
Watch this episode on YouTube:
About Vadym Kazulkin
Vadym Kazulkin is Head of Technology Strategy at ip.labs GmbH, a 100% subsidiary of the FUJIFLM Group, based in Bonn. ip.labs is the world’s leading white label e-commerce software imaging company. Vadym has been involved with the Java ecosystem for over fifteen years. His focus and interests currently include the design and implementation of highly scalable and available applications, container and orchestration technologies and AWS Cloud. Vadym is the co-organizer of the Java User Group Bonn and Serverless Bonn Meetup, and a frequent speaker on various Meetups and conferences.
Christian Bannes works as Lead Developer at ip.labs GmbH and has been working in the professional Java Enterprise environment for over ten years. In recent years, he has been working with AWS Cloud and especially with serverless applications. He is particularly interested in distributed architectures, domain-driven design, and functional programming.
Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today I am chatting with Vadym Kazulkin and Christian Bannes. Hey, Vadym and Christian. Thanks for joining me.
Vadym: Hi. Thanks for having me.
Christian: Yeah. Thanks for having us.
Jeremy: Awesome. So you both work at ip.labs in Germany. And so I'd love to talk a little bit about what IP labs does and what you two are about. So let's start with you, Vadym. So you're the head of Technology Strategy. So why don't you tell the listeners a bit about your background and what ip.labs does?
Vadym: Yeah. I'm a Ukrainian native, but I live for 20 years now in Germany. I have been working with Java for 20 years but since three years, I'm involved in the migration stuff and AWS as a cloud provider of our choice. And I'm a part of the serverless community since two and a half years, involving me heavily in all this stuff and presenting ideas and our experiences mainly also with Christian. So this is what I do.
And ip.labs is software provider for designing and purchasing of photo products like prints, calendars, photo books, just where you can print your emotions. So they are part of the Fujifilm group, Europe. So founded 60 years ago, approximately 80 colleagues, 30 developers.
Jeremy: Awesome. And Christian, you are a lead developer there. So why don't you tell the listeners a little bit about your background?
Christian: Yeah, right. So I'm a software developer at ip.labs, also working about 20 years, almost only with Java technologies. So I'm working as scrum team. I think about three years ago we adopted serverless and we switched to TypeScript because it fits our need more than Java. And yeah, we are quite happy with serverless.
Jeremy: Awesome. All right. So I have seen the two of you give a presentation. I know you've given this presentation a few times, about measuring and increasing developer productivity with serverless. And this is always to me a fascinating topic, because you see a lot of claims, right? And a lot of it is very anecdotal. I mean, it's like, "Oh, yeah. We were able to move faster with serverless." Or you hear things like that. But the two of you actually sort of did some research on this, dug into the background of this and really outlined this well, and I think it should be super important, or it's super important to share with listeners so that they understand why serverless is such a powerful productivity booster for software development.
So I'd love to start, like maybe way, way back in the beginning, and just talk about software development in general. So when you're building applications, and you're trying to create whether it's new stuff, and you're trying to build greenfield applications, or you're trying to work on legacy applications, what are the challenges when you are trying to, as a software developer, what are the challenges that you have to face?
Vadym: So I think that the best model to explain this is the cognitive load. And this is the term coined by Matthew Skelton and Manuel Pais in their recent book, Team Topologies. And the cognitive load is the total amount of mental effort being used for accomplishing the task. So accomplishing the software development tasks. And then they differentiate between three of them: intrinsic, extraneous, and germane. And probably intrinsic, is very easy to understand, because it's how you write Java class, TypeScript class, or use some framework of the day. So this is something that you can't offload, you have to learn this. So you have to own this also.
But then you have this extraneous load. And it's especially important to understand in our distributed world that many things are currently distributed. So just how to automate your tests, unit tests, integration, end to end web, mobile tests. How to build package, deploy, and run your application. How to configure, monitoring, alerting, logging, everything. So just operate and maintain infrastructure. So how to build fault tolerance system and resiliency and of course, security is also job number one. So just it's not only application security, but also preparation system, networking, hardware, everything. And just huge bunch and you haven't written even one line of productivity code, but you have to deal with all this stuff, probably. And I see a lot of companies which really struggle to deliver value if they go distributed because all of these challenges and distributed system are hard challenges.
Jeremy: Right. Yeah, then germane. So you've got intrinsic, you've got extraneous and then germane. What's germane?
Vadym: Germane is, this is your business logic. This your workflows, this is your core domain that you implement. So you have to become expert in the things which you are doing. So you have to understand what your core domains, what are your generic domains, like probably commercial system, e-commerce system, something like this. Every everybody needs payment, check-out, but it's probably not your core. So you'll have to reduce also this law to only things which really core and meant for your business. So these are three different cognitive load types. And if you think about this, so just, you want to reduce extraneous and germane load as much as possible to focus on the business stuff that matters.
Jeremy: Right. So the other thing we kind of talking about in here though, is like, again, if you're spending all this time writing or working on this extraneous stuff, obviously, it's taking away from you implementing something, and actually being able to ship some product. And this comes back to productivity. So what exactly do we mean by productivity too, because that's probably one of those things where I think people spin their wheels a lot and you check off a lot of to-do items, but is even figuring out how to implement something like to automate a task or to build and package and deploy your applications. And that's not really being productive, is it?
Vadym: Yeah, being productive means regular shipping your product, which of course used by the customers, but I think that some something very obvious but in our time the productivity and the speed really matters. So you have to offload or you have to try to offload as much as possible to focus on shipping things which really call for your product.
Vadym: So I think that's the idea for definitely productivity, from the word product, probably just... Could be the word.
Jeremy: Makes sense, right? Yeah. So what about things that are holding us back, though. I mean, so you talk about extraneous things. And obviously, there are a lot of things that a developer has to be thinking about when they're writing their lines of code, or they're implementing something because everything they do, every line of code they write is going to have some impact down the road, like someone's going to have to maintain it or whatever. So what holds developers back from being productive?
Christian: So the problem is when you try to implement all the stuff yourself, and that's a problem that we had at ip.labs. So we try to implement really everything on our own. And so writing it is, would say, can be easy. So you have teams and they can do it really fast but the problem is that you have to maintain it for a long time. So we started our platform about 15 years ago and we implemented like payment and e-commerce systems and so on, which are actually not part of our core domain. And the problem is that all the stuff has to be maintained now for over 15 years, and that's a lot and you get more and more like technical debts, because you want to go on and on. You need new features but of course, you will have a lot that you have to maintain. And I'd say this, this holds you back a lot.
Jeremy: Right. And so let's let's talk about technical debt for a second. Because again, I think that people hear the term "technical debt," and they probably in the back of their mind they understand what it is, but how would you define technical debt so we can really get some concrete examples here?
So you probably have to update your web application server which may force you to update the version of your programming language and so on. This is some kind of situation that you are steadily forced to update thing. And if you own too much, then you also have to update too much. So there are also some funny situations which we experienced, we use some encryption algorithm and our payment provider forced us to update the strength of the key. We updated this, but this open source breakthrough threw the exception, and we saw, "Okay, this project can't deal with this key." So we searched for the newer version of the project but the project was discontinued. So we had to take another one and reimplement the whole thing.
So it was the perfect decision several years ago, and now we have to do this stuff and it's not very value-generating currently, but it's some kind of must have. This is security. And this is just a lot of things happen in our industry which forces us to do those things and I think developers like this migrations, like this upgrading, but generally, this is what holds us back from being productive with our product.
Jeremy: Yeah, right. And I love that idea of you know, saying technical debt is not suboptimal decisions, right? It's not like I said, "You know what, I'm going to cut some corners here." And that's going to give me a bunch of technical debt. Now certainly, that will give you technical debt if you make those types of decisions. But you're right, you can make a perfect decision at the time, right, and even thinking forward years down the road, you can say this will most likely be the right choice. And it's certainly the right choice at the time. And then you just end up with like you said, things go out of date. I mean, like they deprecate you know, node six, then node eight, and then eventually, node 10, for supported Lambda runtimes, you know what I mean?
So things are just going to start disappearing. So one of the things though, that I think contributes to technical debt is obviously, like you said, okay, you pick some service, some maybe open source package to do something for you. And then that package eventually goes away. If that package didn't go away, it was just a matter of upgrading it, well, maybe that's not too difficult, maybe the API doesn't change too much, maybe there's not too many technical changes, or breaking changes. But the bigger problem is, is that if it goes away, or if there are significant changes to the API and the interface, then you have to go and change code that you've written. And it seems like in almost every case, technical debt is highly related to the amount of code you have to write.
Vadym: Yeah. That's true, it's related to amount of code and it's also related to amount of dependencies used. Like open source project, programming languages, database drivers, web applications, sort of everything is dependency and everything will be changing, and it will be forcing you to upgrade. So this is some kind of the circle, so the only one solution is to own as much code as possible for you, and this code should have as little as possible dependencies, just enough dependencies, I would say.
Jeremy: All right, so now you're making decisions in you own some of the code, but you still going to make decisions, because again, you want to offload some of that undifferentiated, heavy lifting, as we talked about. So you are going to want to use some open source things or some third-party tools, managed services. So how do you then make a decision today that hopefully reduces the need for maybe re-architecting an entire system. Like how do you build code and build applications, so that you can upgrade them as things change with as little effect on the entire application as possible?
Christian: So one way to organize it is with evolutionary architectures. And the idea comes actually from evolution. So the environment constantly changes, so we have to adapt. And of course, some parameters are constant over time. But it's like the climate in Europe was different 10,000 years ago, and then the Ice Age ended and stood constant for a period of time and now it's changing again. And because it's changing, we have to adapt. And the same with architecture. So you have an environment and the environment is like business requirements and the technical environment, and both are changing. And because they are changing, you need to have the ability to change your architecture.
There's the saying, never touch a running system. And the assumption behind that is, if something stays the same, it won't break, right. But the problem is that even if your system stays constant, the world around you is changing. So for example, if you take a laptop, and you put Windows or Linux on it, and you close your laptop and put it into a cupboard. And say you leave it in the cupboard for two years, or four years. So what happens if you take it out again, and you open it. And what would happen is it would install a lot of stuff and update a lot of stuff. And maybe if you have interfaces to the outside world, maybe some things would even break.
So you didn't touch the system, but your system could break. And the reason is, the outside world is evolving, or is changing. So it's not possible today, to say, "Okay, I'll leave my system constant, and I don't touch it anymore." Because the outside world is changing. So you have to have some ability to change your architecture. And the idea behind evolutionary architecture is to build this ability in to this architecture. And change is easy, if it only affects a small part of the system. So if you need to change something, and you need to change the whole architecture, this is really hard. But if you change just a small piece of it, is possible to just change a small piece of it. It's easy to make a change. And evolutionary architectures actually have three components. And maybe then we will also understand why serverless is a really good fit for evolutionary architectures.
So the main three components for severless architecture is a fitness function. The fitness function is basically business requirements, maybe you can automate it like webpage should respond in like 200 milliseconds. And the architecture should evolve in this direction of your fitness functions. And the second thing is the so called granulum. And the granulum is the smallest deployable unit, or the smallest thing that you can change independently. And in serverless application, the smallest thing that you can deploy is a Lambda function. So it's really small, it's based on the function level. And this makes it easy to change. But there is another thing, that's also very important. And I think this is one of the most hardest thing to do right. And this is appropriate coupling. So now the question is, what's appropriate? And appropriate is things that belong together, should stay together.
So if you change one Lambda function, always when you change one Lambda function, you also have to change the other Lambda function. So does it really make sense to separate them into different deployable units, or maybe into different Git repositories. So when you make a change, you would have to check out maybe multiple Git repositories, you have to declare multiple lambdas. So this would make change really hard. So things that belong together, should stay together. But it's really hard to know what's appropriate. And I think you will make a lot of mistakes and lot of errors in this area. So during your evolution or during your implementation of your application, you will probably recognize that you had some parts that are loosely coupled, but should stay together. And maybe you would also recognize that things that are highly coupled, should better be loosely coupled. So as you get more insights about your architecture, you should always refactor your code to match the appropriate coupling and the granulum. This is a really hard thing, I think. But this is really important to get evolutionary architecture, right.
Jeremy: Right. You know I totally agree. I mean, that's one of those things where understanding where the bounded contexts are, and some of those things can get really difficult ...
Jeremy: ... And then also just what we mean by a single purpose function, what that single purpose function does, and how that connects to other things, whether that's using orchestration or choreography. There's a lot to think about there. So I want to go back to that actually, in a little bit. But before we do that, let's talk about serverless in general, because you brought up why serverless is so great for these evolutionary architectures. So what is the value proposition of serverless when it comes to this idea, of not only building evolutionary architectures, but just like being more productive as software developers?
Vadym: Now we have mentioned for the first time the term serverless is it's quite unusual for your podcast after 20 minutes. But generally, this was the some kind of preparation job that we have done. And if we are talking about the value proposition of serverless, it's really huge. It starts with such obvious things like no infrastructure, operation, and maintenance, and this is part of extraneous load. And ask yourself, if the infrastructure management, and operation is core for your business? For AWS is probably yes, but the most of the companies, it's probably no. Even Lyft is spending $300 million per year on AWS. And they can spend less in the data center but this will slow them down.
Vadym: So just it's a decision. And also auto-scaling and fault tolerance built-in, it's also a huge part of the extraneous load, and it's just part of the platform, and ask yourself, how difficult it is, with all this capacity planning and so on. We at ip.labs we have huge Christmas business for three weeks with a huge spikes because people are making gifts, photo book of the year, and so on. And I really know how difficult it is and doesn't make any sense to own too much infrastructure for only two weeks, just 10x, 20x from just doesn't make any sense for us, even to think about this. Because we are also business to business companies, our forecast so the our partners should bring us forecasts. So they don't know, and how should we know, just it's not possible for us.
And, of course, this is the idea, the next idea to do more with less just in case of greenfield project, you can make prototypes very, very quickly. So just you don't own infrastructure, you pay on demand, it costs you really nothing to prototype. So just you can do things easier. And probably by relying on managed sources, we'll really talk about this. Just you can do more with the same amount of people. And I think that's matter. So you have static, so fixed amount of people currently and can you can really do more with them.
Vadym: And it's really powerful if you think, "Okay, I can offload some really nice technical things to the people, to the platform." Which for them it's the core thing, and they can help us to do things quicker. And then we are, yeah, we are talking about the technical debt or to have low technical debt. So we're talking about amount of code. And to minimize this and really like [something Johnson? 00:23:48] was very known in the service community. Whatever code you write today, it's the technical debt of tomorrow.
And it's just true. Just as we have explained this and the best code is no code at all just that I know the codes. But also you have to think that configuration infrastructure as code is also a part of your code. So you can't separate this, as the whole is your code. And of course, think of how much time do you spend maintaining your solution over the whole life cycle. And implementation is sometimes quickly but my experience and what I've read you spend 75% of the time maintaining your solution, and it's huge part. So you have to think about the entire lifecycle of your application, how to reduce your effort at maintaining it because maintenance doesn't have too much failure with just this, you have to do this and you have to free you up.
And then of course there are obvious things. So if you can free your app, then you can focus on your business value and innovation and have faster time to market. And it's probably every company want just this. Everybody's crying, we want to be innovative, we want to be fast. And that's what matters.
Christian: Yeah. And I think this is really a problem that we had at ip.labs. So we really try to implement a lot by ourselves. But actually, what you want is you want to concentrate on the core domain and not on the sub domains. Like sub domains, like payment or authentication, and effort in search, possibly you could do it really fast. But you have to maintain it for all time, and that's the problem. And ideally, what you want to do is concentrate on the core domain and don't implement the, like in domain-driven design you call it generic sub domain and supporting sub domain. And if you are able to use a managed service, what you get is, for example, free bug fixes. So you don't have to fix the bugs, because they will fix the bug, you don't have to do the operation, you don't have to do probably the scaling.
And you will also get new features. So you don't have to implement those new features by yourself. But another team will do it for you. So you can concentrate on your core domain because this is where you have your competitive advantage. You don't get any advantage from your subdomains, they're only there to support the core domain. And that's why it's really important to write less code in subdomains. So you can write more code for your core domain.
Jeremy: Right. And I'm glad you brought up domain-driven design again, because this is one of those things, I had a conversation on another podcast about this. And just this idea of how you implement these microservices using serverless, right. And so I know, at ip.labs you have some experience doing this. So what's it look like in practice when you're building microservices using serverless?
Vadym: Okay, just that it's about our service mindset. And I've seen this, this tweet from Jared Shorpe and about how to proceed and we have adopted this to our realities. But generally, serverless is really an operational construct. So our idea is to be as much serverless as possible. So that's services which are completely serverless, like Lambda, S3, S-Square, EventBridge, and there are services which are a bit less serverless, like Fargate, probably also Kinesis because you have to manage your files and so on. And then you have AWS and so on. And just think we ask ourselves, can we be completely serverless? So as much as serverless as possible, without dealing with capacity planning and so on, even DynamoDB they're not now offer capacity completely serverless mode without calculating read and write capacity units and so on.
And just the first thing to ask yourself, can we implement this completely serverless? Does AWS offer such as service which can help us? And then the answer is yes, then and the feature set is good enough for now, I would say then we will always choose serverless currently. In case it's not possible then we are asking, "Is there any AWS service which offer this service but we have to manage a bit?" And then it's the decision number two, and so on, sometimes even we have to choose some service which is outside of AWS like PagerDuty because we also have parts outside of AWS and they have to build incident management for overall system. So just this is some kind of other options. I think really, one important option is to reconsider your requirements. If they had the situation wonderful conversation for using BPMN as a workflow management system. We had the same experience but we said we just currently, step function didn't have enough features for us. But we could reconsider our requirements and still use step function because it's completely serverless.
But now they provided this feature. So this was the right decision to stay within the serverless ecosystem. Because it grows, it improves steadily and it will be... If it's now not the perfect decision, it will become one in the future. So we are constantly thinking how to embrace this system and stay with this system. But of course, if it's your core domain, you have to write this code and you have to own and also run this code, but it's the last option to choose to write the code by ourself.
Jeremy: Right. And I actually I really like that thought of saying that you build something now you maybe change requirements, but then you know that that service is going to get better. And it's going to have more features, right? So it might be even a better choice in the future, sort of the opposite of technical debt, right? It's going in the other direction, which is sort of a good thing.
But so I get the mindset, right. And I think the mindset is super important when you look at that, but then when you're actually implementing it, when you're following through and building these different services, how are you organizing Lambda functions? How small of units do you break it down into? I know, Christian, you said that it's a really hard problem to figure out you don't want too much coupling, you don't want too loose of coupling either. So how do you build out your projects? How do you kind of think about those for long scale, architectural planning and things like that?
Christian: Yeah. First, I would like to say a few words in general about architecture, because I have people saying that for serverless applications, so actually, you don't need to think that much about architecture, because Lambda functions are so small. And if you do something wrong, you can just throw the Lambda away and write a new one. And of course, this is wrong, so just because you're using serverless, doesn't mean you don't need architecture. Architecture is usually a long term investment. So it doesn't really have a big impact, maybe after months, but after years. But this also means for really small applications, maybe it's true that architecture isn't that important.
But for large scale applications, it's really important. So if you have large scale and really complex domains, then it gets really important. And this is true for a Java application based on Spring Boot and this is also true, of course, for serverless applications. So when you have a large scale application, I think the idea of domain-driven design gets important again, and as I said, this is true for other applications or other platforms. And this is also true for serverless applications. And this is also the way how we try to organize our serverless code. And so we have a large domain and we try first to split this large domain into smaller sub domains. And so our sub domains are still too large so we further try to organize those sub domains into smaller bounded contexts, and inside a bounded context, recreate a domain model.
And so we usually don't deploy Lambdas independently. So we use the serverless framework and we bundle everything that's part of the same bounded context or the same service into one serverless yaml file and deploy it in one unit. So we always try to organize around business capabilities. And I think this is really important.
Jeremy: Yeah. I think that's a really good way to break it down. And I think the strategy that you see a lot of companies that are maturing with serverless are going down that path, right? I mean, you see some companies just building monolithic applications where everything's kind of flying around. But what about like data separation and things like that? How do you break up your data between those different contexts?
Christian: Yeah, I think this is a really interesting topic. So I think one question that arises is should I use one? So we are using DynamoDB. It's also a serverless database. And the question is, should I use one table or should they use multiple tables? So multiple DynamoDB databases. And if you look at the documentation from AWS, or watch some re:Invent videos, they recommend you should use one table per application. And you should understand why this is the recommendation. And the reason why you should use one DynamoDB table instead of multiple tables is, when you come from a relational database background, usually you normalize the data. And that means you create one table per entity. So you have a user table, you have order table, you have order item table, and so on. And if you, for example, wants to have a query, if you want to query all items that use them, you will have to do a join over user over order over all items.
And this is really flexible. So you can do ad hoc queries with SQL. But the problem is, this is not scalable. And this can be a huge problem. And DynamoDB is a scalable database and it scales by avoiding joins completely. So with DynamoDB, you should not do any joins. And internally, DynamoDB is working with partitions, so it has I think it's at max 10 gigabytes per partition. But you can add partitions to the database. And so it can scale almost infinitely. But of course, you still need to model relational data also in DynamoDB. So how would you organize this? And you would do it by denormalizing the data.
So of course, this means that you have to duplicate a lot of data but the idea is that your items are already joined. So if I give you an example, say we have an order and you would have the order ID as partition key and the order item ID as a sort key. And then you can do really one query based on this hash key. And you would get back the order with all the necessary information of the order item of the user and so on. And this is actually the reason why you should use only one database table, because you can now avoid joins.
So what does it mean for microservices? Because when we talk about microservices, so actually every service should own its own data. So you wouldn't do a join between different microservices. So it's actually okay to say each microservice should have its own database, it's totally okay to do it. But of course, you will have some additional operational overhead. And yeah, maybe you will start with five DynamoDB tables. But think about what will happen in maybe five years or 10 years? Will you have 50 tables or 100 tables? Will still be okay for you. Because so there's a little operational overhead, it's much less than with relational database, but you still have to do like monitoring, throttling, and so on. When you have a large number of databases, this might be too much. Can be okay, but it might be too much. And that's why we decided to share a database across our services.
So we have just one DynamoDB table per sub domain but we share it across multiple services. But still, every service is only allowed to access the data of this service. And we make sure by using fine grained access control. So in our policy, you can configure that service only allows to see the data of the service. So there is no possibility to break something. But of course, it might be a disadvantage, because now you have multiple services, and they aren't completely independent. But for us it works really good. So we have one DynamoDB table. It's not dangerous, because every service can only see its own data and we don't have this operational overhead.
Jeremy: Right? Yeah. Now, if you want to talk about cognitive load, thinking about how to structure data in DynamoDB to be shared across multiple services, it's certainly something you have to really think about. And I do like that idea. And every time I see AWS make that recommendation, one DynamoDB table per application, I always think of, "What do you mean by application?" And if you have a system with microservices and you have 100 different microservices, well, each one of those microservice or each microservice might be its own application, if you think about it that way.
Christian: Yeah, that's right.
Jeremy: But it's interesting because I always like to see how different companies implement that, because in some cases, it does just seem like it makes more sense to share a table across multiple services. But like you said, as long as they're in the same domain, and it's using the same sort of descriptive language and the same vocabulary, then it's probably less of a risk than if you were sharing across multiple domains, for example. So that's really interesting. So what about ports and adapters? So I know, hexagonal architecture is something you've mentioned in your presentation, and so this is something that you invest in heavily and just in case people don't know, can you explain what do you mean by ports and adapters?
Christian: Yeah. So ports and adapters is an architectural style. So it's actually a very easy idea. So you have your domain or your domain logic, and you try to separate it from any infrastructure logic. And you do it by implementing ports in your domain logic and different adapters, like DynamoDB adaptor or email adapter can plug in into this port. So in the middle, you have your domain code, and around this domain code, you have your adapters that you can switch. And the whole idea is you should try to separate infrastructure logic from domain logic.
Jeremy: And I always say this is something that I liken very much so to like a data access layer, right? Where you sort of genericize your database, not an ORM, we're not talking about ORMs, but like a database access layer, where you'll have a "get customer," you'll write that function, and then how that interface is actually implemented to the database, that's a completely separate thing. So you can always switch that out, if you ever needed to change how that works in the future.
All right, I want to go back for a second, because we talked in the beginning about this is about increasing productivity. We said productivity was this idea of shipping code and whatever. How do you actually measure this success, though? Because that's something where, I mean, I know there's a really good book called Accelerate that kind of outlines some of these success factors. But how do we measure the success in an organization and how does that tie back to serverless?
Vadym: So you, you mentioned this book, Accelerate by Nicole Forsgren, Jez Humble, Gene Kim, they're probably very known in DevOps community, continuous delivery, continuous deployment, and so on. And what they say, but what they found out is they wanted to investigate what make organizations vary performance. And currently, there are two things, the quality and the speed. And the message is to be successful today, you have to combine both. And they define metrics called four key metrics and two of them are for the speed, this is deployment frequency, how often do you deploy? But of course, it's not that you should deploy every change, you should be ready to deploy this. But in case the business says, I need this life.
And the second metric is lead time for changes. So how quickly can you deploy? So how much time do all your tests take? And so on, the whole pipeline thing. So how quickly can it can I deploy my code into production? It's about automation, and so on will probably also talking about, and two other metrics are about the quality. And these are time to restore the service. So if you see something went wrong, how much time does it take to go to the state that everything is working automatically, or by back fixing and just doesn't matter. And the second one is change failure rate. So how often you deploy things and also then break things because it happens if you are very quickly. But the thing is, you have to restore them quickly. And of course, they divide organizational different performance types and, of course, nobody wants to be low performer, but just they have some guidelines.
What do they mean by becoming high performer and then you see the things that in terms of deployment, it's something mainly times per day, it shouldn't be that like Netflix or Amazon, they deploy 1000 times a day, it's probably not the case. And even if you are in the mobile market, it's not very obvious to deploy and update so quickly. But in terms of lead time for changes, it's really about hours, minutes, or hours today. And the time to restore server it's the same it's for high performance organization. It's far less than one day in and that's really these are the four metrics, which are really important. And they're not tightly ... not very tied to serverless. But as a general recommendation, what outcomes do we want to achieve. And these four key metrics are outcomes, and was best practices and so on belongs to do the things how you want to achieve them.
Christian: Measuring productivity is actually really hard. So I think many agile teams use story points. But story point isn't actually a good metric to measure velocity, because it's actually very easy to optimize this metric. So I could optimize it by say, I have a task, and instead of giving 10 story points to this task, I can give a 20 story points to this task. And I'm twice as fast right? So, of course, I'm not twice as fast, and I just changed the estimation. And so it's not a good metric. So what you actually want to measure is the lead time. So you get a request and how long does it take until it goes to production? So this is actually what you want to measure.
But it's actually also not easy to measure it, because when do you want to start? So do you start when you get the first phone call? Or do you start when the manager accepts the request? And that's why you have this suggestion from the book Accelerate. And they say you could measure the lead time for change. And the lead time for change is basically how long does it take from commits to production. Because if you commit something, you don't add any value, you only have value if something goes to production. Because when something goes in production, then you have an outcome. And you don't have an outcome when you just commit and doesn't go to production.
So this is actually the base metric, the lead time for change. But the problem is, if you only take this metric, you can have a high risk, right when at commit and everything goes immediate to production, you can have a high risk, and maybe you can increase your failure rate. And that's why you also have to look at this other metrics, like deployment frequency, which is actually something to reduce risk, because it's actually proxy metric for batch size. So you try to reduce the batch size. But measuring the batch size is hard and that's why you use this proxy metric deployment frequency, which reduces the risk.
And you also want to look at the failure rate. So you want to reduce the failure rate. But as we know, it's really hard to really make no mistakes. And that's why you can also look at the meantime to recover. So when it's a really, really short time like say, one minute, you have failure, and after one minute fixed again, the impact is really small. So it's not that bad if you make a mistake. And that's how this idea of those four key metrics... that's actually the idea of the four key metrics. So the base metric is leads time for change. And the other metrics are supporting this, this base metric.
Jeremy: Right. And so that's, and I love those metrics because you're right, I mean, I think in the book, they talk about how you can just, you can easily manipulate a lot of these other metrics to make them seem really good. But you can't really fake things like, you know, the deployment frequency, you can't fake how long it takes to fix a problem. And the number of problems that increase, I mean, other than hiding that information, it's really hard to fake.
So how does serverless though just because of the practices that I think have been developing around serverless, how does that help with all of these metrics? Like how does it increase those metrics and or, I guess, decrease some of the metrics, but how does that enhance that experience and increase productivity?
Vadym: So generally speaking, the authors of the book also talking about software delivery and operational excellence. So what best practices do we have to gather to know and to take in place to become productive and they have identified some of them like loosely coupled architecture, which really helps to achieve this and as far as we know, serverless enforces this. Of course, you can have monolithic Lambdas but I don't think that the people doing serverless is because of this. So just loosely coupled architecture is one of the points.
The second one they are talking about code maintainability but I think they mean evolution-ability. So this evolutionary architectures, and also the serverless is also about this because you deploy this smallest unit. And of course, there are other practices that you have to consider like chaos engineering and so on to inject failures to see how your system react on failures. And there are a lot of tools around serverless which can help you.
And of course, time to restore services, you have all this kind of supporting tools for blue and green, and also cannery deployments. So you can do this with API gateway, you can do this with Lambda, with Lambda you have aliases and traffic shaping, with API gateway you have stage variables, and also you can combine this with Lambda. So you just with all these things in place, you can really, you can really decrease this time to restore service. But generally speaking, and I think you mean, how does serverless relate to this? It's probably that Simon Wardley tells us that if something changes, and then the core evolution of practices also occurs, and now we see that the serverless and Lambda execution environment, it becomes commodity.
And that means to be productive with this, you have to apply other practices, which is the same as that you can be productive, if you use NoSQL database and apply all the best practices from the relational database, it simply doesn't work. So and this is also true for every evolution of practices. So you can be successful with method of yesterday. So just in case. So that's a lot of things you have to do on your organizational level. And Christian also talked about the practices which they apply in their teams, and so on. So there's a some kind of mind shaped cultural shift that that happens.
Jeremy: Right. And I think that you mentioned evolution again. We talked about evolutionary architecture, we talked about coevolution of practices. I mean, one thing that is evolving very fast is serverless, right. And what you can do with serverless, what the features are available, what services come out. And if we look at recent launches, I mean, just the other day, we got the extensions API, where now you can tap into the lifecycle of a Lambda function, and you can run almost another function running alongside of it to do different things. We've got EFS integration, we had RDS proxy as an extension of that HTTP API's. And we just have so many of these new things that have been launched. And that's great.
And everything's getting better. Lambdas and VPC's are getting faster, I don't have to worry about all ENI cold startup time, and that sort of stuff. But where do we still need to go? Because it's always great to say that serverless can get us or serverless can get us somewhere really fast. Right? And I know, one of the things you mentioned in your presentation is the last 10% trap, right? I mean, if we're building applications that we can get to get something really fast and get almost all the way there, that last sort of 20% is super hard. And then that last 10%, oftentimes, we just have to give up and go back to something else. So how do we avoid that last 10% trap with serverless, like what else needs to be added so that we can sort of get it to be the default choice for everything that we do?
Vadym: So this 10% trap, the authors of evolutionary architecture book, they wrote the article that they compared all architectural styles, and this was the some kind of statement that serverless often suffer from this 10% trap. My personal opinion, it's probably used to be the case, but currently, it really disappeared. What is easy with serverless, if you can't solve some partition problem with serverless you can easily switch to other architectural styles, like containers and so on, you are not forced to be completely serverless. It's not that difficult to switch. And you have mentioned a lot of changes that has happened even elastic file system for measuring cloning and artificial intelligence that you can now attach one or multiple elastic file system to your Lambda and so on. VPC cold start is reduced. And other things like JS proxy, Aurora serverless, or data API, and so on. Just a lot of things has happened and even extension API, which was released recently and cloud watch Lambda insights in preview, which was released recently.
So there's a lot of things happening which will reduce this gap. But generally speaking, even cloud watch service has improved over the last year with the possibility to search in the multiple log groups in cloud watch log insight, which gives you the language to search in your log and even embedded metrics format to send your logs asynchronously. So a lot of things happening, but probably there are some more steps to go for the platform to be mature, to improve. So I really like elastic file system but I think S3 is really, really superior, because it's really very good integrated with all this events. So if the file is created or updated, you can easily call S3 and it's not possible with the elastic file system. And also, all the compliance services are deeply integrated with S3, like AWS config and so on. And all these services are important. So this is, if the people will use elastic file system within Lambda, they also have to ship also the services.
And probably very sensible topic, but I think that cloud watch, a lot of improvements, as I mentioned, but in terms of observability, and alarms, there are a lot of third-party software, the services, which are really superior, they are. So they offer really a much broader experience but of course, it costs you money. So my desire is really to stay within serverless ecosystem in AWS to use Cloud watch and not to go outside because I have to shift my data outside to think about security and so on. And just generally, I would like to stay there. And that's probably hard to improve. And of course, this situation with Xray support, it's really very important service but sometimes it's other sources, like if you've been breached them, they're missing this possibility. So many services, which are called asynchronously don't have this possibility currently.
I know that AWS ships this the same game was closed for SQL, SSNS, and so on, it takes time. But sometimes you wish that this extra functionality will become available from day one if you want to use the service because otherwise you have gaps in your observability and so on. And of course code commit. I know Christian uses code commit in his team. But this service is very, very basic. So I see that many people only use code deploy and other code, and for commit and build and so on other services. So this is something that code commit is currently not comparable, not nearly comparable to Github and Bitbucket.
You can only do basic things, but sometimes that's enough. But if people have gathered experiences with other services, then they want something more. Yeah, with more functionality.
Christian: So actually, we are using still using big Bitbucket and then a Jenkins pipeline pushes this code to code commits to make it available to code pipeline. That's how we work with code commit. Yeah, I agree with what Vadym. So when you use a framework, it can make it really fast until you reach the edge of your framework. So when something is not possible with a framework, then it can get really hard because then you have to fight your framework. And it's the same with serverless. So when you reach the edge, so something that's not possible with serverless and can get really hard. But I also think that a lot of gaps were closed in the last month and years. So it has less and less restrictions.
Jeremy: Yeah, definitely. And I totally agree on the observability stuff with you. I mean, it is hard to be tracing everything through your system and the third party tools work really well for that. It is going to be interesting to see what people do with that extensions API. And how much more insight that'll give us into the Lambda lifecycle, but again, you still have EventBridge, and these other things that still need that Xray support to kind of trace all the way through.
All right, so we didn't even get a chance to really talk about total cost of ownership. I think we did in some contexts, regarding reducing the amount of employees that you need or developers you need. But I'd like to finish up and just ask each of you to give me like, what's your top recommendation for people building serverless? Like, what's the one thing you would say to them, like "Here's the absolute thing you need to implement in your organization, if you want to go serverless."
Vadym: One thing I would say, through DevOps, so just no separation between the Devs and Ops. There is really a good page also from the authors of the team topologists. This is the DevOps topologists, and they have shown a lot of best practices, but also anti-patterns. And the best thing for the serverless is really to DevOps where the people are really working together as a team. And this is probably the hardest thing depending on how your organization currently works. So to get people there, to get the ops people there because you don't have any service, you can't install any agents. And it's something like cultural shock for them. But there is a really good talk from Tom McLaughlin, "What do we do if the server goes away?" And he explains a lot of challenges, where their Ops people can take charge with things like alerting and monitoring for the whole system.
And also, they have the better feeling about the restrictions of each service. Is SQL, SNS or EventBridge, or some kind of combination is the best solution currently for the next several years, probably. Because they understand those restrictions very well. They look into this, they did this with the storage, they did it with the database, they are really very sensible. And of course, chaos engineering can gain days. So a lot of challenges. So even Ops people don't have to be scary.
But it's a learning curve. But it's also a learning curve for the developers. Because distributed systems Microsoft is also hard for them. So just it can become a win-win situation if both parties learn and learn together, then you have really good chance, though, to embrace serverless correctly.
Jeremy: Love that. Great advice. And Christian, what about you?
Christian: So I'd say one of the most important thing is you really must automate everything. So it's okay to try something out on the AWS console but not in your production environment. In your production environment, you really must automate everything, because you have so many small parts and if you start making manual changes, you will get lost very quickly.
Right, awesome. Yeah, totally agree with that. All right. So gentlemen, thank you so much for joining me, this has been an awesome conversation. So if people want to find out more about what the two of you are doing, what ip.labs is doing, how do they get ahold of you? How do they find that stuff out?
Vadym: You can find me on Twitter and on LinkedIn, with mine first and last name. I think you will put it into the show notes, Jeremy, so because it's very difficult to pronounce my surname. So I'm really active on Twitter, tweeting and retweeting about the experiences and Christian and me we are talking at various conferences about the experience, and we will probably continue doing this because there is just so much to learn and to talk about in this community. So we will definitely be active I think.
Jeremy: Awesome. And Christian, I know you don't have a Twitter account or not a very active Twitter account, right? So I have some email addresses here, firstname.lastname@example.org and yours as well, Vadym. So I will put those in the show notes. I will also put the link to the presentation. There's a SlideShare of this, we'll see if I can find one of the videos of this presentation because it's fascinating. The topic is amazing. And I really love this idea just of where serverless can take you and where it can bring you on that productivity, sort of that productivity spectrum. So again, thank you both for being here. This was amazing.