Episode #75: Achieving Operational Excellence with Taavi Rehemägi

November 16, 2020 • 41 minutes

On this episode, Jeremy chats with Taavi Rehemagi about what's still missing with serverless observability, what modern cloud monitoring and operations strategies look like, and how to continuously implement best practices to ensure well-architected applications.

About Taavi Rehemägi

Taavi Rehemägi is the Co-Founder & CEO of Dashbird, a serverless monitoring and intelligence platform for building and operating complex applications on AWS. He has over 13 years of experience as a software developer and 5+ years of advocating for the serverless revolution and building serverless applications at various organizations himself.

Twitter: https://twitter.com/rehemagi
Dashbird: https://dashbird.io/

Watch this episode on YouTube: https://youtu.be/xeF19VCuoV0

Transcript

Jeremy: Hi everyone. I'm Jeremy Daly and this is Serverless Chats. Today, I'm chatting with Taavi Rehemägi. Hey Taavi, thanks for joining me.

Taavi: Hey, thank you, Jeremy. Nice to be here.

Jeremy: So you are the CEO and co-founder at Dashbird. So why don't you tell the listeners a little bit about your background and what Dashbird does.

Taavi: Sure. I've been a developer myself for pretty much my entire life. I started coding when I was 14 and since then, before starting Dashbird, I was an employee in two different startups. The last one I was working a lot on serverless. That was in 2016/'17, which led me and some of the team at Dashbird to found this company called Dashbird. We're an operations platform for serverless workloads. We help companies who are building on serverless to achieve excellence with their infrastructures.

Jeremy: Awesome. So we have done a number of shows about observability, because observability in serverless seems to be that third-party offshoot that has been missing. There's a lot of things that AWS just didn't really tackle initially with a lot of the observability stuff. Now, they've added quite a few things, but again, it's nowhere near as easy to use as some of these third-party tools like Dashbird. So there are obviously constant enhancements.

They just launched not too long ago, and we can get into this in a little bit more detail, the Extensions API for Lambda, which allows tools like Dashbird to have more control over the life cycle of the Lambda function, so they can get metrics and telemetry data and things like that. But I think there's still a bunch of stuff missing. I think you would agree with me on this, that there's more we have to do in order to understand and observe our serverless applications. So I'd love to get your input because I think Dashbird has sort of a different outlook or I guess a different roadmap for how you want to address the observability problems, and it's super interesting. So why don't we start there? What's missing in your opinion with observability and serverless?

Taavi: Sure. So I think first off, observability is one thing we do, but when it comes to operating serverless infrastructure, and we're talking about high-load, at-scale environments, there's a lot going on there that we try to help companies with. As an engineering team, if you're really building something that has hundreds or thousands of functions, for example, and a lot of different cloud resources, then the one thing that's really difficult, obviously, is monitoring data and getting an overview of the activity going on across those resources and across your infrastructure.

But there's also: how do you detect failures, how do you get notified quickly, and how do you respond to incidents and solve them? There's also keeping up with things like security and cloud infrastructure best practices, and optimizing for performance and costs. So monitoring is just one part of the puzzle. Having been in this role where we were building a pretty substantial serverless infrastructure, there's a lot going on there. A lot of those things, as a team, you would have to build yourself, figure out yourself, and construct strategies around how to improve. So that's really what we're trying to do for organizations. We're trying to build an abstraction level for operational practices, pretty much.

Jeremy: I love that because it's a more holistic approach, I guess, to building and managing a serverless application, as opposed to just being responsible for the monitoring aspect of it. Because again, operational-wise ... and I forget who I was talking to about this, but monitoring and observability in serverless is great when you get an alert that says something went wrong. But it's also really good and comforting to know that something went right.

Right? To know that events are flowing through the system and that the SQS queues are processing correctly, and knowing that those things are working correctly gives you that level of confidence. I think that's really cool. From the Dashbird perspective, and again, I want to keep this a little bit more general. We don't want to make this all about Dashbird, but I really do love this perspective that you have. What is the vision in terms of being able to manage, not just the monitoring piece of it, but also the operational piece and implementing those best practices? How do you look forward or how do you plan a product that does that?

Taavi: When we started working on Dashbird, obviously we didn't come up with this vision in the first iteration. At first, we were just building a tool to monitor Lambda functions, pretty much. What came out of that early on was conversations with hundreds of people or companies who were actually struggling with this. And after all of those conversations, I think we kind of constructed this hypothesis around what this platform should look like for those teams that were the early adopters. So what Dashbird is today, and what we're building it to be, is a platform you can look at as three different pillars, and I can go into those pillars if it makes sense?

Jeremy: Yeah, let's do that.

Taavi: Sure. So the first pillar that we have is a data centralization pillar. What we do is we connect to your AWS account without any code instrumentation. We don't use Lambda extensions or layers or instrument the code at all. Instead, we discover the entire cloud infrastructure that you have and start ingesting all different types of monitoring data for those resources. That includes things like log data, metric data, tracing data, configuration data, and really everything that the system is putting out externally. And from that extent of data, we're trying to understand the state of the infrastructure and to make that data available to the engineering teams, so they can search, query, and interrogate that data in all different ways. So basically the first objective is to get everything in one place, to break down the silos between logs and metrics and traces, and to be able to look at activity across different services and different resources. So that's the first thing that we do.

Jeremy: Well, let's talk about that for a second. So the idea of instrumentation, this was something right from the beginning with Lambda that you really couldn't do, right? I mean, you can't install an agent somewhere that just listens to all the activity that happens with a Lambda function. Now we've got layers, we've got custom runtimes, now we have the Extensions API. So there's different ways that within a Lambda function, you could add some type of instrumentation. Even just wrapping the entire function in another function was one of the strategies that was used.

I know some companies would read off of your CloudWatch Logs and of course, just recently, we've got the ability now to attach multiple listeners to your CloudWatch Logs, so there's all these things that are evolving. But you don't have the ability to instrument all of the other things that are part of that ecosystem, so your SQS queues and your EventBridge and DynamoDB. There are logs that are there, but that's the thing: just adding some instrumentation to the Lambda function itself covers a very small part, I think, of your overall serverless application. So how do you make sense of all of that log data and connect all of it together?

Taavi: So really early on we discovered two things. The first is that Lambda is such a small part of the infrastructure, and what really makes up most of your infrastructure are things like SQS queues, databases, API Gateway. There's a large surface area there that's actually as important as functions. The other kind of fundamental realization was that functions are simpler than the code you would have in containers or something like that. There's a singular thing that any one function is doing, usually, if you design it according to the best practices. So the complexity is low enough that it doesn't need code-level instrumentation most of the time. And we didn't feel the pull of the market towards providing customers with really low-level data.

So that's why we took this approach. How we provide value for DynamoDB table monitoring, for example, or API gateways, is first of all broad coverage of all of the resources. So if your API gateway is timing out, or if it's having failures, or there's an increase in anything, then that's automatically discovered and continuously checked for. Really what we're trying to do is bring the mean time to resolution across any one resource down to as small a time as we can.

Jeremy: And so if you're not doing instrumentation within the Lambda functions themselves, so you're not capturing, like you said, low-level metrics. Just the data that pours out of Lambda. I mean, obviously you've got CloudWatch metrics and things like that that are really helpful. It'll give you failure rates and invocation rates and concurrency and things like that, but the log data itself sometimes has valuable information in it. But if you've ever looked, and I'm sure you have, looked at the log data that comes out of a Lambda function, I mean, it's a lot of junk, it's a lot of stuff that's just useless.

And if you think about log shipping solutions that just take those CloudWatch logs and send them all to some other system, whether that's Elasticsearch or something like that, that is a lot of data that you're storing that, at least in my opinion, is useless. There's things you don't need to know. How do you, I mean, I guess obviously I think your tool does this, but extract value from the log files? Because it seems to me like there's just a lot of junk in there that you don't need and you certainly don't need to be saving.

Taavi: Yeah, exactly, and that's another topic that we spend a lot of time thinking about. The thing with managed service logs is that they are high in volume and low in density of value. For example, one API Gateway request makes around 19 or 20 log lines, and most of them are completely useless if it's a successful invocation. And I can honestly say the same about Lambdas as well, there's a lot of noise. So in our case, what we do is we apply prebuilt filters on top of the log streams, so that if there's a code exception, a timeout, or any type of service-specific failure, then we have a filter for that. We automatically detect and aggregate those to see how they happen over time, and manage each one as a failure scenario.

For a lot of the not-so-important logs, we store them away somewhere in cloud storage. We don't keep them in the log analytics part of our platform, where it's warm and expensive to retain. So I think one of the important things is getting the cost down from processing the sheer amount of log data. The other part is equipping the engineering teams with the right filters and the right knowledge to catch those known and unknown failures that can happen. It takes a lot of time and effort for each engineering team to actually put together: what could possibly go wrong in my logs, and what should I be monitoring for? So that's how we approach it.
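
As an illustration of the filtering idea described above, here is a minimal Python sketch. It is not Dashbird's implementation: the pattern set, the `classify` and `summarize` helpers, and the sample lines are all assumptions made for the example, and real services emit many more failure shapes than these two.

```python
import re
from collections import Counter

# Illustrative patterns only (assumed for this sketch, not an exhaustive list).
FAILURE_PATTERNS = {
    "timeout": re.compile(r"Task timed out after \d+(\.\d+)? seconds"),
    "exception": re.compile(r"\[ERROR\]|Traceback \(most recent call last\)"),
}


def classify(line):
    """Return the failure kind for a log line, or None if no filter matches."""
    for kind, pattern in FAILURE_PATTERNS.items():
        if pattern.search(line):
            return kind
    return None


def summarize(lines):
    """Count failures by kind; lines that match no filter are counted as noise."""
    counts = Counter()
    for line in lines:
        counts[classify(line) or "noise"] += 1
    return counts


# Hypothetical log lines for one successful-looking and one timed-out invocation.
sample = [
    "START RequestId: 1 Version: $LATEST",
    "[ERROR] KeyError: 'user_id'",
    "END RequestId: 1",
    "REPORT RequestId: 1 Duration: 12.3 ms",
    "2020-11-16T10:00:00Z 2 Task timed out after 3.00 seconds",
]

if __name__ == "__main__":
    print(summarize(sample))
```

In a setup like the one Taavi describes, only the lines that match a filter would need to stay in warm, searchable storage, while the "noise" bucket could be archived to cheaper storage.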

Jeremy: So then speaking about monitoring, another one of these pillars that you've mentioned in the past is the idea of alerting. I mean, a good monitoring solution is going to have alerts, but again, you take a little bit of a different approach, I think, to how you alert.

Taavi: Yeah. I think when we really started to acquire customers, it was after we did alerting. At first Dashbird was just a tool where you could roam around in data and look at different things, but when we started sending emails when there was a timeout or something, that tripled the usage overnight, pretty much. So you have to send meaningful alarms, and that's where 70% of all the use cases in our platform start: when somebody gets a notification. What we provide for our users is this coverage: whenever something goes wrong across your infrastructure that you should know about, we let you know. So we cover all of the API Gateway failures, or if the latency increases for API endpoints, or if there is a delay in the queue, we manage that alert setting and alert handling. The situation is that if you have hundreds of resources, each of those resources has five or six different potential failure scenarios that can happen, so we try to take on that overhead.

Jeremy: Well, and I think that's an important piece of this too, is to say, you can set up a CloudWatch alarm that says when my SQS queue has more than a thousand messages in flight or whatever that is then I want to send myself some alert. And then you've got the ability to send another alert when the threshold drops or something or whatever. But I guess my question around this is even if you know what's happening, like even if you are a serverless architect and you've been doing this for several years, I think I could look at something like me personally and say, "I know what alarms, I probably want on this particular resource."

But what I certainly don't want to have to do is set that up on thousands of resources. So automating those alerts is one, I think, cool benefit. And I know a lot of services do this as well, but just bringing your experience to understand what the patterns are and knowing when something is a problem. So again, is there anomaly detection, or how do you set that up where you automatically set up these alerts for these different resources based off of your experience with what the right pattern for failure looks like, I guess.
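
The point about not hand-configuring alarms on thousands of resources can be sketched in Python as follows. This is only an illustration under assumptions: the helper names, the alarm naming scheme, the five-minute period, and the default threshold of 1,000 are made up for the example, and the metric used is the visible-message backlog (`ApproximateNumberOfMessagesVisible`) rather than in-flight messages. The dictionary keys follow the parameters of the CloudWatch `PutMetricAlarm` operation.

```python
def queue_backlog_alarm(queue_name, threshold=1000, topic_arn=None):
    """Build PutMetricAlarm parameters for one SQS queue (values are assumptions)."""
    return {
        "AlarmName": f"{queue_name}-backlog",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds; assumed evaluation window
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn] if topic_arn else [],
    }


def alarms_for(queue_names, **kwargs):
    """Generate one alarm definition per queue instead of configuring each by hand."""
    return [queue_backlog_alarm(name, **kwargs) for name in queue_names]


if __name__ == "__main__":
    for alarm in alarms_for(["orders", "emails"], threshold=500):
        print(alarm["AlarmName"], alarm["Threshold"])
```

Each dictionary could then be passed to the boto3 CloudWatch client as `put_metric_alarm(**alarm)`; the sketch stops short of calling AWS so that it runs without credentials.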

Taavi: Yeah, for a lot of those complex alarms, we're looking at historical data as well and seeing if that is the pattern. The change over time is basically also a trigger condition for us. The way we look at setting alarms is that some things could be more critical than others. So we look at API endpoints that have error rates or high latency more critically than something that's perhaps more downstream. We look at those user-facing things more critically, like something that causes a high delay or affects the user experience more. We treat that as a more critical event than, for example, something that's just a little bit slower, abandoned, or not being used. So we try to prioritize as well.
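
One simple way to picture a threshold that follows historical data is a baseline of mean plus a few standard deviations. The Python sketch below is an assumption for illustration only, not how Dashbird detects anomalies; the function names, the multiplier `k`, and the sample queue depths are hypothetical.

```python
from statistics import mean, stdev


def adaptive_threshold(history, k=3.0, floor=0.0):
    """Threshold = mean + k * sample standard deviation of recent observations."""
    if len(history) < 2:
        return floor  # not enough data to estimate spread
    return max(floor, mean(history) + k * stdev(history))


def is_anomalous(value, history, k=3.0, floor=0.0):
    """Flag a value only if it exceeds the threshold derived from history."""
    return value > adaptive_threshold(history, k, floor)


# Hypothetical recent queue depths for a queue that normally sits around 500.
history = [480, 500, 520, 510, 490]

if __name__ == "__main__":
    print(round(adaptive_threshold(history), 1))
    print(is_anomalous(530, history), is_anomalous(900, history))
```

With a fixed alarm at 500 messages, this queue would alert constantly; with a baseline derived from its own history, only a real departure from normal behavior trips the alarm.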

Jeremy: Yeah, no, I think that's super important, I mean, because again, things change over time too. So I could very easily set an alarm that says when my SQS queue goes over, whatever, 500 messages then I should be looking at that or I should send myself some alarm, but if that is slowly increasing over time and historically I'm getting more traffic, so now my SQS queue is backed up a little bit more and it's common for it to do that, having a system that can adapt and understand, I think is crazy important. But anyways, so the other thing though about, and I think I mentioned this before, is about understanding patterns and knowing what's the best way maybe to implement an alarm. Beyond just alarms, there's just best practices out there.

And AWS has a very good resource, it's the well-architected framework and specifically for serverless, there's the serverless lens. I love this resource. I suggest everybody go and read this resource if you're building a serverless application so you know what to do, what not to do. There's a tool that they have that actually allows you to track how compliant you are or whether you're following these things. But it's a manual review process, it's a matter of answering questions. So again, what's the way that in the future, we can ensure that these best practices are being followed without having to have a human go and keep looking at these things.

Taavi: So in Dashbird, not to do too much product placement, when we looked at all the types of data that we have, basically all of the monitoring data plus the alerts set up in the platform, we realized that there's this opportunity to actually run a lot of analysis on top of that. There's this whole book or framework around what a serverless application should look like and what the best practices are around security and operational excellence. And we discovered that a lot of this we could actually find out using the data that we already have, and build a system that continuously surfaces issues and pushes the user towards the best practices.

So today we have a collection of rules that we continuously apply and check for, and then get back to you with this list of, "Hey, your API endpoints are not encrypted, or you're not using the right encryption in your databases, or you have something that's unused or abandoned or not tagged." And a lot of those things we can surface and push the user towards. We're trying to automate and equip teams with the best ways of following the best practices of the industry, so that's what we're building, and we've been having quite a lot of success recently as well.
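
A continuously applied rule set of this kind can be pictured as a list of small checks run over resource configuration. The following Python sketch is purely illustrative: the resource fields (`tags`, `encrypted`, `invocations_30d`), the rule names, and the severity levels are assumptions for the example, not Dashbird's rules or data model.

```python
# Each rule: (name, severity, check). A check returns True when the rule is violated.
RULES = [
    ("untagged", "low", lambda r: not r.get("tags")),
    ("unencrypted", "high",
     lambda r: r.get("type") == "database" and not r.get("encrypted", False)),
    ("unused", "low", lambda r: r.get("invocations_30d", 1) == 0),
]

SEVERITY_ORDER = {"high": 0, "low": 1}


def evaluate(resources):
    """Apply every rule to every resource; return findings with high severity first."""
    findings = []
    for resource in resources:
        for name, severity, check in RULES:
            if check(resource):
                findings.append((severity, resource["id"], name))
    # sorted() is stable, so findings keep discovery order within a severity level.
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f[0]])


# Hypothetical inventory of discovered resources.
resources = [
    {"id": "old-fn", "type": "function", "tags": {}, "invocations_30d": 0},
    {"id": "orders-db", "type": "database", "encrypted": False, "tags": {"team": "a"}},
]

if __name__ == "__main__":
    for finding in evaluate(resources):
        print(finding)
```

Attaching a severity to each rule reflects the prioritization discussed in this conversation: an exposed or unencrypted resource ranks above a missing tag.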

Jeremy: No, I think that's really cool. Because I mean, that's one of those things where it's so hard. I mean, if you think about static analysis of code, you can catch some things with that. You could look at configurations and you might be able to say, "Oh, the security, you've got a star permission here or something like that." But until the code is actually running and data's flowing through it and you're seeing what happens there, that, I think, is where the rubber hits the road and you can see how that stuff works. So that's fascinating. Now I do have a question though. I mean, best practices in serverless are really hard, and I know that AWS has their serverless lens for the well-architected framework and they make really good suggestions, but there's always a time, at least for me, and I do it quite a bit, when you have to break the rules to make something new happen. So how is your system going to deal with breaking those rules when it needs to?

Taavi: Well, that's the constant challenge: keeping the alarms adequate. A user looks at this and says, "Hey, I do understand why this is here, but it doesn't apply at all." And I think when we first started, alarms were going off all over the place, to be honest. Over time we removed the ones that are optional, but it's not easy. I think the other thing that we can play around with is the criticality level. How critical is something? If you're really exposed somewhere, then you should have a high-priority alert. If you're not tagging your resources, perhaps that's not as important.

Jeremy: Awesome. Go ahead.

Taavi: I just wanted to say that we're never going to get to the perfect 100% with those insights, but yeah.

Jeremy: Yeah, I really do like that approach. And again, there's a lot of observability companies out there, there's a lot of log shipping companies, monitoring companies, all these different things around serverless and they all do a really great job. I mean, for what they're doing. But I do really like this approach where you're saying, "Let's take a step back and solve the observability problem, solve the monitoring part, but also solve that operational quality and operational excellence problem." I think that's an interesting approach. So good luck with that because I know it's not going to be easy. Like you said, you're dialing that in to get it right. So let's talk about just, I guess, monitoring your cloud and your operations in general, because this is something where we don't always go deep into this when we are talking about observability.

But I guess a question that comes up is, now that we're doing serverless and now that we're using managed services for a bunch of different things, what is it that we're actually monitoring for now? What are those important metrics? Because if you think about it, I don't care about CPU anymore or memory usage. My DynamoDB tables don't tell me how much CPU or memory they're using. I only know how many read units or write units I use, and as long as the latency is where I need it to be, those seem to be the metrics I care about, but those are different across all these different services. So what are we as, I guess, a monitoring community, if that's the right way to put it, what are we looking for as important metrics?

Taavi: I think there's two answers here. The first one is that serverless is essentially a layer of abstraction. It abstracts away the underlying compute resources. So what we recommend our customers monitor are user-facing things, like how fast are the responses from the backend, what's the downtime and quality of the service. Like, how well are the users actually experiencing the system? And that's the first layer that we usually recommend them to cover.

So do I have alerts on API gateways, for example, things like that. But on the other hand, it's what this microservice is costing you, for example, or how you can make it a bit quicker, those kinds of things. So really the business and the user impact is what we mainly try to monitor. I think the other challenge is just making sense of all of the data that the system is outputting. If you have tons of monitoring data, then it's about trying to extract the value from that and identify the pieces where you should really be focusing. So I think those are the challenges and the approach that should be taken. Another thing ...

Jeremy: And I think that's interesting too just from, I guess, a community or an education standpoint, that observability companies like Dashbird almost have the ability to help educate people that are using these different services on what the important metrics are. And then not only what the important metrics are, but also maybe what the baseline for those metrics should be. You know what I mean? I guess the error rate on your API you'd love to be zero, but maybe the latency, for example, if you're going from API Gateway to Lambda to a DynamoDB table, like what that average response time should look like and things like that would certainly be helpful.

So then, I guess from a more operational standpoint, I mean, there's a lot of people who equate serverless to no Ops, which it's clearly not no Ops. I mean, you significantly reduce your operations and there's many other things your operations teams could do. They could focus more on security, on automation, some of those other things, but what about the overall responsibility of some of this monitoring? So, again, I like the approach where you don't have to instrument your code, so it just happens behind the scenes. But, I guess, where does the data come in when it comes to whether it's optimizing and following those best practices or optimizing for costs or performance or whatever, or just monitoring the overall health of the application, where does that responsibility fall now? Do you consider this to be a developer tool or do you consider it to be an operations tool or somewhere in the middle?

Taavi: I think the trend we've seen in the serverless era is that a lot more responsibility actually falls into the hands of the developers. And not just on the operations side but also on the business side: it brings developers closer to the customers in a way as well, because the task is less about building undifferentiated value and more about actually solving the problem for the customer. I don't know if it's sad or a good thing, but a lot of the operational burden also seems to fall on the developers. When we talk to our users on customer calls, we usually see engineering leads, developers, or architects, and not a lot of operations or DevOps people. There are some, obviously, but I would say it's like 20 to 25%. So it's more developers, I would say. And I think a lot of what we do is still around debugging and improving the system in general, security-wise and things like that, and those things are always done by developers, from what we see.

Jeremy: That's a good point about debugging. So in order to debug your code in, again, a distributed cloud environment, is that something where you need to be using one of these tools to do that? Is that what debugging looks like?

Taavi: So, yes, but I think developers can do it with CloudWatch as well. When we break it down, there's two user stories or two ways of using it. The first is while you're developing and iterating on the application and trying to understand all the bugs and fix them. That's something where you need to iterate quickly, do deployments, and test things out. One way to do this is with us, and we do have some things like tailing and a real-time representation of what the activity looks like, which may be a bit simpler than CloudWatch is. On the other hand, you can still do a lot of that in CloudWatch, or you can try to develop locally as well. We see the bigger value in environments that are already in production and have a lot of users and a lot of load. Then the monitoring part becomes a bigger challenge, and that's where we would position ourselves more, but we still see debugging use cases as well. It's just that you can do a lot with CloudWatch as well.

Jeremy: All right, so then what about the monitoring strategy? So you said that again, you see 25% or so of the people that are jumping on your calls are operations people, and that a lot more are the developers. So from a monitoring standpoint, I mean, typically you'd be monitoring to make sure that the CPU and all the servers are running? That's how we used to do it. And you might have an Ops team that does that. So what is the strategy now, so for a serverless team that's developing a serverless application, maybe there's somebody in operations that's helping with VPCs or something like that, but what is the monitoring strategy now? Is it the developers who should be in there getting those alerts, or is it still some hybrid dev ops solution that you're trying to mix and match? I mean, how would you suggest a team use one of these observability tools so that they can make sure that their applications are running smoothly?

Taavi: So if you're going into production with your serverless application, or if you're thinking about monitoring in general, what we push users towards is setting some clear goals for the monitoring solution. If you're building a monitoring strategy, what are the core things that you should be thinking about? In our case, the most important one is the ability to quickly understand if there's an issue: to get notified quickly and to reduce the time it takes from anything happening to your development team knowing about it. There's a lot of things you can do there. You can map out the failure areas where you have the most risk, and then you can map the end-to-end flows, or monitor API endpoints, for example, or things that are really user-facing. Map those out, then set alarms for those.

So the first part is really about getting notified as quickly as possible. The second part that we think is really important with monitoring strategies is having access to the right data at the right time. If you're discovering that something is not working, you need to have the infrastructure in place to be able to understand why it's not working, to go through all of that data, to have that data available, and to show you where the problem is. I think those are the two main things for any monitoring strategy. If those are clear, then it's easy to make the lower-level decisions from there, like what types of data you need and things like that.

Jeremy: I think that's important. You said something about understanding where the risks are in your infrastructure. Because one thing you see, certainly with serverless applications, is that I have not seen very many applications that are 100% serverless. You always have some hybrid in there, you're still accessing a SQL database or a MySQL database or something like that. So I think something that's interesting is what do you do to protect against the brittle components? Is this something where you add more alerting and more monitoring to that? Or is it just another piece of your infrastructure that you treat just like you would anything else?

Taavi: Yeah. I think it's important to be aware of those areas, if they exist. If you have a SQL database or some third-party service that can be easily throttled, then designing around that, or with that in mind, is definitely important. Treating it as a specific failure point and paying more attention to it is, I think, necessary, and that's what we recommend doing as well.

Jeremy: Awesome. Well, so I guess my last question, just, I love to ask people this question. The future of serverless ... and this has been asked and answered a million times, and everyone seems to have a different answer to it, but just because again, I like how you're thinking about this approach to it. Are we going to see serverless dominating the Cloud world? I mean, is it just going to be the way things are ... or, well, let me take a step back, ask you this question: what's the future of serverless, what's it going to look like five years from now?

Taavi: So the way we see it, the future we're building for is one where developers construct their applications out of Lego pieces and do very little coding, or focus only on the differentiated value that their organization brings, and have at their disposal a lot of tools that they can just kind of piece together and use. And I think the gravity in cloud is definitely toward serverless: it's faster to build on, the pricing is based on the actual usage, and it's much simpler to understand the different components as well. So that's what I hope will happen, and it won't just be AWS. It will be this entire ecosystem: Google Cloud, Microsoft Azure, and a lot of third-party services will play into it. There will be a lot of different tools that you can choose from, but they'll all be managed and single-purpose.

Jeremy: Yeah. No, I love that. I hope for the same thing and I think you're right. I think the ecosystem will continue to expand, and Azure is doing some great things in terms of what they're building out for serverless. So it will be really interesting, because having this conversation five years from now could be completely different, but anyways. Well, Taavi, thank you so much for joining me and sharing all this knowledge. And I mean, again, the product direction you have at Dashbird is really interesting, it's a really cool approach. I love that idea of just trying to make sure that you implement those best practices and give people the tools to do that. So if people want to find out more about you or figure out what you're up to and find out more about Dashbird, how do they do that?

Taavi: Sure. So dashbird.io is where you can contact me or our team as well. And my Twitter is @rehemagi, so feel free to reach out, and I'm happy to chat about serverless anytime.

Jeremy: Awesome. Well, I will put all that in the show notes. Thanks again, Taavi.

Taavi: Thank you.

This episode is sponsored by New Relic and Epsagon.