Episode #2: Building Resilient Serverless Applications with Nitzan Shapira

June 24, 2019 • 34 minutes

Jeremy chats with Nitzan Shapira from Epsagon about building resilient serverless applications, what can go wrong with serverless, and what we should do to make sure our applications are working as expected.

About Nitzan Shapira

Nitzan Shapira is the co-founder and CEO at Epsagon, a distributed tracing product that provides automated monitoring and troubleshooting for modern applications. Nitzan writes for his own blog, as well as the Epsagon blog as a frequent contributor. You can find him speaking and helping out at serverless events across the globe, including Tel Aviv, where he recently organized the city’s June 4th ServerlessDays event. In addition to his contributions to the serverless community, Nitzan has more than 12 years of experience in programming, machine learning, cyber-security, and reverse engineering.


Jeremy: Hi everyone. I'm Jeremy Daly and you're listening to Serverless Chats. This week, I'm chatting with Nitzan Shapira. Hey, Nitzan. Thanks for joining me.

Nitzan: Thanks for having me.

Jeremy: You are the CEO and co-founder of Epsagon, one of those hot serverless startups out of Israel. Why don’t you tell the listeners a little bit more about yourself and what Epsagon is up to.

Nitzan: Yes, definitely. As you mentioned, I'm one of the founders and the CEO of Epsagon. I'm based out of Israel and San Francisco, currently kind of in between. I'm an engineer, a computer engineer with a background in cyber security and embedded systems. It's more of a low-level background. In recent years also, of course, [I've worked with] the cloud all the way to serverless. Epsagon is a company focused on monitoring and troubleshooting for modern applications. So the entire field of cloud applications that are built with microservices, serverless, managed services, where you don't have access to the host, very distributed - how do you understand what's going on in your production? How can you troubleshoot issues as fast as possible? Do it automatically and in a way that is suitable for this kind of modern environment. For example, using agents is something that you cannot do.

Jeremy: I wanted to talk to you about building resilient serverless applications. I think you have the right experience for this with what you do. But now that we're building serverless applications, and we're going beyond traditional applications as well as traditional microservices - if microservices can be considered traditional - you're starting to break things down into multiple functions. You obviously are using a lot of third-party services or managed services from the cloud provider. My question here to get us started is what is the main difference between a traditional application, whether server-based or container-based in microservices, and moving to this serverless environment?

Nitzan: Sure. I think the main difference is that a lot of the things are out of your control now, which is a good thing, because this is what you want when you go serverless. But on the other hand, you lose control over some of the things that are going on in your application. So when things don't go well, it can be very difficult to know where they broke. Then if you want to build something that's resilient, that's going to work at high scale, with very high reliability and without many surprises, you really have to think about all the different scenarios that can go wrong, which is not just "my code had an exception." Maybe I got a timeout; I got an out-of-memory condition; I got a series of events that didn't go well - asynchronous events, perhaps - and it seems that everything worked but it actually didn't. How do I know about these problems, even if everything seems okay? The number of problems that can happen is just growing when you go serverless.

Jeremy: I think that makes a ton of sense. Why don't we dive into this and start talking about some of these individual problems or some of these differences, and maybe we can start with troubleshooting? What's different when you're troubleshooting a serverless application versus a more traditional, server-based application?

Nitzan: There are several key differences. The first one is that when you go serverless, you go distributed in a very significant way, more than with containers, for example, because those functions are kind of nanoservices. When you combine them together, we are seeing organizations with more than 5,000 functions, which is just a very high number of nodes in the graph, if you look at it this way. It's very, very distributed. When something breaks, usually there are many more components involved in the chain of events, so it's going to be much more complicated to track what happened and find the cause of the problem. So being distributed would be a very important thing.

The other thing is the new things that can go wrong. All those timeouts, all those out-of-memory conditions, they happen all the time and [it's] very, very difficult to predict them. It's not something people are used to when they work with traditional services. And finally, the possibilities that you have as an engineer or DevOps to understand what's going on in your application are again more limited because you have no access to the host, so you can't install agents and so on. All you get is basically the basic logs and metrics that the cloud providers give you, which makes it even more difficult to know what's going on in the application layer and not just the simple metrics, because they are usually not going to be enough to troubleshoot a complicated problem.

Jeremy: Yeah, and I think with something like Lambda, or any function as a service, these are ephemeral compute. You have many execution environments, or containers, spinning up in the background, but those go away. You can't go back and look at the logs and see what that server did. And really the only logs available to you are the ones dumped to CloudWatch, for example, and those are only there if your application actually sends logs. It's not logged automatically.

Nitzan: That's exactly the challenge, because once bad things happen, usually you didn't think about them before, and then you don't have the information that you're looking for in the log. Then you also don't have anywhere to connect to, to investigate, because, as you mentioned, it's ephemeral. That makes things very difficult because you can't think about everything that can possibly happen and put it in the log. On the other hand, you really have nowhere to go to after the thing happens. So you don't really have anything to do, just by using the logs. This is basically the conclusion.

Jeremy: Also, if you're using a number of remote services or managed services from the provider, where does the debugger go there? How do you see the flow of information? You have a lot of events. You have these highly event-driven applications with information flying all over the place. How do you keep track of that? Where do you see those logs?

Nitzan: Generally you don't see it, and that's the big challenge. This is, of course, why we are building a tool to help you. Generally speaking, the events that are going through the system are usually much more meaningful than the logs, from what we saw. If you actually know the events and data that are flowing between different components, that's going to tell a very good story of what happened from the request until the problem that happened. [That] can really help you troubleshoot. These events are not going to be in the log unless you specifically wrote them in the log, but usually, this is not the case. Getting those events is something that can really help.

Jeremy: Let's move on to the things that can go wrong in a serverless environment. Obviously, if you're using compute like Lambda functions or Microsoft Azure Functions, you have your normal code execution errors. You're going to get a "can't connect to a resource" or "can't parse a string," and those will be logged and those will be available to you. But when you start dealing with distributed systems and you're thinking about connecting to SNS topics or SQS queues or other types of managed services, what are the things that can go wrong there, in distributed systems in general, and maybe more specifically in serverless environments?

Nitzan: The things that are more specific are that, in the past, usually you had one big monolithic application, and when something went wrong, it would produce an error and something written to the log, and that would be pretty much the story of what happened. Now, when you are talking event-driven and distributed, in many cases, it's very asynchronous. For example, one function can perform perfectly fine, produce a message to some SNS topic, and then another function will get this message a little bit later, and then it will fail, but everything seemed fine. So the problem is actually that the message did not follow the right contract, for example, between the two services. It's very difficult to see what went wrong, because if you look at each function, everything seems right. I mean, this one did something right; the other one failed as it should have, but why was the message like this? Because the two teams that wrote these two services didn't actually coordinate together. Suddenly you have another thing that can go wrong, which is actually the agreement on how we communicate, [and] how we transfer messages and events between services. These things were not issues in the past because it was all just functions calling other functions, and you're in the same binary process. So this is something very new.
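
A minimal sketch of the contract problem described above, assuming a hypothetical "order created" message whose field names and types (`order_id`, `amount_cents`, `currency`) are invented for illustration. The producer checks the message against the shared contract before publishing, so a mismatch fails at the source rather than later in a different function:

```python
import json

# Hypothetical contract for an "order created" message: field name -> expected type.
# Both the producing and the consuming team would import this same definition.
ORDER_CREATED_CONTRACT = {"order_id": str, "amount_cents": int, "currency": str}


def contract_violations(message: dict, contract: dict) -> list:
    """Return human-readable problems; an empty list means the message conforms."""
    problems = []
    for field, expected_type in contract.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(message[field]).__name__}"
            )
    return problems


def serialize_for_publish(message: dict) -> str:
    """Producer side: refuse to emit a message that breaks the contract."""
    problems = contract_violations(message, ORDER_CREATED_CONTRACT)
    if problems:
        raise ValueError("; ".join(problems))
    return json.dumps(message)
```

The consumer can call `contract_violations` on the parsed message as its first step, which turns a confusing downstream failure into an explicit "the contract was broken" error.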

Jeremy: The other thing you have that's different is this idea of the retry behavior. I think most people are familiar with synchronous invocation of a resource where you make a call, it does something and then you get a response back, and maybe that's an error, and then you can deal with it there. But now, as you move to serverless, you start dealing with things like asynchronous or stream-based processing, and with asynchronous certainly, your code that calls that resource doesn't know what happens. It just gets a response back that says all right, I got your event, and then something happens down the line. What's the impact of this retry behavior on serverless applications and how we think about it?

Nitzan: One of the things I'm talking about in conferences, such as the [ServerlessDays] one we had in Boston and you invited me to, is the fact that these retries are something that [is] kind of considered a good practice by cloud providers to recover from errors. For example, if the function fails, let's try to run it two more times and then see what happens. If it's an SNS message, we're going to run it two more times as long as the message is new enough. That's something that is not really written in any programming book or software design book, but this is something that the architects of AWS thought would be a good idea. And it is a good idea sometimes, but for the developer, it can be very confusing. When it happens, usually it's very confusing because you just didn't know that this is the same invocation, running one or two more times, and when you have to think about it, it is very difficult to plan. This is where a concept such as idempotency comes into play: how are you supposed to write code that can run multiple times without having a bad effect or bad things happen? Eventually, it comes to the fact that people can't really plan an application that will be retried as many times as wanted with everything going right. It's basically kind of a constraint that you have to live with. You need to try and take it to your advantage when possible, but most of the time, I think many people would prefer to just do it the standard way. So don't try and run my code again without telling me, because I'm not sure what's gonna happen.
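
One common way to make a handler safe under retries is to key each event with an idempotency key and remember the result. The sketch below is only an illustration under stated assumptions: the `idempotency_key` and `amount_cents` event fields, the `charge_fn` callback, and the in-memory store are all hypothetical, and a real deployment would need a durable store with an atomic "write only if absent" operation, because the get-then-put shown here is not safe under concurrent invocations.

```python
class InMemoryIdempotencyStore:
    """Stand-in for a durable store; a dict is enough to show the idea."""

    def __init__(self):
        self._results = {}

    def get(self, key):
        return self._results.get(key)

    def put(self, key, result):
        self._results[key] = result


def charge_once(event, store, charge_fn):
    """Run charge_fn at most once per idempotency key, even if the event is retried."""
    key = event["idempotency_key"]  # hypothetical field chosen by the event producer
    previous = store.get(key)
    if previous is not None:
        # A retry of an event we already handled: return the saved result, no side effect.
        return previous
    result = charge_fn(event["amount_cents"])
    store.put(key, result)
    return result
```

With this shape, a platform retry of the same event returns the stored result instead of performing the side effect a second time.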

Jeremy: Yeah, and I think that the thing that's important that you mentioned about idempotency is that obviously if your transactions are getting retried or your events are being replayed multiple times by the cloud provider, your code has to deal with that. For certain transactions, it might not be that big of a deal. But if you are dealing with financial transactions, for example, you don't want those to retry or to submit the same, maybe, charge requests multiple times. If you think about the basics of the retry policies or how those work, the two times for an asynchronous Lambda event makes sense. If you're dealing with an SQS queue, then you have redrive policies that you can put in place so that the message will only be tried a few times, that way messages don't get stuck forever. But if we're thinking about the redrive here, or maybe just the asynchronous invocation of a Lambda function, what do we do when that Lambda function gets tried three times and then it fails?

Nitzan: First of all, you need to know about it, which most people don't, because again, you have no indication, and the log is not going to tell you. So knowing the fact that something broke and it was retried is going to be very important when it actually happens. I think when you use a service, you need to know the properties and the limitations of that service. If you are writing a Lambda function that's triggered by a Kinesis stream, you need this Lambda to do stream processing, because if it's doing something else with the data it's probably not the right thing. If you need the Lambda to actually take the data, process it, and send it somewhere, then usually you wouldn't mind if it happens again. So I think it's possible to write microservices or nanoservices in Lambda functions or any other service for that matter that are contained enough to be able to handle retries. Then when you combine them together, in theory, it should work. The problem is that people just connect these services to each other without thinking, and they have hundreds of Lambdas, with many Kinesis streams and SNS topics, and everything is running around, and it's not really working the way I suggested, of course, because people develop software fast. But if you had the time, you could actually plan every service to be working the right way.

Jeremy: If you're dealing with these failed events, obviously there's dead letter queues, as part of AWS at least, where you can put a dead letter queue or attach a dead letter queue to a Lambda function. If it's invoked asynchronously and it fails the three times, then that event goes into that queue there. And you can do the same thing with an SQS queue, for example, where if something fails after a certain number of times, your redrive policy will move that into a dead letter queue as well. Then, you have this issue where now you have dead letter queues or multiple queues with events living in them. You have to inspect those events. You have to set up alarms, so you know those events are in there, and then you potentially need a way to replay those events. So what about using something like Step Functions? What are the advantages of using a state machine or using something like AWS Step Functions?
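
The redrive behavior described above can be simulated locally. The following is a toy model, not AWS code: the function name `drain_with_redrive`, the `max_receive_count` parameter, and the shape of the dead-letter records are assumptions made for illustration. A message that keeps failing is retried until it has been received `max_receive_count` times and is then set aside with its error, where it can be inspected or replayed.

```python
from collections import deque


def drain_with_redrive(messages, handler, max_receive_count=3):
    """Process messages in order; park any message that fails max_receive_count times."""
    queue = deque((msg, 0) for msg in messages)
    dead_letters = []
    while queue:
        msg, receives = queue.popleft()
        receives += 1
        try:
            handler(msg)
        except Exception as exc:
            if receives >= max_receive_count:
                # Give up: keep the message and the reason so it can be inspected later.
                dead_letters.append(
                    {"message": msg, "error": str(exc), "receives": receives}
                )
            else:
                # Put it back on the queue for another attempt.
                queue.append((msg, receives))
    return dead_letters
```

The returned list plays the role of the dead letter queue: an alarm would fire when it is non-empty, and a replay would feed its `message` values back through `drain_with_redrive` once the handler is fixed.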

Nitzan: Step Functions has several advantages. First of all, it has the advantage of being asynchronous, so you can actually have several functions - almost calling each other, but not really calling each other - passing events asynchronously, so you wouldn't wait and pay for the accumulated running time of the functions. That would be one advantage. The second advantage is that it allows you to actually implement different rules and mechanisms in how the application is working that are really a bit difficult to do without it. You can say that if a certain event happened, only then you invoke this function or the other function, so eventually you don't really have to be coupled directly to the data. You can process it in different steps that will allow you to - I think this is a good example of resiliency, because using Step Functions the right way can really scale very nicely, because every step in the state machine will generate an event for the next step. So you don't have to worry about everything at once. You can kind of split your application logic into smaller steps, each of which is much more likely to succeed. On the design level, anyway, this is how I look at it. You can use something that helps you split your logic in a very accurate way. You just decide exactly what you wanna do in each step.
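
To make the "small steps with rules between them" idea concrete, here is a sketch of a state machine definition written in Amazon States Language and built as a Python dict. The workflow itself (state names, the `$.valid` field, and the placeholder Lambda ARNs) is hypothetical. The small helper is not part of any AWS SDK; it only checks that every transition points at a state that exists.

```python
import json

# Hypothetical order workflow. REGION and ACCOUNT_ID are placeholders, not real values.
STATE_MACHINE = {
    "Comment": "Illustrative workflow: validate, branch on the result, then charge",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:validate-order",
            # Retry this one step on failure instead of rerunning the whole flow.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "OrderFailed"}],
            "Next": "IsValid",
        },
        "IsValid": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.valid", "BooleanEquals": True, "Next": "ChargeCustomer"}
            ],
            "Default": "OrderFailed",
        },
        "ChargeCustomer": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:charge-customer",
            "End": True,
        },
        "OrderFailed": {
            "Type": "Fail",
            "Error": "OrderFailed",
            "Cause": "Validation failed or an unhandled error occurred",
        },
    },
}


def undefined_targets(definition):
    """Return the sorted names of states that are referenced but never defined."""
    states = definition["States"]
    referenced = {definition["StartAt"]}
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                referenced.add(state[key])
        for rule in state.get("Choices", []) + state.get("Catch", []):
            referenced.add(rule["Next"])
    return sorted(referenced - set(states))


definition_json = json.dumps(STATE_MACHINE, indent=2)
```

Each state does one thing, and the retry and branching rules live in the definition rather than inside the function code, which is the split Nitzan describes.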

Jeremy: I love Step Functions because it gives you that ability to do function composition, like you said. When you start thinking about individual functions or single functions that do one thing well, making those all talk to one another and creating the choreography for that is sort of a difficult thing to do. You start to introduce Step Functions into the equation, and now you have a Step Function acting like a traditional monolithic application where it can call subroutines and aggregate that information together and then sort of do something with it. I think Step Functions are certainly a really interesting way to solve that function composition problem. They do get expensive, which is something to think about, depending on how you're designing your application and what level of control you need. But let's move on to something that you're very, very familiar with, which would be monitoring a serverless application. How do we go about doing that? How do we monitor a serverless application?

Nitzan: Yeah, sure. Different ways. You can do it in a simple way and in a more complex way, depending on the complexity of your application, of course. If you have just a few functions, I would recommend using whatever AWS provides because it's already there. You have CloudWatch, so it will provide you with logs and metrics that will allow you to identify pretty quickly if something failed, and then go to the log and find out what happened. That, in almost 100% of the cases that we are seeing, is the first step. Then, the second step would be when you're going to a few more functions, and I would say 20 or more, suddenly they start to get connected to each other. You start to create some kind of a distributed application. That's where the individual logs and metrics will not really tell the story, because they only provide information about individual components. Many people, at that point, will aggregate the logs somewhere, so they will just stream the logs into a log aggregation service, such as ELK or anything else. This would allow them to search in the logs and hopefully find problems faster. Then, eventually, you have a distributed application, and in order to really understand what's going on there, especially the more complex stuff, you need some kind of a distributed tracing technology. Distributed tracing is basically knowing how different services are sending messages to one another and how it's all connected from end to end. Some companies will implement some techniques of tracing in the logs, so you can have identifiers in the logs as requests go through. Then, you can search for them in your log aggregation tool, so this would be probably the last step before using a dedicated solution for that. It can work pretty well, and we saw people do it at very high scale, with hundreds and thousands of functions. But at some point, there is also the question of how much time you want to spend.
It's going to take you a lot of time to implement different tools, especially based on logs. The whole point of serverless, of course, is developing fast. Eventually, if you're spending 30% or 50% of your developers' time doing that, that's where we recommend considering an automated solution that will do as much of the work for you as possible. Eventually, your hope as a developer, as a development manager, is that your developers will focus on building software that matters to your business and not building software that helps you monitor your business software. This would be when you get to a high scale; usually this is where people look for a solution.
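
The "identifiers in the logs" technique mentioned above can be sketched as follows. The field names (`correlation_id`, `service`, `message`) and helper names are assumptions for illustration, not any particular vendor's format. Each service writes JSON log lines carrying the same ID, and a search over aggregated lines then reconstructs one request's path:

```python
import json
import uuid


def new_correlation_id() -> str:
    """Create an ID at the edge of the system and pass it along with every event."""
    return str(uuid.uuid4())


def log_line(correlation_id: str, service: str, message: str, **fields) -> str:
    """Build one structured log line that carries the correlation ID."""
    record = {
        "correlation_id": correlation_id,
        "service": service,
        "message": message,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


def lines_for_request(log_lines, correlation_id):
    """What a log aggregation search would do: keep only the lines for one request."""
    parsed = (json.loads(line) for line in log_lines)
    return [record for record in parsed if record.get("correlation_id") == correlation_id]
```

The catch, as discussed above, is that every function has to remember to forward the ID and log it, which is the manual effort that grows with the number of functions.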

Jeremy: You mentioned some of the tools that AWS has. So they have X-Ray that does some tracing; obviously, CloudWatch Logs does logging. Maybe explain the difference between those two things and why tracing is an important component.

Nitzan: Logging is pretty simple. Logs are usually text. A log is a text file in some way, or text data that is written either by the developer intentionally or produced automatically from some system that produces logs, and then you get text. So textual data is very common. It's everywhere - but it's still text, so it's not structured. It's not even JSON. JSON, for example, is formatted. It's structured. You can say I want this field. I want this hierarchy. Logs are eventually going to be text-based files. Then using logs, you can do many things, right? Tracing is the way to, again, trace. What is it to trace? Let's say you got an HTTP request. A trace will go through the lifetime of the request, so we can go from one service to another service to the next service. That will be a trace that's going through my system and tells the story of what happened. The data of the trace covers which services were involved, what data was transmitted, how it was transmitted, how much time it took to transmit it from each point, and eventually the order of the events. This will be a trace that really can tell you what happened at every point of the way, and you can put it on a timeline to identify bottlenecks, to identify where you spend your time. One of the things that we are doing is actually we take the logs and we put them on the trace. For us, the log is just another type of data that, of course, is textual, but it's very useful. It's much more useful if it's in the right context and on the right timeline. So you have five services. You get trace data, you get log data, you get latency. Everything is like a story. So I would say the trace is structured and it's time-based. A log is textual data that somebody will have to structure in order to understand.
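
As a rough illustration of a trace being "structured and time-based," the sketch below models a trace as a list of spans and puts them on a timeline. The `Span` fields and the sample services are invented for this example and do not reflect Epsagon's or any other tool's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str      # which component did the work
    operation: str    # what it did
    start_ms: int     # offset from the start of the request
    duration_ms: int  # how long the operation took


def timeline(spans):
    """Order spans by start time so the trace reads as a story."""
    return sorted(spans, key=lambda span: span.start_ms)


def bottleneck(spans):
    """Return the span where the request spent the most time."""
    return max(spans, key=lambda span: span.duration_ms)


# Hypothetical trace for one HTTP request flowing through three components.
trace = [
    Span("worker", "process_message", start_ms=120, duration_ms=900),
    Span("api", "handle_request", start_ms=0, duration_ms=40),
    Span("sns", "publish", start_ms=40, duration_ms=15),
]
```

Because each span has named fields, questions like "what ran first?" or "where did the time go?" are simple queries, whereas free-text logs would need parsing before they could answer either.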

Jeremy: So how do frameworks like OpenTracing and OpenCensus help with all of this?

Nitzan: These frameworks are very useful as a standard way to write your tracing data. People said, "Okay, there is this concept called tracing. How can I standardize it so people won't have to invent it every time?" OpenTracing, for example, will give you a standard way to create traces, to create spans, to create all those things that eventually tell you the story of what's going on in your system. Then you can implement, in your code, ways to send trace data in the OpenTracing format to some backend that will analyze this data, display it, provide information. These are just ways to standardize the traces and, for example, at Epsagon, we make sure that our traces are OpenTracing compatible. So if someone wants to add their own manual traces, they can easily do it without worrying about the format. You know, people want to be compatible. Eventually, they always prefer to be compatible.

Jeremy: That makes a ton of sense. Let's talk about X-Ray again for a minute. [With] X-Ray, you go in, you instrument your code, then, as your functions run, it samples it, and you can see calls to databases or calls to other resources, and the latency involved there. But what about calls to an SQS queue and then the function that processes it and then that sends it somewhere else — that flow of data. Can X-Ray show you all that information?

Nitzan: You can do it to some extent. X-Ray will integrate pretty well with the AWS APIs inside the Lambda function, for example, and will tell you what kind of API calls you did. It's mostly for performance measurements, so you can understand how much time the DynamoDB putItem operation took or something of that sort. However, it doesn't try to go into the application layer and the data layer. So if information is passed from one function to another via an SNS message queue and then going into an S3, triggering another function - all this data layer is something that X-Ray doesn't look at because it's meant to measure performance. That's why it would not be able to connect asynchronous events going through multiple functions. Because again, this is not the tool's purpose. The purpose is to, again, measure performance and improve the performance of certain specific Lambda functions that you wanna optimize, for example.

Jeremy: You mentioned automation a few minutes ago, and I think that's a really important concept in terms of instrumenting your functions so that they do the proper tracing and logging. Obviously, in a more traditional application or a monolithic application, you might include your libraries and some of that stuff. But now we're talking about every single function needing to include this instrumentation. And that, in my opinion at least, is sort of a burden for developers - but also pretty easy to forget. You know, say I gotta go back and add this, or maybe even it's a matter of which level of logging you've got switched on. What are some of the options for developers and for companies that want to create these policies to make sure that these are automatically instrumented? Obviously, there's Lambda Layers, which is a possibility. But what are some of the other options to auto-instrument functions so that the developers don't have to worry about it?

Nitzan: Yeah, by the way, it's not just worrying. It's not just the fact that you can forget. It's also just going to take you a certain amount of time - always - that you're going to basically waste instead of writing your own business software. Even if you do remember to do it every time, it's still going to take you some time. One way that can work [is] embedding it in the standard libraries that you work with. If you have a library that is commonly used to communicate between services, you want to embed that tracing information or extra information there, so it will always be there. This will kind of automate a lot of the work for you. That's just a matter of what type of tool you use. If you use X-Ray, you're still going to have to do some kind of manual work. And it's fine, at first. The problem is when you suddenly grow from 100 functions to 1,000 functions - that's where you're going to be probably a little bit annoyed or even lost, because it's going to be just a lot of work and doesn't seem like something that really scales. Anything manual doesn't really scale. This is why you use serverless, because you don't want to scale servers manually.
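
One way to picture "embedding the tracing in a shared library" is a decorator that the common library applies to its own functions, so callers get tracing without writing any. This is a minimal sketch under stated assumptions: `traced`, `RECORDED_SPANS`, and `publish_order` are hypothetical names, and the list stands in for a real tracing backend.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a tracing backend; a real library would ship these out


def traced(name):
    """Wrap a function so every call records its name, duration, and any error."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise  # tracing must not swallow the caller's exception
            finally:
                RECORDED_SPANS.append(
                    {
                        "name": name,
                        "duration_ms": (time.perf_counter() - started) * 1000,
                        "error": error,
                    }
                )

        return wrapper

    return decorator


@traced("publish_order")
def publish_order(order_id):
    """Hypothetical shared-library function that services use to talk to each other."""
    if not order_id:
        raise ValueError("order_id is required")
    return {"published": order_id}
```

Because the decorator sits in the shared library, every service that calls `publish_order` is instrumented the same way, and nobody has to remember to add it per function.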

Jeremy: And with Epsagon, you have a way to instrument the functions automatically, correct?

Nitzan: Yes, definitely. That's one of the things we do. We actually use Lambda layers, which you mentioned, and you can set that up in probably less than a few minutes. You will be up and running with the distributed maps of Epsagon, automatically traced and produced, because we know how to add a layer to your functions through the Epsagon dashboard with one click. This layer goes and instruments the function. It produces all those events and traces from the code while it's running, and our backend can then identify how everything is connected. It's automatic in a way that, in many cases, you don't even have to change the code on your end. So that's very convenient for the developers.

Jeremy: Yeah, definitely, especially if you have 100 Lambda functions already written. You don't want to have to go back into every single one of those and add some new type of instrumentation. But anyway, Nitzan, thank you so much for being here. It is great that you are sharing all your knowledge with the serverless community, and Epsagon's doing a great job. If anybody wants to get in touch with you, how would they go about doing that?

Nitzan: Yeah, sure. First of all, my email is nitzan@epsagon.com. Very simple. I'm on Twitter. It's @NitzanShapira, my full name, and you can also just get in touch with me on LinkedIn. It's pretty easy. I'm very responsive. So, of course, you can check out the Epsagon website. I have a bunch of blog posts that I usually publish there as well.

Jeremy: Awesome. I will get all of that into the show notes. Thanks again, Nitzan.