Episode #43: The State of Serverless Report with Stephen Pinkerton and Darcy Rayner

April 6, 2020 • 57 minutes

In this episode, Jeremy chats with Stephen Pinkerton and Darcy Rayner about how organizations are adopting serverless, what enterprises like Datadog are doing with it, and what comes after serverless.

About Stephen Pinkerton

Stephen Pinkerton is a Product Manager at Datadog. Stephen has also held roles in product strategy and software engineering, working with teams at Google's Nest Labs, Facebook, Cloudflare, Square, and Monzo to ship products, build distributed microservices, debug real-time embedded devices, develop features for modern frontend apps, and create data pipelines.

About Darcy Rayner

Darcy Rayner is a Software Engineer at Datadog, and previously worked as Lead Software Engineer at Two Bulls, a boutique software development firm, running the front-end chapter. Darcy’s projects boast clients like Disney, PBS, LIFX, Verizon and the Linux Foundation.

Datadog’s Research Report: The State of Serverless: https://www.datadoghq.com/state-of-serverless/



Jeremy: Hi, everyone. I'm Jeremy Daly, and you're listening to Serverless Chats. This week, I'm chatting with Stephen Pinkerton and Darcy Rayner. Hi, Stephen and Darcy, thanks for joining me.

Stephen: Hey, how's it going? Thanks for having us on.

Darcy: Hi.

Jeremy: So Stephen, you are a product manager for Serverless at Datadog, so why don't you tell the listeners a little bit about what Datadog does and a little bit about your background.

Stephen: So Datadog is a company that lets you monitor all of your servers, all of your application performance, if your website's up or down, all your application logs in one place and it joins all these disparate data sources together, so that whenever you need to explain something or debug something, you can join all of this different data and really figure out root cause quickly. So I've been working here about a year, focusing on our serverless integration, so that's helping our customers running products like AWS Lambda, to be successful in deploying new services built on top of serverless and then debug issues with their applications.

Jeremy: Awesome. Darcy, you are a senior software engineer for the Serverless Team at Datadog, so why don't you tell us about your background and what your role is at Datadog.

Darcy: Sure. So I've been at Datadog about a year in the Serverless Team. Before I joined Datadog, I was working at an agency. So we were massive serverless adopters in everything we did. Everything was about getting stuff off the ground and running very quickly, at low cost to our customers. But while I was there, I realized that there was still a bit of a gap in terms of the monitoring story. So I joined Datadog about a year ago to work on some of the integrations that we're building here with services like Lambda or Azure Functions or GCP Functions.

Jeremy: Awesome. So a couple of weeks ago, Datadog came out with this very, very cool report called the State of Serverless. You basically looked at a bunch of your clients, went through and figured out how they were using serverless, broke it all down. This is really great, can you, maybe Stephen, can you give me some background on what was the reason for running this or for putting this report together?

Stephen: So we're in a unique position where customers of all different sizes, with all these different use cases are sending all their data to us, and we frequently get questions when we're on the phone with them of, "How do I run serverless successfully?" So this is from customers who are moving workloads into serverless, or they're 100% serverless and they're asking us, "Which metrics do I pay attention to, or how do I get data out of serverless, or how do I run this in a cost efficient way?" So we frequently get these questions, and the report was a way for us to look at data across all of our customers, across all these different data sources that we have and say, "Here's exactly how people are running on serverless." Which technologies are they using, what are they monitoring with it? So it was a really interesting opportunity for our customers to see how other people are using serverless really in a data driven way.

Jeremy: It's interesting, because you do mention in the beginning of the report that you're saying "Serverless," but you are just focusing on FaaS. So actually, I'd love to get your thoughts on this, Darcy. Just considering what serverless is as a whole, what do you consider that to be, because it's more than FaaS.

Darcy: In terms of the things we're looking at specifically in this report, it's very focused on Lambda. I think in general, we have people who come to us asking us for solutions for things like ECS, Fargate, things like Knative or Google Cloud Run, which are not necessarily following the per-invocation model in terms of pricing and cost structure. And they're not necessarily building containerized services directly around functions, but they're adjacent. They have some of the pieces. The ability to very quickly spin up resources on demand and the ability to have event-driven architectures, this is something I think we see across different solutions.

Jeremy: Nice. That makes sense. I want to get into the study itself, but I do think it's important, because I know someone who's tried to run a survey in the past, that the methodology is important. We want to know who the people are that are answering these, which way they might skew based on their population and so forth. So the report actually did a great job outlining this, but just because I'd like to go through these findings, it would be great if we could just talk about that methodology for a second. Let's start with the population, so this was just Datadog customers?

Stephen: Yeah. The claims that we make and the data that we looked at are across all of these Datadog customers. We don't have data on someone who's not a Datadog customer, so for all of this, we looked at our customers' metrics, their trace data.

Jeremy: It just seems, and your customers are obviously more cloud savvy. So we're not looking at all enterprises here, just the ones that are probably much more cloud savvy, using Datadog.

Stephen: Yeah, that's correct.

Jeremy: Great. Then we also talk about Lambda adoption in here, and that's one of those tough things too where, what does it mean to adopt Lambda? Can you explain what that means?

Darcy: So Lambda adoption, we consider it to be any account in AWS which is running more than five Lambda functions a month. That was the cutoff point where it's like maybe they have one or two people experimenting with it; after five, we considered it as being regularly run. We considered that to be a company that's adopting Lambda.

Jeremy: And that ties into the AWS usage, right? In order for a company to be using AWS, they would have to be running some workloads in that cloud?

Darcy: Yeah. Our broader definition of AWS usage included both anyone who's currently using Lambda, but also we looked at any organization that had more than five EC2 instances running in a given month.
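
Editor's note: the two definitions described above can be written down as a simple classification rule. The Python below is a minimal sketch and not the code behind the report; the `MonthlyUsage` type, its field names, and the `adoption_rate` helper are hypothetical, and it assumes that "currently using Lambda" means meeting the same more-than-five cutoff.

```python
from dataclasses import dataclass

# Cutoffs quoted in the conversation: "more than five" in a given month.
LAMBDA_FUNCTION_CUTOFF = 5
EC2_INSTANCE_CUTOFF = 5


@dataclass
class MonthlyUsage:
    """Hypothetical per-account usage summary for one month."""
    lambda_functions: int  # Lambda functions run during the month
    ec2_instances: int     # EC2 instances running during the month


def has_adopted_lambda(usage: MonthlyUsage) -> bool:
    """Adoption: more than five Lambda functions run in the month."""
    return usage.lambda_functions > LAMBDA_FUNCTION_CUTOFF


def is_aws_user(usage: MonthlyUsage) -> bool:
    """AWS usage: a Lambda adopter, or more than five EC2 instances.

    Assumption: "currently using Lambda" is read as meeting the adoption cutoff.
    """
    return has_adopted_lambda(usage) or usage.ec2_instances > EC2_INSTANCE_CUTOFF


def adoption_rate(accounts: list[MonthlyUsage]) -> float:
    """Share of AWS users that have adopted Lambda (0.0 if there are none)."""
    aws_users = [a for a in accounts if is_aws_user(a)]
    if not aws_users:
        return 0.0
    return sum(has_adopted_lambda(a) for a in aws_users) / len(aws_users)
```

Under this reading, an account with exactly five functions does not count as an adopter, and an account with no Lambda usage still counts as an AWS user once it runs a sixth EC2 instance.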

Jeremy: So this is still covering fairly small customers too. Five EC2 instances is relatively low, so this gives a nice broad perspective of these small customers plus large customers. Then that brings us to scale of the environments, so how did you estimate the scale of the environments?

Stephen: When we talk about customers being small, medium or large, we look at the scale of the other infrastructure that they're running. So we might be looking at companies that just have five Lambda functions or five EC2 hosts, but the way that we talk about them in the report is based on how much other infrastructure do they have. So they might have a small Lambda footprint or they might have a small EC2 footprint, but they might have a very large footprint using containers, ECS, Fargate, et cetera.

Jeremy: Awesome. Let's jump into this, so the first finding in this report was that half of AWS users have adopted Lambda. So that means that of all the AWS customers you have, 50% or more, I think it's 53% or something like that, that they are using more than five Lambda functions.

Stephen: Yeah. I think this one was very surprising in that organizations are aware of Lambda, and more importantly, they're using it. So from a business perspective, we found this really surprising, because these leaders in companies are bringing in Lambda because it lets their teams move a lot faster, shipping products a lot faster. It also means from a development perspective, there's not a team you need to go to talk about who's going to monitor your code, or you don't need to request new servers, new hosts from your procurement group within your company. Lambda is just a really easy way to get started, so I think that's why we see it used across so many organizations.

Darcy: I think there is a lot less red tape in getting a Lambda function approved than, say, spinning up an EC2 instance. So even in large traditional organizations, like banks or financial institutions, there is some adoption that's happening there, just because it's a lot easier to get products off the ground and running.

Jeremy: One of the points in the data is it shows that it's up from about 21% in 2018, so that's more than doubled in two years. Is this something that, the trendline looks like it's continuing to grow, so is this something where we think the vast, vast majority of customers are going to be using Lambda say by 2022?

Stephen: I think for a number of workloads and use cases, absolutely we're going to see this used everywhere. I think not just Lambda as well, just any type of these serverless products. The report is very focused on Lambda, we do get into what products are people using with Lambda, but people are starting to realize the value of a serverless database and serverless message queues, and the whole ecosystem is definitely here to stay, but the way that you're running your code might be changing.

Jeremy: Absolutely. Darcy, what types of use cases are you seeing with these Lambda functions?

Darcy: I think there's a pretty big variety. A lot of the companies dipping their toes in Lambda, they start very small. So it could range from IT running some batch jobs every now and again on a Lambda, which is the perfect use case, to startups that are maybe migrating existing Django apps into a single function, or migrating a monolith, to startups that are entirely driven from serverless and everything's a function. So there's a massive variety and spectrum, and I think some of the details we have in the report show that. Large enterprises with large legacy systems are adopting it in slightly different ways to newer companies.

Jeremy: So let's talk about that, so that's the next finding here is that Lambda is more prevalent in large environments. So you kind of got at why you think that is, but is that just because of cloud sophistication you think?

Darcy: I definitely think it's a major factor. When you have a large organization with several teams, there is this broader movement towards microservices and having team ownership boundaries of services and giving engineers more autonomy. I think with that, you just have an increased likelihood of one, two, three, four teams adopting Lambda, and then that being the gateway in a large organization. Whereas, if you're a smaller company, maybe you're not operating at the same scale yet, you can end up with more unified technology, which means that there's less chance that Lambda will be adopted, even though more of these startups and new companies are adopting Lambda.

Jeremy: I wonder, Stephen, for you, do you think that education or the learning curve to get people started with serverless, this could be another thing? You've got a small organization, they can't just go off and do all these skunkworks projects like Darcy said. Do you think education in serverless or the learning curve is holding some of these smaller companies back?

Stephen: That's a good question. I think there's a lot of misconceptions around serverless that might hurt its adoption in some organizations, and it definitely requires a different way of thinking that not every organization might be ready for. But I think what we saw, the trend several years ago was that it was engineers driving serverless, people are building their own projects off of it and seeing how cool it is and how fast it lets them move, and they're bringing it into their organizations. So I think education is a big part of it, both realizing that it can solve my use case, and I might be able to solve that problem faster than I could otherwise. I don't need to go spin up a bunch of Docker containers and manage scaling and everything for different services that I want to run.

Jeremy: So speaking of containers, this is one of the findings that I absolutely love, the fact that container users have flocked to Lambda. Now clearly, they're not abandoning containers altogether, not that there's anything wrong with containers, we love containers, but the idea of being able to use Lambda functions to do some of the workloads and just make it so much easier, not even to worry about orchestration or any of that stuff. But it sounds like, based on these findings, 80% of users that are using containers, AWS users, have at least five Lambda functions running.

Stephen: Yeah. This was another really surprising finding, and like you talked about with cloud sophistication and what Darcy talks about with the popularization of microservices, it's less important where you're actually running your code. So if you're already running in a microservice architecture, then it's very easy to adopt serverless and see maybe this is a service where we want more elastic scaling or our workloads are a little bit less predictable and more spiky, and that's a great place to run serverless functions. So we see people running these together sometimes to get the benefits of both, or just because different teams are using different technologies for the problems that they're solving.

Jeremy: I wonder, Darcy, I don't know if you know the data on this, but do you see a reduction in use of containers while people are migrating to Lambda functions? Or is it just a new subsegment of their architecture that's growing?

Darcy: I don't think we've seen any hard data to indicate that, I think it's probably more likely as organizations grow container adoption, we did have our containers report come out I think a couple months ago as well, so there's probably better data there, but that's also been growing. There's so many opportunities for organizations to migrate from maybe a traditional internally hosted infrastructure or to even older cloud infrastructures to containers and serverless, that they're both still growing.

Jeremy: I think the lift and shift is much easier when you're porting to containers, a lot of your application code doesn't have to change as much. Whereas with Lambda, you are completely re-engineering and refactoring how, not just the code works, but your whole entire architecture as well. I think that gets a little bit complex. So the next finding here is that Amazon SQS and DynamoDB pair really well with Lambda, so I think that is kind of obvious. You would think people who build serverless applications, you want to use tools that are serverless themselves or at least play really, really well with serverless tools. But that was interesting, because it seems like the pay-per-use stuff is very popular with Lambdas, and not so much with SQL databases, although you still see some people trying to hit MySQL with them as well.

Darcy: Yeah. I think this is something that's due a little bit more to the legacy of Lambda. We've seen newer services like some of the Managed RDS stuff coming from Amazon, which is really promising, but there were historical technical reasons why using DynamoDB might have been the easier, lower-resistance solution versus using RDS. I think that's changing with a lot of the services that AWS has been introducing lately. Adding the Managed Aurora stuff is definitely a big step up from that. So I think we might see that change in the future. There's certainly a lot of advantages to SQL or relational databases over more eventually consistent databases like DynamoDB. Looking at SQS versus something like Kinesis streams, again, it's just a lot easier to set up and adopt. Engineers tend to prefer simpler integrations, and the integrations that AWS has with SQS are a lot quicker to get up and running, versus something like Kinesis, which is more of a workhorse solution.

Jeremy: I didn't see Managed Kafka showing up as one of the queues that people were using with Lambda.

Darcy: Definitely there are a lot of orgs using Managed Kafka, but there is another level of overhead. Where if you are looking for a really simple managed solution, it's not necessarily the best way to go.

Jeremy: You mentioned the services that AWS is offering with Managed Aurora or Aurora Serverless and things like that, and I do think that, I wish there weren't as many of those solutions. I wish people were forced more to go with the more serverless type things. Because Serverless Aurora or Aurora Serverless isn't quite serverless in that sense. There is some scaling that's still required there, and you still have connection management and all those other things. I know you've got the Data API and some of those things that can really help with it, but I think overall trying to push people towards things like DynamoDB for operational stuff ... Now granted, I get it, you're going to do your transactional stuff or you've got to do your analytics reporting and things like that. But I'm curious in terms of whether or not, because it looks like the number of data stores that are mixing with that SQL, generically SQL I guess, is fairly low, with DynamoDB being much higher. Are you seeing companies moving away from SQL and to DynamoDB, or is it just one of those things where it's like whatever people are comfortable with, that's what they're going with?

Darcy: I definitely think there is a comfort level. People tend to ... Different databases for different solutions. Dynamo is not necessarily great for things like ad hoc queries or things with complex table joins, so there is a level of technical trade-off that engineers make. I don't think you're ever going to see a case of DynamoDB entirely replacing relational databases. I think relational databases have a ton of uses. If anything, I think we'll just see the story of using relational databases and having managed relational databases become easier and easier, and more of the scaling overhead being taken away from engineers.

Jeremy: The other thing I'm curious about too is, especially seeing that SQS and Kinesis and SNS are so popular for triggering Lambda functions that are then calling the data sources, it seems like a lot of your customers are starting to embrace that idea of asynchronous thinking. Is that something you feel you're seeing as well?

Darcy: Yeah, definitely. I think there is this event-driven microservices revolution that's happening. It does take a long time for large organizations to really buy into that idea. If you've been running a monolith successfully with 50 to 100 engineers for the last five years, then it's harder to have that organizational buy-in. It takes time. But I think that is a growing trend, and I think the different architecture patterns that people are employing around distributed queuing and event-driven design, and relying on very redundant and ephemeral instances for everything, whether that's in containerization or in Lambda, I think that's growing as one of the most popular web architectures.

Jeremy: Definitely. So the next one, which I thought this was, I think it makes sense for Lambda functions to be written in Node and Python, because the cold start is a little bit lower and so forth. But it's interesting, because with the number of large clients that seem to be adopting it faster, which I would expect to be developing apps in Java or something like that, as a more popular language, Python and Node.js in a landslide are the most popular.

Darcy: Yeah. I think it makes a lot of sense to me. With compiled languages (Java, even though it's running in a VM, is compiled), so things like Go and Java, there is more overhead in setting that up and deploying that. And I'm sure a lot of Lambda usage, or a surprising amount of Lambda usage, isn't people with big infrastructure-as-code deployments, it's somebody copying and pasting a Lambda function directly into the AWS console. I think a lot of the adoption probably just happens from that, it's a lot easier to copy and paste code and adjust it in the console. Then for a lot of the orgs using actual infrastructure as code and deploying in a more rigorous way, it's still a lot easier to control and deploy Lambdas for those languages.

Jeremy: I think too, those languages are, because they're not compiled, because you can launch them so quickly, and I think the cold start time was another thing that was hugely popular. But I don't know, Stephen, maybe you can shed some light on this too. Just thinking about which engineers might be using this. Is it that your main, your hardcore app engineers who are coding the main infrastructure or whatever, the main app are doing it in Java or something like that? Whereas some of these other side use cases, maybe ETL tasks, maybe some data transformations, maybe some DevOps things are just quick and dirty Lambda functions that are written in these languages by DevOps engineers?

Stephen: Yeah, absolutely. So we really see the use cases all over the place here, but for actual runtimes that we see used a lot, for a background job it's very common that we'll see it written in Python and Node. But I think what's been really surprising is that even at these really large organizations who are adopting Lambda that we talked about before, they're still using Node and Python a lot, which I think would surprise some folks that they're really at the cutting edge of both how they're running their code and the way that they're actually writing it. We do see some more Java use on the enterprise side, of people moving these Java microservices into Lambda functions. I think that fits their use case pretty well. But in general, we do just see the most in Python and Node everywhere.

Darcy: I think the other maybe interesting thing is you would think Ruby would fit into that paradigm of dynamic language, but we're not seeing a lot of Ruby adoption in Lambda, and we're not really sure of the reasons why. I think that probably would be the only outlier, whereas Ruby was traditionally one of the more popular languages that was being monitored by Datadog as a whole.

Stephen: It could even be skewed by people who have these Rails monoliths, and if they lift and shift that into Lambda, they just have fewer Lambda functions.

Jeremy: That's also true. But that actually could be, that leads in I think to the next finding, which could also be based on the use cases. Is the fact that the median Lambda function runs for 800 milliseconds. Half of them run for less than 800 milliseconds, which could be front end, maybe synchronous calls from an API Gateway or something like that. But those longer running ones, that sounds like there are other tasks that those are performing.

Darcy: Yeah. Some people have probably decided to use Lambda in a way that is actually like running more computationally heavy workloads. Things like web scrapers or like jobs that are running continuously over a period of time. We anticipate some of that is coming down to, I don't want to say misusage, but trying to convert a workload, like a square workload into a circular hole.

Jeremy: You can say misusage, because I think that's right.

Darcy: It might be the case that some workloads aren't necessarily entirely appropriate for services like Lambda, and containerization and something like ECS or Fargate would be more appropriate. But there are some cases where people are just making it work I think. For the lower use cases, I think the majority of events that we see are probably like HTTP events, so low latency is really, really important for an API, which is probably why we see such a heavy skew towards fast invocations.

Jeremy: The other part of that too was that basically it says one-fifth of Lambda functions run for 100 milliseconds or less, which is interesting, because of course that is the billing threshold, the unit of billing for AWS. I know there have been some calls, including from myself, to get that granularity down to maybe 50 milliseconds as opposed to 100 milliseconds. But that's interesting, because again, you can't run much in 100 milliseconds, unless it's something like responding to an API Gateway for example.
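
Editor's note: to make the billing-unit point concrete, here is a minimal Python sketch of rounding a run time up to the billing increment. The 100 millisecond increment is the one mentioned in the conversation and the 50 millisecond value is the hypothetical finer granularity Jeremy asks for; the function names are illustrative and are not part of any AWS API.

```python
def billed_duration_ms(actual_ms: int, increment_ms: int = 100) -> int:
    """Round a run time (whole milliseconds) up to the next billing increment."""
    if actual_ms < 0 or increment_ms <= 0:
        raise ValueError("actual_ms must be >= 0 and increment_ms must be > 0")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-actual_ms // increment_ms) * increment_ms


def unused_billed_ms(actual_ms: int, increment_ms: int = 100) -> int:
    """Milliseconds that are billed but not actually used by the function."""
    return billed_duration_ms(actual_ms, increment_ms) - actual_ms


# A 30 ms invocation under the two granularities discussed above.
for increment in (100, 50):
    print(increment, billed_duration_ms(30, increment), unused_billed_ms(30, increment))
```

With a 100 millisecond unit, a function that finishes in 30 milliseconds is billed the same as one that takes the full 100, which is why a finer unit matters most for very short, API-style invocations.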

Darcy: Yeah. I think with a lot of these asynchronous patterns that exist in Lambda, you can push something to be written off to a database in a more eventually consistent way in that amount of time.

Jeremy: That's true.

Darcy: So there are patterns that exist that really reduce the amount of latency that can go onto a web request and give you a very fast average response time. A lot of Lambdas are doing just processing work, they're transforming output from a queue, and then that's going off to the next queue. So there are some mixed use cases I think.

Jeremy: Yeah, definitely. The next one is half of Lambda functions have the minimum memory allocation, and speaking of misuse, this is probably one of those.

Darcy: I think the memory allocation, it probably comes more down to miseducation, although if your Lambda's executing within 100 milliseconds, maybe it makes sense to leave it. I think people don't spend a lot of time thinking about how to optimize these services. We've removed so much of the thinking about overhead and infrastructure that developers are just putting things in Lambda and not even spending the time to tweak it and reduce latency, and potentially cut costs.
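
Editor's note: the trade-off behind this finding is easier to see with numbers. Lambda bills compute in GB-seconds (memory multiplied by billed duration) and allocates CPU in proportion to memory, so a larger setting can sometimes finish fast enough to cost the same or less. The Python below is a minimal sketch; the duration measurements are entirely hypothetical and would need to come from your own function, and the helper names are illustrative.

```python
def gb_seconds(memory_mb: int, billed_ms: int) -> float:
    """Billable compute for one invocation, in GB-seconds."""
    return (memory_mb / 1024) * (billed_ms / 1000)


def cheapest_setting(measurements: dict[int, int]) -> int:
    """Return the memory setting (MB) with the lowest GB-seconds per invocation.

    `measurements` maps memory in MB to the billed duration in ms observed
    at that setting.
    """
    return min(measurements, key=lambda mb: gb_seconds(mb, measurements[mb]))


# Hypothetical measurements for one function at three memory settings.
observed = {128: 800, 256: 300, 512: 200}
print(cheapest_setting(observed))
```

In this made-up case the minimum allocation is not the cheapest option, and it is also the slowest, which is the kind of tuning the discussion suggests most teams never get around to.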

Jeremy: Stephen, do you think that is just something where we need to develop better best practices around?

Stephen: Absolutely. This is a question that we hear a lot from folks: "How do I optimize my Lambda workloads?" I think even though Lambda and serverless have let us write code, upload it to a cloud provider and let someone else run it, there are still knobs that you can turn to adjust performance. I think there's a lot of miseducation or complete lack of education around what these knobs mean. So we see this as well when customers run into issues with concurrency limits, memory's another big one, and one of the latest ones has been provisioned concurrency. That's both how do these work and how do I use them effectively for my application and to be most efficient with cost.

Jeremy: So another best practice maybe is setting good timeouts, because you certainly don't want ... With Lambda functions, you're paying while they're processing. So if you have something that hangs for some reason, if you expect it to end or the processing should end within 10 seconds, and you set it for 10 minutes, and it just keeps on running and running, then obviously you're paying for something you don't need to. So the report points out that two-thirds of defined timeouts are under a minute, which is probably good, there's probably more granularity in there. But I think the scary thing was a bunch of them were set for 15 minutes, the maximum.
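
Editor's note: the cost argument for tight timeouts is simple arithmetic, sketched below in Python. It assumes, as described above, that a hung invocation keeps being billed until the timeout ends it; the function name and the invocation count are illustrative.

```python
MAX_TIMEOUT_SECONDS = 15 * 60  # the 15-minute maximum mentioned above


def worst_case_billed_seconds(hung_invocations: int, timeout_seconds: int) -> int:
    """Total billed seconds if every one of these invocations hangs until timeout."""
    if hung_invocations < 0 or not 1 <= timeout_seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError("invalid invocation count or timeout")
    return hung_invocations * timeout_seconds


# 1,000 hung invocations (an illustrative number) under two timeout settings.
tight = worst_case_billed_seconds(1_000, 10)
loose = worst_case_billed_seconds(1_000, MAX_TIMEOUT_SECONDS)
print(tight, loose, loose // tight)
```

For work that should finish within 10 seconds, leaving the timeout at the maximum multiplies the worst-case bill for a hang by 90.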

Darcy: I can see why some developers would choose to do that. Their reasoning is maybe this is an intense job and we can't afford for it to fail. But the truth is that any piece of infrastructure you build, you should have the ability to recover and retry for the sake of scalability. I think it's maybe being used as a crutch by some developers or engineers to try and guarantee something's going to finish, without spending the time to think about how to engineer their workloads to run quickly.

Jeremy: I tend to lean more towards the crutch side of things, because I think that it's just like, "I can let it run for 15 minutes, I'll let it run for 15 minutes." But have you or has your team, and Stephen maybe because you're closer to the customers, you could see this, but have people had concerns over this idea of the denial of wallet, this idea of flooding Lambda or API Gateway so that your Lambda functions are just running, is that something that you hear about a lot?

Stephen: We hear some security concerns occasionally similar to this, and I think they're split between timeouts and concurrency limits, where they're both things that can starve resources in your account. So if you have a high timeout, and you get flooded with requests, you're going to have a Lambda function executing for a long time. It's just going to cost you money unnecessarily. On the concurrency limit side, we see this as an issue where if you have unbounded concurrency and your function gets too many invocations, then it can actually starve the concurrency in that region of your AWS account, and actually cause your other functions to not run. So I think that's a common gotcha that we see, so we typically recommend that customers set timeout limits and when appropriate, concurrency limits just to prevent these kinds of attacks.
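
Editor's note: as one way to act on that recommendation, the Python sketch below sets both limits through the AWS SDK. The `update_function_configuration` and `put_function_concurrency` calls are the boto3 Lambda client operations for these two settings; the `apply_guardrails` wrapper, its validation, and the example values are assumptions made for illustration. The client is passed in so the sketch runs without AWS credentials.

```python
from typing import Any

MAX_TIMEOUT_SECONDS = 900  # the 15-minute maximum discussed earlier


def apply_guardrails(
    client: Any,
    function_name: str,
    timeout_seconds: int,
    reserved_concurrency: int,
) -> None:
    """Set a timeout and a reserved-concurrency cap on one Lambda function.

    `client` is expected to behave like boto3.client("lambda").
    """
    if not 1 <= timeout_seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError("timeout_seconds must be between 1 and 900")
    if reserved_concurrency < 0:
        raise ValueError("reserved_concurrency must be >= 0")

    # Bound how long a single invocation can run (and be billed).
    client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_seconds,
    )
    # Bound how many invocations of this function can run at once.
    client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved_concurrency,
    )
```

With boto3 installed and credentials configured, a call might look like `apply_guardrails(boto3.client("lambda"), "my-function", 10, 50)`, where the function name and both numbers are placeholders to be chosen per workload.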

Jeremy: Actually, the concurrency limit was the last point here, or the last finding, that only 4% of functions have a defined concurrency limit. I think for a lot of internal communications, maybe if you're using it for DevOps or different use cases like that, then not having a concurrency limit on it probably isn't that big of a deal. But certainly some of these ones that are forward facing or processing off of queues, or anything where you could just flood these things, that seems a little bit scary to me. Although, it does say that 88.6% of organizations have at least one function with the concurrency limit defined.

Darcy: Not 100% sure why people aren't setting them. I think there is this promise of serverless, that it just scales, and if you have a sudden demand or peak in usage, it just scales. This might be coming down to engineering teams not asking themselves questions, or not really planning the maximum capacity of their systems. Lambdas that are just running cron jobs in the background, they don't necessarily need that kind of thought put into them. I think organizations that do use concurrency limits generally are thinking about scaling, they're thinking about scaling in a much more granular way. Again, that account-wide concurrency limit across every function is something that's very easy to get bitten by early on when you're dipping your toes in Lambda, and getting a bit of usage.
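
Editor's note: the account-wide limit is easier to reason about with a toy model. The Python below is a deliberately simplified sketch and not a description of how AWS schedules invocations: it assumes a single shared pool per region, grants capacity in the order functions are listed, and treats a reserved limit as both a cap on that function and capacity set aside from the shared pool. The function names and numbers are hypothetical.

```python
from typing import Optional


def allocate(
    demand: dict[str, int],
    account_limit: int,
    reserved: Optional[dict[str, int]] = None,
) -> dict[str, int]:
    """Return the concurrent executions granted to each function."""
    reserved = reserved or {}
    if sum(reserved.values()) > account_limit:
        raise ValueError("reserved concurrency exceeds the account limit")

    # Whatever is not reserved is shared by every function without a limit.
    shared_pool = account_limit - sum(reserved.values())
    granted: dict[str, int] = {}
    for name, wanted in demand.items():
        if name in reserved:
            granted[name] = min(wanted, reserved[name])
        else:
            granted[name] = min(wanted, shared_pool)
            shared_pool -= granted[name]
    return granted


# A flooded public endpoint arrives first, then a small internal worker.
demand = {"public-api": 5000, "billing-worker": 50}
print(allocate(demand, account_limit=1000))
print(allocate(demand, account_limit=1000, reserved={"public-api": 800}))
```

Without a limit, the flooded function takes the whole pool and the worker gets nothing; with a cap on the public function, the worker keeps running, which is the failure mode described above.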

Jeremy: Then maybe this is an education thing too, because I've been saying this for quite some time, where your average developer usually didn't know anything about the infrastructure. They're writing code and you have an ops team, or you eventually get to the DevOps culture where you're working together to make sure that the code you wrote had the right infrastructure behind it. But when it comes to things like Lambda functions, there's just a lot of questions that you probably never had to ask yourself as a developer before. As you said, as teams become more agile and they're able to just publish directly to a Lambda function, as opposed to have somebody set up a Kubernetes cluster for them with all kinds of defined rules. I guess your average developer's probably not thinking about this.

Darcy: Yeah, absolutely. We've built a really, really powerful abstraction, but it is an abstraction and there are leaks. Concurrency is absolutely one of those things, how the process of Lambda works leaks a bit, how the language and the runtime works within that context leaks a bit. If you just take the mentality of, "It will just run my code as much as I need it to," then you're probably not thinking about it quite enough.

Jeremy: Because the other thing too is, as you mentioned, where this promise of serverless is we'll just scale, we'll just keep scaling and scaling, and nothing is infinitely scalable. Everything has limits at some point. Obviously, the per region concurrency limit is an artificial limit that AWS puts in, you can increase that. So if you have 10,000 concurrent requests, you can have them increase that for you. But I think your average person probably doesn't think about that, at some point is too much scale not good? Do we want to limit scale because of downstream systems, because of billing concerns, because of all of these other things.

Darcy: Yeah. When we talk about event-driven architectures, it's the same as any infrastructure, there's bottlenecks in the pipes and the connections between different services. It's very easy to create downstream pressure. I think the more people buy into the serverless promise, which is every piece of your infrastructure is meant to be something that can scale automatically for you, like Dynamo, like SQS, the easier it is to miss the finer print on how your system could fail. It's why monitoring is still extremely important in these environments, you can't really skip that, because you could find yourself in a situation where your services start failing, even though you've got all the settings to say, "Scale to a million. Scale to a billion," or whatever. One small thing might not be able to keep a promise, and suddenly you have a failure somewhere in your system.
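
The capacity question raised above can be made concrete with a small back-of-the-envelope calculation. The following is only a sketch under stated assumptions: the function names, capacities, and the helper `plan_limits` are hypothetical, and the account-wide limit of 1000 is an assumed figure that should be checked against the real quota for your own account and region.

```python
# Hypothetical planning helper: derive a per-function concurrency cap from
# what each function's downstream dependency can absorb, then check that the
# caps fit inside the account-wide concurrency limit.

ASSUMED_ACCOUNT_LIMIT = 1000  # assumption: verify against your own account quota


def plan_limits(functions, account_limit=ASSUMED_ACCOUNT_LIMIT):
    """Return (caps, unreserved).

    functions maps a function name to (downstream_capacity, load_per_invocation),
    e.g. a database that accepts 100 connections and a function that opens 1
    connection per invocation.
    """
    caps = {}
    for name, (capacity, load) in functions.items():
        if load <= 0:
            raise ValueError(f"{name}: load per invocation must be positive")
        # Integer division: never allow more invocations than the dependency can take.
        caps[name] = capacity // load

    unreserved = account_limit - sum(caps.values())
    if unreserved < 0:
        raise ValueError("per-function caps exceed the account-wide limit")
    return caps, unreserved


if __name__ == "__main__":
    caps, unreserved = plan_limits({"api": (100, 1), "worker": (50, 2)})
    print(caps, unreserved)
```

The resulting numbers are the kind of values a team might then set as per-function concurrency limits, so that a traffic spike is throttled at the function rather than overwhelming the system behind it.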

Jeremy: Definitely. Distributed systems in and of themselves are very difficult, and now when you start talking about all of these little components, all of these little building blocks with serverless, and you've got Lambda functions and queues and databases and streams and all that stuff that's communicating with one another, being able to understand where those failures are is a hugely important thing.

Darcy: Yeah, absolutely.

Jeremy: Awesome. So those were the findings in this report, and if you haven't seen this report yet, DatadogHQ.com/state-of-serverless. I'll put it in the show notes as well. But while I've got you two here, you work with a lot of enterprises, it's great to get some insight into what other people are doing. You're an enterprise yourself, or you work for an enterprise yourself, so I'd love to start and maybe just get a little bit of context from you. How did Datadog get started with serverless? Actually, how serverless are you actually? That would probably be a good question.

Darcy: That is a good question. We're about a 10 year old company I think, it's either 2009 or 2010 that we started. So we were pretty early as part of this whole DevOps revolution, I think part of our mission statement was to be a part of that movement. But our infrastructure was built on EC2 instances, it was glued together with all the things you would do in 2010 cloud architecture. We've scaled, so we brought in a bunch of different teams and products and different services, acquisitions from different companies that have been brought into the fold. So, as happens, we've had internally a lot of adoption to move to a more microservice-based approach, and having teams own individual services which they maintain and deploy and monitor. So that's been a big trend internally. We have a lot of Kubernetes adoption and we have serverless adoption as well for a lot of our teams. On the Serverless Team, surprisingly our compute loads don't run on serverless, because we use the same infrastructure as a lot of Datadog has historically. But when we went out and started talking to people in the company and finding places that serverless was being adopted, we found it was actually adopted pretty widely in the company in very different and diverse use cases.

Darcy: We have a lot of internal tools and products built on top of Lambda. A lot of tooling around CI/CD is using Lambda, Slackbots too, and in some production workloads, Lambda is a piece of that as well. It was really surprising just to find the different sort of use cases that existed within our company. We have a commitment, funnily enough, to be a vendor-agnostic infrastructure, so a lot of what we do and build is run on Kubernetes. But a lot of the glue pieces that we have to run our business are running on services like Lambda.

Jeremy: Awesome. Stephen, is that something too, again, talking to your customers, and obviously you've seen this serverless adoption with the report here, but it seems like no matter what you're using out there, serverless is probably a part of it somewhere.

Stephen: Yeah, absolutely. That's something that we took away from the report as well, just the customers that we talk to every day, that serverless is all around us and that things that we use, applications, websites that we interact with every day are using serverless somehow. It's not even just in internal tools, it's your banking application, the way you buy movie tickets, the way you sell something on the internet. In some way, all these different companies are using serverless, are using Lambda workloads or they're using some sort of serverless technology, either with serverless containers, databases, on other cloud providers. It's really everywhere, and I think the adoption is so much higher than any of us expected.

Jeremy: That's awesome. Another thing that always seems to be a popular topic of conversation, especially with companies like yours that have to work with customers that are probably faced with this internal battle, is this idea of multi-cloud. Again, there's a million different definitions of it, multi-cloud versus being cloud agnostic for example are probably two different things. But just I'd love to get your take, Darcy, on this multi-cloud thing, and what you're seeing from your customers and what you need to support I guess, from a serverless standpoint.

Darcy: We definitely see a lot of very large enterprises that are multi-cloud, and sometimes it just comes down to which team within that company is building something. You have large multinational corporations that tend to go, or be more likely to go multi-cloud, because they might have been acquiring a company here, or bringing a different provider in here. I think I haven't seen personally many examples of companies choosing multi-cloud as their primary architecture. I think you do have people mixing and matching software-as-a-service solutions, maybe outside of their primary cloud platform. It might be pulling in Auth0 or another managed service on top of that, but I haven't seen too many examples of companies primarily splitting their infrastructure service by service on different clouds.

Darcy: I do think you do see bridging technologies, so there are some of these providers that do multi-cloud, but in an abstracted way. So you have code that you want invoked, and maybe they'll find a way to bundle up the Lambda or Azure Functions and GCP Functions, and distribute it across different cloud providers that way. We do see maybe mixed adoption between something like Cloudflare Workers for web traffic specific flows versus running the majority of your infrastructure in AWS. That's pretty common. Then there's really exciting things with Knative and Google Cloud Run for instance, where people are trying to think about how to build serverless applications in a platform-agnostic way, which I think is going to be very cool in the future.

Jeremy: Again, platform agnostic would maybe be a great dream if all of these platforms weren't doing things differently, because being serverless on AWS means something completely different than being serverless on Azure for example. There's a lot of overlap, functions as a service, things like that, but just the different services available, it seems like there's so much that's different that at least for quite some time, until there's some standardization, it just doesn't make sense that cloud agnostic is going to be a thing.

Darcy: Yeah. I think with something like Cloud Run, it wouldn't tick all the boxes for how you define serverless with ... Like Knative probably wouldn't tick all the boxes for how you define serverless compared to Lambda. So if you're running a Knative workload, that's not ticking the box of the pay-per-invocation model. You're managing a cluster still, you are building functions, and you are doing event-driven workloads. Versus something like Google Cloud Run, where you do have more of the pay-per-invocation model. So I think there is definitely a mix and match. Companies tend to buy really heavily into one platform's set of conventions and services, and I think unless you have a high priority on having multi-cloud availability, that's generally the way companies that I've seen would choose to go.

Jeremy: Stephen, I'm curious too again, being close to the customers, in terms of how people are approaching building serverless applications, it's more than just choosing technologies obviously, it's avoiding choosing the lowest common denominator, trying to choose services that are scalable, that are easy to plug in. But are you seeing the serverless mindset or the serverless-first shift?

Stephen: Absolutely. Obviously, that's biased by the customers who I talk to, but we see these customers who come in and are adopting serverless for all new products that they build, or they're moving existing infrastructure over to it, or they've even decided when they started as a company that they were going to be 100% serverless. So we're seeing this mindset really increase more, and we're seeing it mean more than just functions as a service, it's really how are we storing data. So for example, we see a lot of these customers using technologies like AppSync, so they're really at the cutting edge of using GraphQL, data stores, these event driven architectures. We typically see all of this together.

Jeremy: Awesome. I want to ask you one more question, because I'd love to get your thoughts on this and what you're seeing from adoption from your customers. I know it wasn't in the report, but people always ask, "What's next? What's after serverless, what comes next?" And I tend to believe, and I think other people agree with this, that it's edge computing. That's the next place where you're going to see a shift. So this idea of not having to manage infrastructure is great, it's going to be even better when your execution environment is no further than two miles away from the customer that's trying to access it, because it's just in every POP throughout the world. What are your thoughts? You did mention Cloudflare Workers, Darcy, but what about Lambda@Edge and things like that? Are you seeing that adoption?

Darcy: Yeah, I think Lambda@Edge is becoming pretty popular. It's primarily just used for use cases around, most of the time, serving web traffic and things like adjusting HTTP headers, or sometimes if you're serving a static site, you might stick a CloudFront plus Lambda@Edge in front of that. And Lambda@Edge is a great way to do things like A/B testing or to customize a static site in a way for your users. I think there will be a move past this concept of regions, because something like Lambda@Edge doesn't really fit into the idea of regions in AWS very neatly. So I think the idea of infrastructure and actual heavy workloads that are maybe still talking to your databases and still distributed geographically is going to be more common. At the moment, it's still tricky to do a multi-region infrastructure, you're still going to be limited. You're still going to be making geographical trade-offs, like how you reach your customers and how you store data so that there's enough proximity to reduce that latency. So I think there is an evolution to be had there, but we're only at the tip of that I think.
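
As an illustration of the "adjusting HTTP headers" use case mentioned above, here is a minimal sketch of a Lambda@Edge-style handler in Python. It assumes the function is attached to a CloudFront response trigger and that the event follows the CloudFront event shape, with headers keyed by lowercase name; the specific header and its value are chosen only as an example and are not something discussed in the episode.

```python
# Minimal sketch: add a security header to every response passing through
# CloudFront. Assumes a response-trigger event, where the response object
# lives at event["Records"][0]["cf"]["response"].

def handler(event, context):
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    # CloudFront represents each header as a list of {"key", "value"} dicts,
    # stored under the lowercase header name.
    headers["strict-transport-security"] = [
        {
            "key": "Strict-Transport-Security",
            "value": "max-age=63072000; includeSubDomains",  # example value
        }
    ]
    return response


if __name__ == "__main__":
    # Hand-built event for local experimentation; real events carry more fields.
    fake_event = {"Records": [{"cf": {"response": {"status": "200", "headers": {}}}}]}
    print(handler(fake_event, None))
```

The same pattern, reading or rewriting headers and cookies on the request side instead, is one common way the A/B testing case is approached.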

Jeremy: I think the data piece of it is probably going to be the hardest. With Cloudflare Workers, they've got the global KV store, which is great, but still, what data do you need to replicate to certain regions? I think making the serverless jump is hard enough for a lot of developers, never mind ... Not only are you running concurrent functions, you're now running concurrent functions at 180 points of presence across the world or something like that, and you have to manage the data separately, and your app has to be smart enough to do it, or the system has to be smart enough to do it. I think that's crazy.

Darcy: Developers are struggling to join data across multiple tables, how do they do that across multiple regions?

Jeremy: That's right. I can see that getting very complex. Listen, thank you both for being here and going through this report. That was awesome. A lot of great information I think, and like I said, if people want to go and check out the report itself or check out Datadog, they can do that at DatadogHQ.com. So why don't we just, if people do want to contact either of you, I know, Stephen, you're on Twitter?

Stephen: Yeah, I'm @SPNKTN on Twitter.

Jeremy: And then the Datadog blog is just DatadogHQ.com/blog. I think Darcy, you're in the shadows on Twitter, you don't do much of that?

Darcy: No, I'm not a massive Twitter user.

Jeremy: You're too busy building stuff to support serverless, so that's great. So again, thank you both, I will get all this information into the show notes, and it was great talking to you.

Darcy: Yeah, great talking to you.

Stephen: Yeah, thank you for having us.

THIS EPISODE IS SPONSORED BY: Stackery & AWS (Amazon EventBridge Learning Path)