Episode #65: Serverless Transformation at AWS with Holly Mesrobian

September 7, 2020 • 37 minutes

On this episode, Jeremy chats with Holly Mesrobian about why teams at AWS are adopting serverless, how Lambda is helping to make other AWS services better by pushing their limits, how to correctly isolate your serverless microservices, and what's next for Lambda and serverless at AWS.

About Holly Mesrobian

Holly Mesrobian is a Board Member at Cascade Public and the Director of Engineering for AWS Lambda. Holly has 25 years of experience in designing, building, and managing globally distributed teams in software development, and more than 15 years as a leader of leaders. She has in-depth experience with building services for builders, and for wireless and broadband carriers; online services for direct-to-consumer offerings; and commercial shrink-wrapped software. With a double Master’s Degree in Computer Science and Software Engineering, Holly began her career as a developer before holding leadership positions at companies like Amazon and RealNetworks, and at the startup Cozi.

LinkedIn: https://www.linkedin.com/in/holly-mesrobian-a1b710/
Under the Hood of AWS Lambda 2019: https://www.youtube.com/watch?v=xmacMfbrG28
Under the Hood of AWS Lambda 2018: https://www.youtube.com/watch?v=QdzV04T_kec

Watch this episode on YouTube: https://youtu.be/nBYUh7CVUiQ

Transcript

Jeremy: Hi everyone. I'm Jeremy Daly and this is Serverless Chats. Today, I'm chatting with Holly Mesrobian. Hey Holly. Thanks for joining me.

Holly: Hi, thank you for inviting me.

Jeremy: You are the director of engineering for AWS Lambda at Amazon Web Services. Why don't you tell the listeners a bit about your background and what the director of engineering for AWS Lambda does?

Holly: Absolutely. Engineering leaders in Amazon are very technical and I think I fit in that class of leader. I've been in engineering for more than 25 years. The first decade that I was in, I was actually an engineer, and then the last 15 years or so, I've been leading large-scale engineering organizations that are also responsible for 24/7 operations. You think about those, they're called DevOps organizations. That's what I've been doing for quite a while now. The Lambda engineering organization is just like that. In terms of my background and how did I get here?

I have two graduate degrees, computer science and software engineering, and as I referenced, lots of time designing and building systems. One of the things that's really great about AWS and leading the teams here, I reference that DevOps culture. My teams, they build it, they run it and they have really great best practices around engineering excellence and operational efficiency. If we have an issue in one of our production environments, my teams are on it, and we have great processes around how we do that. We have a really well-established COE.

Anytime there's a customer-impacting issue that happens in one of our production environments, my team writes a COE, which means correction of errors, and we review it as an engineering team every week. I sit down with my teams, we go through operational dashboards, we inspect our metrics. We look at how we're doing across availability, latency, and scale. We have ongoing scaling targets and scale testing. We're constantly inspecting how we're running the service. Not just how we're building it and how we're building new features, but how we're running it.

We run game days as well, where we try to break our systems and then see that my team, all my on-calls, can recover those systems. One of the things that I really like is we put new people in the team on those game days, because where better for them to learn than when we've intentionally broken the system? Get in there and figure out if you can fix it before you're actually fixing something in production. That's really great about Amazon.

Then I would say the other great thing about Amazon and Amazon engineering and the teams that I have, I just love what a high caliber they are and how invested the members of the team are, and how hard they will work to try to make the best service for our customers.

Jeremy: Awesome. Well, listen, I am a huge fan of AWS Lambda and I love what you do. I love what your team is doing. Everything that Amazon is doing for serverless is just amazing. One of the things though that I'd love to talk to you about today, and we could get into the specifics of Lambda itself and how everything works, but you have a couple of great talks. You and Marc Brooker did these talks at re:Invent in 2018 and 2019, getting into the details of Lambda, Lambda Under the Hood, right? Great talks.

If anybody wants to know exactly how Lambda functions work and how all that stuff works under the hood, definitely go check those out. I will put those in the show notes. What I'd really like to talk to you about today is just this idea of serverless adoption or serverless transformation, because I know AWS talks a lot about how all their internal tools are going serverless, right? Which is pretty cool. Of course there's the external stuff too. There's a lot of adoption from enterprises and small businesses and medium-sized businesses and things like that.

I would love to know the mindset internally. How does AWS take serverless or look at serverless and look at Lambda and use that to build their internal processes? What's the learning on that? How do you keep learning and just keep building with serverless?

Holly: Yeah. This is a really fun topic for me to talk about, and as you might imagine, customers find value in the agility and the lighter operational load that serverless brings. My teams are no different, nor are other AWS teams or Amazon teams. What we have seen over time is teams across AWS adopt and use serverless. Then my own teams over time have also adopted the serverless architecture and they actually want to use it.

Over time, more and more of the Lambda service runs on Lambda, in particular on the control plane, because you don't want circular dependencies in your architecture. So we're really careful about making sure that in early design, when we're saying, "Hey, my team wants to use Lambda, is it okay to use Lambda and serverless?" Because it's building serverless underneath serverless and you have to be careful that you're not doing a bad thing. We're really good about inspecting that in the early design phases.

I've seen more and more of my teams picking up and building control planes on Lambda. In particular, they're using the feature that we launched last year at re:Invent called Provisioned Concurrency. What that does for really high-scale, low-latency services is it gets rid of what people have typically talked about, which is cold starts. Of course, we've done a ton of work over the years to reduce cold starts, but they're still not zero.

We're going to continue to do work on cold starts, but for customers who are super latency-sensitive and need that scale and know that they have low latency all the time, Provisioned Concurrency is a great solution. We have used it within our own services as well.
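For readers who want to see what turning on Provisioned Concurrency looks like, here is a minimal sketch. It assumes a boto3-style Lambda client is passed in by the caller; the helper names, the function name `checkout-api`, and the alias `live` are hypothetical and only there for illustration. The underlying operation is Lambda's `PutProvisionedConcurrencyConfig`, which targets a published version or alias rather than `$LATEST`.

```python
from typing import Any, Dict


def provisioned_concurrency_request(
    function_name: str, qualifier: str, executions: int
) -> Dict[str, Any]:
    """Build keyword arguments for Lambda's PutProvisionedConcurrencyConfig call."""
    if executions < 1:
        raise ValueError("executions must be at least 1")
    if qualifier == "$LATEST":
        # Provisioned Concurrency is allocated on a published version or alias.
        raise ValueError("use a published version or alias, not $LATEST")
    return {
        "FunctionName": function_name,
        "Qualifier": qualifier,
        "ProvisionedConcurrentExecutions": executions,
    }


def enable_provisioned_concurrency(
    client: Any, function_name: str, qualifier: str, executions: int
) -> Any:
    """Send the request through a boto3-style Lambda client supplied by the caller."""
    request = provisioned_concurrency_request(function_name, qualifier, executions)
    return client.put_provisioned_concurrency_config(**request)
```

With real credentials this might be called as `enable_provisioned_concurrency(boto3.client("lambda"), "checkout-api", "live", 10)`. Keeping the client as a parameter makes the sketch easy to exercise with a stub before pointing it at an actual account.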

Jeremy: Right. Is that something now where all AWS teams, when they're thinking about building a new service, that they're going to build that on top of Lambda and do that serverlessly?

Holly: Yeah. One of the ways that people look at it is it's that operational model and where are you sitting in that? Of course Lambda's pretty high. We do a lot more of the shared responsibility on behalf of customers, and so teams like that and they say, "Oh, well, this is going to be easier to operate. We're going to get more agility out of it, so let's go there as a first stop." It's only when they say, "Well, maybe this isn't going to work for us."

That they go to the next potential option or an option after that. Like I was talking about earlier, I see my own teams doing that same evaluation and we're increasingly using Lambda to build portions of our service.

Jeremy: Right. Right. Yeah. Another thing ... And I think that maybe we don't always think about, or maybe people don't always connect these dots, but you can't run serverless in a vacuum, right? You can't say, "Hey, I'm just going to build everything on Lambda, or I'm just going to build everything on DynamoDB." You have to talk and interact with a number of different services in order to make that happen. You think about some of the recent launches, so V2N, Provisioned Concurrency, EFS for Lambda.

These are services that Lambda has to use in order to handle some of these use cases, and because Lambda really pushes the boundaries of these services, you end up making these services better, right?

Holly: Absolutely. To your point, Lambda stitches together a ton of AWS services. I think about it as Lambda is a lot of the glue between AWS services. In a number of the features that we've built, and you referenced V2N and EFS, both of those services, we worked very closely with the teams. You can think about them as joint projects, like at a project team level where we're in the room with the leadership from both of those teams every week, talking about any issues that we're seeing or how the project is progressing.

In the process, we make those products better and the products become better products. One great example is on EFS. Because of Lambda's unique performance characteristics, the scale, the instantaneous burst, we drove EFS to deliver higher burst capability from 1K to 25K, and so the product becomes better for everyone, not just Lambda customers or serverless customers, but all customers in the process of doing that.

We also do lots of joint collaboration and work to make sure that those services are operating at that level as well. We spend a lot of time in the development cycle, ensuring that the products worked really well together.

Jeremy: Right. That's one thing that I'm curious about in terms of like, what is the serverless vision for AWS, right? Or at least from your perspective from the Lambda team, with all of these new launches, all of these things that have come out, I mean, this has solved a bunch of new use cases or has opened up a bunch of new use cases, right? I mean, with EFS, you get the ability to do maybe ML, for example. You've got the SAM CLI that just went GA. You've got RDS Proxy that just went GA to help solve the connection pooling issue.

You've got Savings Plans, you've got Provisioned Concurrency, all of these things we mentioned. Is that something where you're pushing or Lambda is pushing the other teams to help you solve these use cases so that more internal teams as well as customers can start using serverless?

Holly: Absolutely. In Lambda, as you referenced, we have continued to work to drive increased adoption and remove barriers for specific use cases. Rolling back all the way to a couple of years ago, we launched Firecracker, which made our service faster and helped reduce cold starts. We then launched V2N which brought that capability and lower latency to networking or VPC networking. Then we launched Provisioned Concurrency because we were hearing from customers that they needed that low latency all the time.

You roll forward: we just launched EFS in June. EFS is really designed to help customers who haven't been able to bring their really big data-intensive workloads to the simplicity of Lambda finally bring them to Lambda. If you think about the EFS workloads, like bring a model and run something on that model, or bring big data and do big data transformations, you can do this really simply with Lambda. The data's there and you can do it when you want to do it.

You're not holding a bunch of capacity to do this highly scaled, highly parallel data processing. Lambda is great for that and that's really why we built EFS.

Jeremy: Yeah. No. I love the idea of EFS because it does, it opens up so many more use cases. That was the common complaint with serverless was always like, "Well, you can't do machine learning with it," or something like that. This is just one of those things where it gets us much closer to being able to do those sorts of things. All right. Another thing that is a launch or a feature that I think is absolutely amazing and I think it was last year at re:Invent, was Lambda destinations.

I love, love, love, love this feature because I have so many workloads where you have some background processing happening and when the process has finished, you want to tell somebody or tell something that that process has completed. Rather than putting all of that code in there and having to call a separate service and do all that other work, it's just so much easier for you to say, "Oh, when this is done and this completes successfully, fire something off to EventBridge or to SNS or put something in a queue."

Or, if there's an error, you get all that extra context information. I really, really love this service. I would also love if maybe there was a synchronous version of it as well so I didn't have to write code maybe at the end of a function. I could send some data off somewhere else too. Maybe that's a different discussion. I think what this opens up is this idea of ... And maybe I should say, maybe confuses people a little bit, is this idea of function composition.

This is when we want to have one function end and send information to another function and so forth. Obviously there's two different ways to do this. We can do choreography. We could use something like EventBridge and coordinate them or SNS and coordinate the results of functions. We could also use orchestration and use a state machine like Step Functions. I love that you now have all these different options and I love what you can do with Lambda destinations.

What are the use cases for that? Are we talking about just small workflows and then more complex workflows use something like Step Functions? Or what was the intent of building Lambda destinations?

Holly: Yeah. That's a great question. When we built destinations, it's really designed so that you ... You used to just hit an asynchronous event and you would fire and forget it, and you wouldn't know really how it continued or be able to pick it up. We built the destinations in order to allow people to do those completions, those continuations. In terms of, if you get to those really complex use cases of, "Hey, do this then that," and all this logic and branching and things like that, then Step Functions is a great way to go, because it's really designed more for those complex workflow situations, and is probably going to be an easier use case for that.
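To make the destinations idea concrete, here is a minimal sketch of wiring success and failure destinations onto an asynchronously invoked function. It assumes a boto3-style Lambda client is passed in; the helper names and any ARNs you pass are hypothetical. The call it wraps is Lambda's `PutFunctionEventInvokeConfig`, which takes a `DestinationConfig` with optional `OnSuccess` and `OnFailure` entries.

```python
from typing import Any, Dict, Optional


def destination_config(
    on_success_arn: Optional[str] = None, on_failure_arn: Optional[str] = None
) -> Dict[str, Dict[str, str]]:
    """Build the DestinationConfig structure for asynchronous invocations."""
    config: Dict[str, Dict[str, str]] = {}
    if on_success_arn:
        config["OnSuccess"] = {"Destination": on_success_arn}
    if on_failure_arn:
        config["OnFailure"] = {"Destination": on_failure_arn}
    if not config:
        raise ValueError("provide at least one destination ARN")
    return config


def configure_destinations(
    client: Any,
    function_name: str,
    on_success_arn: Optional[str] = None,
    on_failure_arn: Optional[str] = None,
    max_retries: int = 2,
) -> Any:
    """Attach destinations to a function through a boto3-style Lambda client."""
    if not 0 <= max_retries <= 2:
        # Lambda accepts 0 to 2 retry attempts for asynchronous invocations.
        raise ValueError("max_retries must be between 0 and 2")
    return client.put_function_event_invoke_config(
        FunctionName=function_name,
        MaximumRetryAttempts=max_retries,
        DestinationConfig=destination_config(on_success_arn, on_failure_arn),
    )
```

The function code itself stays free of any "notify someone when done" logic, which is the point Jeremy raises: the routing lives in configuration, and the failure destination receives the error context. For branching, multi-step workflows, Step Functions remains the better fit, as Holly notes.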

Jeremy: Right. Yeah. No. I love Step Functions. I mean, I think that the way that you can do parallelization and that you can fan out and you can fan stuff back in, and you've got all the wait timers and things like that. I mean, it's just a very, very good solution. Back to the destination thing though, so one of the things that's really great about them is again, the reduction of my code, because every time I write code I'm introducing some liability into the system.

If I can just finish something and the output of that automatically gets sent and guaranteed to go to some of these services, that's really great from an asynchronous standpoint. From a synchronous standpoint, I would love to have that too. I don't know if that's something you guys are thinking about or is something that you would potentially put in, but I really do like the synchronous use case.

Holly: We haven't seen a lot of requests for that use case.

Jeremy: All right. Well, I would like you to add it. That's on my AWS wish list.

Holly: Okay. I've got it in the intake.

Jeremy: Awesome. All right. All right. I want to move on to serverless architecture in general and maybe just application architecture in general, not only how AWS does it, but how other people should be doing it. We know that there's a lot of isolation with Lambda functions and with Firecracker. I mean, so you're pretty good from a blast radius standpoint when you build single-purpose functions. Of course, there's the microservice pattern or microservice designs where you put a couple of Lambda functions together into a single CloudFormation stack and things like that.

I'm curious, how does AWS add additional security or build bulkheads? Is that something you do in a single account, or do you have multiple accounts or a separate account for each microservice?

Holly: Yeah. We recommend using a separate account per microservice and then also thinking about an account for each of your environments as well, your pre-production environment and your prod environment. Each one should have its own account as well. What that does for you, if you think about it, a lot of times, two pizza teams own a service or a small set of microservices, and you want to reduce the number of people who can actually access those services and make changes. I mean, it's an operational risk.

It's also a security risk having too many people have their hands on a microservice. You really want to make sure that the people who can access it are knowledgeable and know what they're doing. That will help you have a high availability as well as ensuring security. Of course, availability comes back to not only potential for someone to make a change that is a breaking change, but also things like ensuring that your limits are used and planned for in a way that makes sense for you.

Jeremy: Right. Yeah. No. I love that and I'm glad that we've cleared that up. You've heard it here, AWS separate microservice or separate account per microservice. I do love the idea of microservices and I love that you have all of that isolation, even another level of isolation, and you have the ability to set the limits. Your concurrency limits and all that stuff can all be set per microservice.

Holly: Yeah. I really like it too. Hopefully, everyone who has a business, that business is growing, you're scaling your service. I know we're scaling our service all the time. You build new features. You break apart services into smaller units to help scale with your teams. As you do that, if you thought about it as microservices with account permissions, then it also makes it easier for you to transition service ownership over time and have a new team pick it up with just that group having access. Kind of [inaudible 00:18:16] for growth as well.

Jeremy: Right. Yeah. No. I love that. I absolutely love that idea of breaking things up because you also have all this extra control over things like concurrency. You can control all those different things. Those different limits are controlled per group. Then the other thing that is great about that too, is each individual account is going to have its own roll-up of billing, so you can actually see what the cost is per microservice, which is pretty interesting.

Holly: Right. Your ownership as well, right?

Jeremy: Right.

Holly: Like when you want to go to your teams and say, "Why is this being billed to me this much?" It's really easy to go and tease that apart and talk to the right team member and get the right answers rather than charging around a very large organization trying to figure it out.

Jeremy: Right. Right. All right. Awesome. The other thing that I'd love to talk about is this idea of what are the next workloads that Lambda is going to be able to handle? If we think about machine learning, EFS handles some of that, but there's a lot further to go in order to handle that type of use case. Maybe support for other legacy databases, maybe even just larger memory, for example. What are some of these new use cases that we're hoping to unlock?

Holly: Yeah. We're looking and continue to look at different types of compute as well as larger memory sizes. Of course, with the way we do it, memory and CPU go hand in hand, so larger memory also implies more cores. Again, it's back to that big data, more compute-intensive workloads that we know can be unlocked by bringing more memory and more CPUs to customers' specific use cases. Then the one thing that I know we get asked for, it's increased duration, but one of the things ... And 15 minutes, is it right? Could it go further?

It could probably go further, but one of the secondary considerations is we want to cycle our customers' execution environments. The reason why we want to cycle those execution environments is because that helps with security. You don't want something really that runs indefinitely. You want it to be bounded in time, because then you know that you're getting that cycling of the environment, which then you know that you have a clean environment and that your workloads are safe and secure.

Jeremy: Right. Right.

Holly: Yeah. Those things come hand in hand. Do you want to cycle? How often do you want to cycle? The longer you go you don't want it to be indefinite.

Jeremy: Yeah. No. Right. No. I definitely agree. I mean, I think 15 minutes is maybe a little bit arbitrary, but it's a good number. I mean, maybe 30 minutes would be better for some workloads or whatever. Certainly, that's one of the things I love so much about Lambda is the statelessness of it. The longer that container runs, the more potential there is for everything. Memory leaks, security issues, or whatever, even if you've got variables saved in your global scope and things like that.
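The point about variables in global scope can be illustrated with a small sketch. It assumes the usual Python handler shape of `handler(event, context)`; the cache, the counter, and the event's `key` field are hypothetical. Anything defined at module scope is initialized once per execution environment and then reused by every invocation that environment serves, which is handy for caching but is also how stale or leaked state accumulates the longer an environment lives.

```python
# Module scope: initialized once per execution environment, not once per invocation.
_cache = {}
_invocations = 0


def handler(event, context):
    """Return a cached value and report how much state this environment has built up."""
    global _invocations
    _invocations += 1

    key = event["key"]
    cache_hit = key in _cache
    if not cache_hit:
        # Stand-in for an expensive lookup; the result outlives this invocation.
        _cache[key] = f"computed-{key}"

    return {
        "value": _cache[key],
        "cache_hit": cache_hit,
        "invocations_seen": _invocations,
    }
```

Calling the handler twice with the same key in one process shows the second call hitting the cache, which mirrors a warm invocation. When the environment is cycled, the module is loaded fresh and that state is gone.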

Yeah. No. I mean, but maybe bumping it up a little bit. I don't know. That might work. What about GPUs? Have you heard anybody asking for GPUs?

Holly: Yeah. I think that was when I ... I didn't talk as much about the potential for GPU, but we certainly hear from customers an interest and we are evaluating that as well.

Jeremy: Awesome. All right. Here's something that ties back to the vision and maybe this is the AWS vision, maybe this is your vision, but are we ever going to get to a point where like a hundred percent of our compute is serverless? Where we have just no need for containers, sub-millisecond cold starts. Maybe we have some coordinated parallel compute where Lambda functions can talk to one another. Maybe just from the Lambda perspective, is that your goal? Is it to, I guess, take over the compute world with serverless compute?

Holly: From a Lambda perspective, we're going to continue to work to remove limits and allow customers to bring more and more workloads to Lambda and to serverless, because we think it has such value. In terms of, will we get to a hundred percent? I think no one knows and only time will tell. From a Lambda perspective, we're going to do everything we can to continue to make it a great platform for customers and to remove things that get in the way of that for them.

Jeremy: Right. Yeah. No. I mean, for me personally, I would love it if I never had to manage another container or a server ever again, and every use case was solved. I think one of the big ones that comes up is the idea of cold starts. I mean, you look at a couple of other platforms that exist and again, they might be limited in terms of their language support and things like that, but some of them maybe they run on like the V8 platform or something, have very, very, very, very low cold starts. This is obviously a continued complaint.

I mean, I know we've got Provisioned Concurrency, but is that something where AWS is going to continue to keep pushing and pushing and get that cold start down to where it practically doesn't exist?

Holly: We're going to continue to drive down our steady state case of cold starts. We absolutely continue to work on that. We put a lot of focus into both our warm and cold latency to make our services as fast as possible. We put Provisioned Concurrency there to address customers' immediate need, because we know it takes a little bit more time and some real heavy engineering lifting to address it without that, but over time, those will get closer and closer to parity.

Jeremy: Awesome. All right. I want to talk about this idea of the Lambda supercomputer. I'm sure you're familiar with Tim Wagner, obviously. You worked with him. In terms of this idea of Lambda functions that can run in parallel and can talk to one another. I think the way that Tim was doing it with his test project was this idea of doing NAT punching and having them being able to coordinate with one another. This could open up a lot of use cases, especially, for big data, for genomics or anything big like that.

Are you thinking about making a way for Lambda functions to potentially mesh together and do this supercomputer use case?

Holly: Yeah. It's certainly an interesting use case, and it's something that I think Lambda is well-situated for, especially if you think about it from the standpoint of all the concurrency and bursts that you can spin up. You can spin up a lot of different nodes and then just based on routing the messages in the right way, end up with this large scale compute environment. I certainly think it is a possibility. It's certainly something that Lambda could do.

Jeremy: Right. Yeah. No. I mean, and Lambda, obviously you can use fan-out patterns and some of these other things, even Step Functions to coordinate and do parallel compute, but I do think it would be really interesting if there was a way for Lambda functions to directly talk to one another. I think that would open up some really interesting use cases. All right. Another thing I want to talk about is I guess, this idea of complexity in serverless. There are a lot of building blocks. Lambda is just one small piece of it.

If you're just building a small application and maybe you're just using Lambda and API Gateway and maybe DynamoDB, that's relatively simple. You can put it all into one CloudFormation template, or use SAM or something like that. That's relatively easy. Then you start integrating EFS and then you need an SQS queue, or maybe you're reading off of a Kinesis stream or you're using EventBridge, or you've now got 15 different microservices, all separated into different accounts. It gets really difficult to wrap your head around the complexity that's there.

I'm wondering ... And I know that there's open source things for Terraform and there's the CDK and obviously SAM and some of these things. Is there something where maybe the Lambda team or AWS in general is looking at another level of abstraction? I know you've got SAR and some other ways that you can package up some use cases, but is there something on the roadmap for what that next level of abstraction is to make it easier for companies to come in and adopt best practices and things like that? What's the vision around that?

Holly: Yeah. SAM, which you referenced, is intended to be that next higher level abstraction for building serverless applications. We launch every new feature with SAM support. For instance, EFS just launched and we wouldn't launch it without having that support. We also are big believers in the broader tooling ecosystem. The reason why we believe in that is we don't want customers to have to learn yet another toolchain, if they have a toolchain that they love. We support meeting customers where they are with the toolchains that they're most comfortable with.

That's a dual strategy. We build SAM as the top level of abstraction for serverless, and then we support a variety of third-party tools as well, so that customers can use those.

Jeremy: Yeah. No. I think the support for third-party tools is great. I know with observability, there's a whole bunch of tools that are there that can help you. I mean, just from an adoption standpoint, as you add complexity ... And again, it's going to get more complex over time. That's just how these things work. Is that something that potentially is going to hurt adoption if it just becomes harder and harder to integrate these services into your existing toolchains or into your existing workflows?

I mean, even like testing, for example, it's very, very hard to test locally. You have to test in the cloud. What's the vision there to just bring it closer to developers?

Holly: Yeah. To that point, we are continuing to invest and have a real focus on the developer tooling and the developer experience. We know that that's an important element of serverless. It's not just having it run great on your data plane. It's also how are you interacting? What's the tooling? What's the customer experience? Then, how do you operate it in and out as an engineer? It's nice being an engineer. We're working with a lot of engineers and then going back to the ... and adopting it ourselves, we see where we can improve as well, even firsthand.

Jeremy: Yeah. No. I think that's super important from an adoption standpoint. Just, I know a lot of developers have a hard time trying to do the testing and wrap their head around all these different changes and stuff or just the different way that some of this stuff works. All right. I'd love to ask you this question too. I tend to ask a lot of my guests, where do you see serverless going in five years? Or where do you think serverless will be in five years? You actually have a lot of control over this. Where would you like to see things go?

Holly: Well, where I would like to see things go, and going back to our earlier conversation on why not serverless? I would love to see the industry be running on serverless, just because I think it brings such a great experience for engineers. Going back to my experience and you heard 25 years in the industry or 25 plus, I've seen all the phases. I've seen the phases of a technology adoption. I've seen what we've asked of our engineers over time. Back when I started, it was you learned a language and you learned it really well, and you programmed on it.

Then you ended up with polyglot and you ended up with you're no longer on a box. You're driving a whole bunch of microservices and coordinating them together. Then testing, you used to have a test team who would test, and then you became the test team as well. All this stuff is good, but we've asked engineers to do more and more and more and more. I like that with serverless we are actually asking them to do less and to focus on the stuff that's really value-added.

I think that's a positive outcome for engineers. So when I think about it as a long-term engineering leader driving the most agility out of my teams, I think that serverless ... And I hope to see where serverless it's a why not.

Jeremy: Right. Yeah. Totally agree. Awesome. All right. Well, Holly, thank you so much for being here. I know that you've successfully managed to avoid social media, which is amazing. You're not on Twitter, but if people want to get a hold of you, how do they do that?

Holly: Yeah. They can connect with me on LinkedIn. I'm really easy to find. There are not that many Holly Mesrobians in the world.

Jeremy: Awesome. All right. Then also the two Under the Hood of AWS Lambda talks from re:Invent 2018 and 2019, I will put those in the show notes. Thank you so much for being here, Holly.

Holly: Great. Thank you.