Episode #92: Streaming Data at Scale Using Serverless with Anahit Pogosova (PART 2)

March 15, 2021 • 42 minutes

In this episode, Jeremy finishes his chat with Anahit Pogosova about where and why we'd use Kinesis, how Lambda helps you supercharge it, how to embrace (and deal with) the failures, common serverless misconceptions, and much more.

About Anahit Pogosova

Anahit is an AWS Community Builder and a Lead Cloud Software Engineer at Solita, one of Finland’s largest digital transformation services companies. She has been working on full-stack and data solutions for more than a decade. Since getting into the world of serverless she has been generously sharing her expertise with the community through public speaking and blogging.

Twitter: @anahit_fi
LinkedIn: https://www.linkedin.com/in/anahit-pogosova/
Solita: https://www.solita.fi/en/
"Mastering AWS Kinesis Data Streams, part 1": https://dev.solita.fi/2020/05/28/kinesis-streams-part-1.html
"Mastering AWS Kinesis Data Streams, part 2": https://dev.solita.fi/2020/12/21/kinesis-streams-part-2.html
AWS Community Day Nordics 2020: https://youtu.be/gtE2o8qsq-4

Watch this episode on YouTube: https://youtu.be/7pmJJcm0sAU

This episode sponsored by New Relic and Stackery.

Transcript

Jeremy: So you mentioned poll-based versus stream and things like that. So when you connect Kinesis to Lambda, this is the other thing too, I think that confuses people sometimes. You're not actually connecting it to Lambda directly for pretty much all of these triggers in these integrations. There's another service that is in between there. So what's the difference between the Lambda service and the Lambda function itself?

Anahit: That's a great one because I think it's, again, one of those very confusing topics, which are not explained too well in the documentation. And the thing is that when you're just starting dipping your toes in the Lambda world, you just think that, "Okay, I write my code, and I upload it and deploy it, and everything just works. And this is my Lambda," right? But you don't really know how much of the extra magic is happening behind the scenes, and how many components are actually involved into making it a seamless service. And there is a lot of components that come into ... so you can think of a Lambda function as the function that we actually write and deploy and invoke. But then the Lambda service is what does all the triggering, invoking and batching and error handling.

And it really depends on the way the Lambda works, or the way the Lambda service works. It really depends on the invocation model, whether it's poll-based or not poll-based. So again, one thing that is not too clearly explained, in my opinion, is that there are actually three different ways you can work with Lambda or communicate with Lambda. So you can invoke a Lambda synchronously. So the traditional request-response way, and the best example, I think, is API Gateway, which does that: it requests something from Lambda, and it waits for the response. Then there is the async way, which is one of the most common. So you just send something to Lambda and you don't care about what happens next.

Jeremy: Which uses an SQS queue behind the scenes to queue ...

Anahit: Exactly. Yes. That's also like fun facts that you learn along the way. But the point is that services like SNS, for example, or S3 notifications, they all use the async model, because they don't care about what happens with the notification. They just invoke Lambda and that's it. But then there is this third, gray area or a third totally different way of invoking the Lambda function, and it's called poll-based. And that's exactly how Kinesis operates with Lambda. And it's meant for streaming event sources, so it's both Kinesis Data Streams and DynamoDB Streams. Also, Kafka currently uses the poll-based model. And it also works with queue event sources like SQS.

Jeremy: Right. SQS, yeah.

Anahit: And Amazon MQ, I think, also uses the poll-based method. And the component that is most essential in the poll-based model is called the event source mapping. It's one of the misunderstood components, or one of the hidden heroes, I would say, that we find in Lambda, because it's an essential part of the Lambda service. And event source mapping actually takes care of all those extra things that the Kinesis plus Lambda combination is capable of. So it's responsible for batching, it's responsible for keeping track of the position in the stream, in a shard, where it's ...

Jeremy: A shard iterator, because anybody wants to know the ...

Anahit: Yes, exactly, shard iterator.

Jeremy: ... technical term for it.

Anahit: Yes, thank you. And, yeah, the most important for me, it handles the errors and retries behind the scenes.

Jeremy: Right.

Anahit: And basically, if you don't have event source mapping, you can't have batching. So it takes care of accumulating, or in case of the standard consumer, it polls your Kinesis stream on your behalf, it accumulates batches of records, and then it invokes your Lambda function with those batches of records that it accumulated. Again, in case of enhanced fan-out, of course, it doesn't poll, it gets the records from the Kinesis stream directly. But then from the perspective of your Lambda function it doesn't matter, it just gets triggered by the event source mapping, because as you've said yourself, it's not the Lambda that you connect to the Kinesis stream, it's the event source mapping that you connect to the stream, and then you point your Lambda to that event source mapping, so.
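As a minimal sketch of what the function sees on the other side of the event source mapping: the batch arrives as a list under `Records`, with each payload base64-encoded under `kinesis.data`. The handler name, the output field names and the assumption that producers write JSON are illustrative, not something stated in the conversation.

```python
import base64
import json


def handler(event, context):
    """Hypothetical consumer: decode every record in the batch that the
    event source mapping hands to the function."""
    decoded = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Kinesis payloads arrive base64-encoded inside the Lambda event.
        payload = base64.b64decode(kinesis["data"])
        decoded.append({
            "partition_key": kinesis["partitionKey"],
            "sequence_number": kinesis["sequenceNumber"],
            "body": json.loads(payload),  # assumes producers write JSON
        })
    # Returned only to make the sketch easy to inspect locally.
    return decoded
```

The function has no idea whether the batch was polled or pushed through enhanced fan-out; it only sees the list of records.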

Jeremy: Right. So you can connect a Lambda function or the Lambda service directly to the Kinesis stream itself, or you can use enhanced fan-out and push it to the Lambda function. Although, for all intents and purposes, it's pretty much the same thing.

Anahit: Yeah. And for your Lambda function, it doesn't really matter how that data ended, or how those records ended up there, you just get a batch of records, and then you deal with it. And I mean, all the rest is pretty much the same from the perspective of a Lambda function, because it's nicely abstracted behind the event source mapping, which hides all that magic that happens behind the scenes.

Jeremy: Right. So you mentioned some aggregation stuff in there and about like windows and time windows and things like that. So tumbling windows, that's something you can do in Kinesis, as well. Can you explain that?

Anahit: Yeah, it's a feature that actually came out very, very recently. At the end of re:Invent, I would even say, and I think it was like one day before I was going to publish the second part of my blog post, which was finally ready to submit, and then in the evening I get this and I was like, "Okay, I have to write a whole new chapter now." But it is a very interesting aspect, you can use it with both Kinesis and DynamoDB streams, actually, so it's available for both. And it's a totally different way of using streams, which wasn't there before. So with a Lambda function, you know that you can't retain state between your function executions unless you are using some external data source or database.

And here, what you're allowed to do with this tumbling window is that you can persist the state of your Lambda function within that one tumbling window. So a tumbling window is just a time window, it can be a maximum of 15 minutes, and all your invocations within that 15-minute max interval can aggregate the state and pass it on to the next Lambda invocation. So you can do cool things like real-time analysis that you could previously do only with Kinesis Data Analytics, for example. Here you can do it right inside your Lambda. And then when the interval is ending, the 15-minute interval, for example, you can send that data somewhere, let's say to a database or somewhere else. And then the next interval is starting, and then you're accumulating again.
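A rough sketch of that pattern, assuming the event source mapping has a tumbling window configured: each invocation receives the previous `state`, returns the updated one, and flushes when `isFinalInvokeForWindow` is set. The `value` payload field and the `save_aggregate` helper (a stand-in for a database write) are hypothetical.

```python
import base64
import json

FLUSHED = []  # stand-in for a database table (hypothetical)


def save_aggregate(window, count, total):
    """Hypothetical sink: record the finished aggregate for one window."""
    FLUSHED.append({"window": window, "count": count, "total": total})


def handler(event, context):
    # State returned by the previous invocation in this window
    # (empty at the start of a window).
    state = event.get("state") or {}
    count = state.get("count", 0)
    total = state.get("total", 0)

    for record in event["Records"]:
        body = json.loads(base64.b64decode(record["kinesis"]["data"]))
        count += 1
        total += body["value"]  # assumed payload field

    if event.get("isFinalInvokeForWindow"):
        # The window is closing: flush the aggregate once.
        save_aggregate(event["window"], count, total)
        return {"state": {}}
    # Hand the running aggregate to the next invocation in this window.
    return {"state": {"count": count, "total": total}}
```

The state never leaves the shard it was built on, which matches the limitation that aggregation only works per shard.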

And so it's pretty fascinating in the sense that it allows you to do something that wasn't there before. It's a completely different way of using the Lambda basically, with the streams. But of course, there are limitations with that, you can only aggregate the data on the same shard because one Lambda is processing one shard at a time.

And then there is also this thing called parallelization factor, which we haven't talked about. But which basically means that instead of having one Lambda reading from one shard at a time, you can have actually up to 10 Lambdas that are reading from that same shard. So you can boost the power of reading, because if, for example, a Lambda execution takes too long, and you can't keep up with your stream, then you can either add more shards to your stream, but it's expensive, that takes time and has some limits. Or then you can immediately just throw more Lambdas at it, just say like more horsepower, and they will take care of it. But if you have more than one Lambda reading from a shard, you can't use this new tumbling window feature, which makes sense, of course.

Jeremy: Right. And that depends on what you're doing because I mean, the idea of the parallelization factor, such a hard word to say. But the whole point of that is to say you're reading up to 1000 records per second off of this stream. And if for some reason it takes more than a second to process one of those records or whatever, then you're going to see the problem with not being able to process enough records quickly, because you're backing up your stream if you're writing to it. So again, it's just one of those trade-offs.

Anahit: Yeah, but again, this new feature, I think it's going to be developed still, maybe someday it's going to have some kind of support for it, though I can't see really how under the hood, it would function between different Lambdas in the central. But anyhow, this, I think is a very cool new thing that I'm actually eager to try out in production if I just figure out a case for that because it just looks so cool. And it's so simple to do.

Jeremy: Yeah. Well, I mean, and the other thing is, is that depends on what you're doing with it. So the use case that I've seen, and actually I started playing around with like SQS batch windows to try to do something similar. I know they're different, but they're same idea where when you're doing aggregations, if you're just reading off a stream, and you're trying to aggregate, you have to grab that data from somewhere, like you said, because Lambdas are stateless.

So you have to query a DynamoDB database or something like that, or table and then pull back what the last aggregations were. And then you read in the data from the stream, and then you write, you do your aggregation there, and then you write it back to the DynamoDB. If you're doing that hundreds of times a second, that is pretty inefficient, where if you just did it, set your tumbling windows to one minute even, and you could read thousands of records, and then be able to just write that back to the database one time, just the efficiency gain there is huge.

Anahit: Exactly, exactly. If you have a use case that is like that, because I personally don't, that's why it's ... I'm trying to come up with one in the future ...

Jeremy: Come up with ...

Anahit: Yes.

Jeremy: Find the problem for the solution, right?

Anahit: Yes, exactly. But yeah, it can be very, very helpful. And again, it's pretty straightforward to use. So I can see a lot of people loving it really.

Jeremy: Yeah. Awesome. All right. So another thing I just want to mention, because we keep talking about Lambda, and we mentioned the concurrency thing and some of those other bits. In terms of provisioning shards and having one Lambda per shard, and then potentially, if you do the parallelization factor, 10 Lambdas per shard, if you had 100 shards, because you had a lot of data coming in, and you had the parallelization factor turned on, then you've got 1000 concurrent Lambdas being run at once, which I did ...

Anahit: And guess what happens next?

Jeremy: So what happens next, yeah. And for people who don't know, the soft limit in any region is 1000 concurrent executions for your Lambda concurrency. So, that's just something that people need to think about.

Anahit: Yeah, for sure. And that's something I bring up quite often, because we've been there, done that, but actually 100 shards is not even too much. There are apparently streams with 1000s of shards. So we have something like 40 shards in our stream. So it's a really quite, quite decent amount. But yes, as you said, exactly, so if you have, for example, 100 shards, and then you have a parallelization factor of 10, you will have 100 times 10, 1000 Lambdas, running or consuming that stream at all time. So there will be constantly 1000 Lambdas, concurrent Lambda invocations. And you probably won't run into any problems until there is some other Lambda in that same region in the same account, that is probably very business-critical, that does something very important, and then it starts to fail for some unknown reason. And that reason is not even that Lambda, the reason is your stream, which is consuming the entire budget that you have allocated for Lambda.
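The arithmetic here is simple enough to sketch. The function names are made up for illustration; the 1 to 10 range for the parallelization factor and the 1000 default limit are the figures mentioned in the conversation.

```python
def stream_concurrency(shards: int, parallelization_factor: int = 1) -> int:
    """Concurrent Lambda invocations one stream consumer can hold."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    return shards * parallelization_factor


def remaining_concurrency(account_limit: int, shards: int,
                          parallelization_factor: int = 1) -> int:
    """Concurrency left for every other function in the same account and region."""
    return account_limit - stream_concurrency(shards, parallelization_factor)


# 100 shards with a factor of 10 consume the whole default limit of 1000.
print(remaining_concurrency(1000, 100, 10))
# A 40-shard stream with the same factor still leaves some headroom.
print(remaining_concurrency(1000, 40, 10))
```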

So yeah, it's something people overlook quite often, that though Lambda scales endlessly, potentially, in reality, all the services come with the safety mechanism of soft limits. There is no service in AWS, I think, that comes out of the box with no limits, just use it as it is. So basically for your own safety, there are some soft limits. And on the other hand, though, they are soft, which means that you can increase them by submitting a ticket to support. It will take some time I warn you, especially if you go higher than normal.

But though you can do that, there still is going to be a limit. There is always going to be a limit. And you just need to know that it exists because one day, you're probably going to hit it. And so you have to monitor that all the time. And yeah, that's one thing to keep in mind. But that's a common thing with SQS as well, for example, and maybe SNS. So with all the services that can scale Lambdas pretty much out of hand, you're faced with those Lambda concurrency limits that you have to be careful with.

Jeremy: Right. One limit that I love that has nothing to do with Kinesis but with SNS is for, if you send SMS messages with SNS, I think the default limit is $1 spend per month. So if you send like 200 text messages or something like that, it ends up cutting you off. Maybe not 200, you can probably send more than that. But it is a very, very low limit. And I think it's just because they don't want people, I don't know, spamming SMS, or something like that. But anyway ...

Anahit: Can you increase it? Is it soft limit?

Jeremy: No, no, you can increase it, yeah. But you have to submit the ticket. But basically, I remember, I set up a new account for something and we were doing all these alarms, and it was like, within two days, I get a message saying ...

Anahit: "That's it. That's enough."

Jeremy: ... "No, you can't... no, you've exceeded your limit." And I was like, "Well, that was fast." So and if you're using something like Control Tower, or any of these things to provision hundreds of accounts, some of these soft limits that are in there can affect you. So, whether it's Lambda or some of these other ones, but ...

Anahit: It's surprisingly easy to reach all of the soft limits, as long as ... I mean, as soon as you go to the real-world cases from those "Hello, World!" cases. And yeah, it's not a problem reaching the limits, per se, the problem is many people don't know that they are there.

Jeremy: Yeah, yeah good point.

Anahit: That's when the problem starts.

Jeremy: Good point. All right. So speaking of limits, there are limits, obviously to Kinesis. And some of these things that maybe even go beyond some of the soft limits. I mean, there's just limitations in distributed systems, and there's limitations in network throughput and some of those other things. And so as you hit some of those limits, or maybe let's just talk about errors in general, as you start to run up against problems, whether they're caused by limits or whether they're caused by something else, what are some of the things that I guess, could go wrong when you're using Kinesis?

Anahit: Right. Well, that's my favorite topic, really. But I mean, with every service, as you said, I mean, nobody says it better than Werner Vogels who says that, "Everything fails all the time." And I love that phrase because that's true.

Jeremy: Very true.

Anahit: And it's not because you want to be pessimistic, but rather, because you want to be prepared, and you want to sleep better at night. Because if you're not prepared, then surprising things will happen eventually. And for me personally, with Kinesis, or any other service, really, when I start working with a new service, first thing basically that I ask is that, "What are the ways in which it fails? What are the possible errors? What are the possible limits? Are they hard limits? Are they soft limits?" All this. And even what's the built-in functionality for retries, for example? What are the default timeouts? And that kind of thing?

So those are very common things that you start to question after you have got a lot of headache with one of the services. And then you start to ask those specific questions when you start working with a new service. And with Kinesis, you can probably separate the errors for writing to a Kinesis stream and reading from a Kinesis stream.

So for writing, well, first, it's nice that the AWS SDK has built-in functionality for retries for all the failures or system failures that happen, and timeouts as well. And it's not documented, or it used to be not documented, too well, because personally, I learned about the built-in retries when I was developing unit tests, and they were behaving weirdly, and I was like, "Something's going on here. What's that?" And then I realized, oh, it retries three times by default for every system error. Wonderful, that's wonderful news.

But maybe not so wonderful news is the thing called partial failure. And it's actually very common for all the services that are using batching. So what it means is that when you, for example, write a batch of records to Kinesis, it's not an atomic operation. It's not that either the entire batch succeeds or the entire batch fails and you get an error code back from Kinesis. The reality is that you almost always get a success code back from Kinesis, and it's very misleading, because parts of that batch could have failed, and you don't know about that. And what you should do, instead of just waiting for an error to come back, is to look at the response that comes back from Kinesis, and to see if there is this field called failed record count or something like that, which basically tells you whether there were actually failures within that batch that didn't go through to Kinesis. And that can happen, for example, because of throttling. So some of the records just didn't make it to the Kinesis stream.

So, that's one of the main issues, basically, that we have had with Kinesis streams. And you have to take care of those partial failures manually, and you have to do some smart retries with backoff and random jitter and things like that. And then there are the timeouts of course, which always happen, and you need to know the default settings for the timeouts. Because in case of Kinesis, for example, the service times out after two minutes. So, and actually, there are two timeouts. There is a timeout for a new socket connection, and then there is a timeout for sending a request.
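One way to handle partial failures by hand might look like the sketch below. It assumes a `put_records` callable that takes the same keyword arguments and returns the same response shape as the Kinesis `PutRecords` call (a `FailedRecordCount` plus one result per record, in request order, where failed entries carry an `ErrorCode`). The function name, the attempt count and the delays are illustrative assumptions, and the callable is passed in so the logic can be tried without AWS.

```python
import random
import time


def put_records_with_retry(put_records, stream_name, records,
                           max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Send a batch and resend only the records that failed.

    Returns the records that still failed after max_attempts.
    """
    pending = list(records)
    for attempt in range(max_attempts):
        response = put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return []
        # Results come back in request order; failed entries have an ErrorCode.
        pending = [
            record
            for record, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        if attempt < max_attempts - 1:
            # Exponential backoff with random jitter before the next attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    return pending
```

With a real client the callable would be something like `kinesis_client.put_records`. Anything the function returns still needs a decision from the caller, for example logging it or parking it somewhere for later.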

So first, you will wait two minutes to create a socket connection, and then you will wait another two minutes for sending the request, then it will be retried three times. And then like 10 minutes in and you're still waiting for one batch to go through, in a pessimistic scenario. And again, those are things you don't really see in the documentation right away, and those are the things that you end up finding out because you have some problems. And the other point is that you almost or you always have to set the timeouts to a lower value than the default two minutes.

Jeremy: Yeah, those defaults ...

Anahit: It's crazy.

Jeremy: ... defaults are not great.

Anahit: Yeah. No, no, not at all. So those are the main things with writing. So like partial failure, sometimes timeouts and that kind of things. But with reading, things get even more interesting, because there's so many options. And one of the things that is very common, it's called poison pill record.

Jeremy: Yeah, the poison pill.

Anahit: Oh, the poison pill, yes. And nowadays, it's actually pretty avoidable, but let's get back to it later. But the idea of poison pill is that if you have a Lambda function attached to your Kinesis stream, and it's reading from the shard, and everything is fine, until there is some corrupt record in your shard for some reason. And your Lambda function tries to read that record and it fails, and then you don't have proper error handling because well who needs error handling, and then your entire Lambda function fails, right? But when your entire Lambda function fails, what happens is that Lambda returns or as we know, event source mapping actually returns the entire batch back to the stream. And then it retries or makes Lambda retry with that entire batch that just failed.

Jeremy: And speaking of defaults, it retries 10,000 times I think by default?

Anahit: No, you're too optimistic. By default, it retries forever.

Jeremy: Oh, forever.

Anahit: Yes, it retries until the data expires, which means anywhere from 24 hours up to seven days, or even one year. But let's explain why it's not a good thing. Well, first of all, you don't want to have all these unnecessary Lambda invocations that don't do anything. They just get sent the same records, and then they ...

Jeremy: They keep failing ...

Anahit: Yes. They keep failing at the same point of the batch, and then they start all over again. And it's pointless. But the problem is that in, well, let's say 24 hours, let's take the optimistic scenario, so in 24 hours, the batch finally expires, and Lambda can forget about it. So the batch gets deleted from the stream, and then the next batch comes in. But the problem here is that by that moment in time, your stream or your shard is probably filled with records that were written around the same time as the records that you were trying to process, which means that they are expiring around the same time as the previous batch.

And if your Lambda is not quick enough, you might end up in a situation where records end up falling off your stream. It's this overflowing sink analogy that I had in my blog post, when you basically pour water into the sink more quickly than you can drain it. And then the water ends up on the floor. So, that's the exact situation. So basically, what ended up happening is that just because of having one bad record, and no proper error handling, you can potentially end up losing a lot of records. So hence the poison pill, because that one bad record poisoned the entire shard basically.

Jeremy: And I'm actually curious, this is something I've never tested, but let's say that you get a batch of records, it's only say 100 records, because there's only 100 records in the stream. So it sends that batch to Lambda and then Lambda fails, because there's a poison pill and it sends those 100 records back. If another 100 records come in, because it's still under whatever your threshold was for batches, would it then send in like the 200 the next time, and then will it keep sending in up to the full batch amount as it retries those batches?

Anahit: I would imagine it should really. That would make sense. We just never had the situation because we usually have the complete batch.

Jeremy: Right, you have a full batch.

Anahit: But I would imagine that's how it should work, yes, because it accumulates the entire batch. But it doesn't matter because it will stop at the exact same record. It processes them in order, so it will stop at that exact same record. And then, well, of course, if you don't process them in order, you can process all the records in parallel, if you want to. But then you won't have the ordering. But yeah, it's a very funny situation, and it's very easy to end up in it, and been there, done that once again, but luckily ... Well, first of all, proper error handling in your Lambda function, where you don't allow the entire function to fail just because of one record that didn't go through. And then there are different ways to approach that.

And then the thing that I was talking about a lot or mentioning a lot is the error handling that comes out of the box with event source mapping. And actually, it's developing. Each year, they are adding new functionality and new possibilities that weren't there before. So what you said about 10,000 retry attempts, it's a totally new feature, it wasn't there before. They added this maximum retry attempts setting to the event source mapping. But again, by default, it's minus one, which means that it retries infinitely. But you can set it to up to 10,000 if you want to. And then you can set the maximum age of the record that Lambda will accept. So if the records get older than some specific age, I think it can be up to one week even, your Lambda will skip the records and won't process them.

And then there are on-failure destinations, where you can send the information about your failed records if everything fails. Then I think that one of the fun possibilities is the so-called batch bisecting. So it's when you basically split your problematic batch in two, and then Lambda tries to send these two parts separately, and then hopefully, one of them succeeds. And then it continues with the failed one and splits it recursively further until hopefully, you end up with just one bad record.
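As a sketch, these settings could be collected in a dictionary shaped like the parameters of Lambda's `CreateEventSourceMapping` API. The concrete values and the queue ARN are placeholders chosen for illustration, and `check_settings` is a hypothetical helper that only checks the documented numeric ranges.

```python
# Error-handling settings in the shape of CreateEventSourceMapping parameters.
# The values and the ARN below are illustrative assumptions.
ERROR_HANDLING = {
    "MaximumRetryAttempts": 3,           # default -1: retry until the record expires
    "MaximumRecordAgeInSeconds": 3600,   # default -1: never skip records by age
    "BisectBatchOnFunctionError": True,  # split a failing batch in two before retrying
    "DestinationConfig": {
        "OnFailure": {
            "Destination": "arn:aws:sqs:eu-west-1:123456789012:failed-batches"
        }
    },
}


def check_settings(settings):
    """Return a list of problems found in the numeric settings."""
    problems = []
    retries = settings.get("MaximumRetryAttempts", -1)
    if not -1 <= retries <= 10_000:
        problems.append("MaximumRetryAttempts must be between -1 and 10000")
    age = settings.get("MaximumRecordAgeInSeconds", -1)
    if age != -1 and not 60 <= age <= 604_800:
        problems.append("MaximumRecordAgeInSeconds must be -1 or 60..604800")
    return problems
```

With boto3, a dictionary like this could be unpacked into `create_event_source_mapping` alongside the stream ARN, function name and starting position.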

Jeremy: Just the one.

Anahit: Yes, but on the way there, you actually end up sending same records over or processing same records over and over and over again. So it's not optimal. And then there was one more announcement around the same time, because of which I had to update my workflows. I think it's called custom checkpoints.

Jeremy: Custom checkpoints, yep.

Anahit: Yeah. It's basically common sense. Instead of just failing your Lambda saying, "Well, no can do. There was a batch, I don't know, something bad happened." Instead of that, you can return the exact sequence number of the record that caused the problem. So if you went on with your batch, you processed your records, and then you return that back to event source mapping, and it knows that, "Okay, next time I retry, I will start from that one, rather than starting, again, from scratch." So.
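A minimal sketch of a handler that reports such a checkpoint, assuming the event source mapping is configured with the `ReportBatchItemFailures` function response type. The `process` function and its `value` check are hypothetical business logic.

```python
import base64
import json


def process(body):
    """Hypothetical business logic that rejects malformed payloads."""
    if "value" not in body:
        raise ValueError("missing 'value'")


def handler(event, context):
    for record in event["Records"]:
        sequence_number = record["kinesis"]["sequenceNumber"]
        try:
            body = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(body)
        except Exception:
            # Stop at the first bad record and report it, so the retry
            # resumes from this record instead of from the start of the batch.
            return {"batchItemFailures": [{"itemIdentifier": sequence_number}]}
    # An empty list tells the event source mapping the whole batch succeeded.
    return {"batchItemFailures": []}
```

Records before the reported one are not sent again, which is what makes this cheaper than bisecting. A real handler would probably also log or park the poison pill somewhere rather than retry it until it expires.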

Jeremy: And that should eliminate the need to do the bisecting?

Anahit: Yeah, that's ...

Jeremy: The bisecting. Yeah, right. So if you're ...

Anahit: That's what I'm thinking.

Jeremy: ... have an existing system that is using bisecting, you don't have to change it. But AWS likes to do that, where you keep the old functionality in, but there's a better way to do it. The same thing with dead-letter queues and Lambda destinations, right?

Anahit: Yes, exactly. But for the sake of it, if you like the idea of just kind of splitting your batch, and sending them separately behind the scenes without doing anything, well, you can have that. But yes, of course, this new functionality would be so much better, because you would avoid all this unnecessary reprocessing of the same records. Yeah.

Jeremy: Right. So what are some of those other common issues? I mean, you mentioned timeouts, and maybe like network issues obviously happen, but what are some of maybe the other distributed network things that pop up when you're using Kinesis?

Anahit: Yeah, I think the timeouts and network problems are really the core of it, most of the time really. And the other one that I've mentioned several times already is the so-called at-least-once guarantee. Basically, with Kinesis, you are not guaranteed to get your data exactly once, it's at least once. So you will have duplicates in your stream. And it's because of, for example, the retry functionality that we just discussed, both with sending and receiving the records. But also, the network issues contribute to that, because you might have sent a batch of records to Kinesis, but never heard back from it. You just didn't get the message so to speak. And then you will retry it because you don't know whether it went through or not. And then maybe it did go through and then you end up writing the same batch all over again.

These are the things that happen pretty much all the time. And the only way to deal with them is just to know that they happen and to be prepared for them. With the at-least-once guarantee, your downstream systems must be resilient, in the sense that their state won't change if the same data comes in over and over again. So they need to be able to handle those repeated records in your stream. And then with the network problems, well, there's not much you can do about network problems. Of course, if you have a producer that is running inside a VPC, creating a Kinesis VPC endpoint is a good idea, so the traffic won't leave your VPC. But pretty much, that's the only thing you can do about those.
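One common way to get that property is to deduplicate on an identifier that the producer assigns to each event, since a record written twice by a producer retry gets a different sequence number each time. The sketch below keeps the seen identifiers in memory purely for illustration; the class and field names are hypothetical, and a real system would keep this in a durable store.

```python
class IdempotentSink:
    """Toy downstream system that ignores events it has already applied."""

    def __init__(self):
        self._seen = set()   # in-memory stand-in for a durable store
        self.totals = {}

    def apply(self, event_id, key, amount):
        """Apply the event once; return False when it is a duplicate."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.totals[key] = self.totals.get(key, 0) + amount
        return True


sink = IdempotentSink()
sink.apply("evt-1", "orders", 10)
sink.apply("evt-1", "orders", 10)  # duplicate delivery, ignored
sink.apply("evt-2", "orders", 5)
print(sink.totals)
```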

But on the other hand, you can handle those issues with ... or let's say timeouts are also network issues in some way or quite often. And the thing that we were discussing before that default timeouts are really not that great, most of the time you need to adjust those with Kinesis, especially, not especially, but that's a good example, maybe. But actually, one fun thing I remembered about the timeouts is related to DynamoDB, which are probably familiar to you, in a sense, because the DynamoDB also has some ridiculous default timeout, like a minute, two minutes, something like that.

And when a couple of years ago, at re:Invent, I was speaking with one of DynamoDB guys, and was asking that, "Okay, we have this API that needs to retrieve data from DynamoDB, and it needs to be very, very quick. So latency should be very low." And we used to have Lambda in between, so Lambda was doing calls to DynamoDB. And the first thing he said was, "Reduce the timeouts." Because apparently, DynamoDB can timeout pretty frequently. So it's much better to drop the connection sooner rather than later. So you set the timeout to, I don't know, 1000 milliseconds, and then you let the SDK handle the retry, instead of waiting for, like forever. But that was funny. That was the first thing that they recommended me to do. "Okay. "

Jeremy: Yep. Even though they set those defaults pretty high, but ...

Anahit: Yeah, exactly.

Jeremy: All right. So then, in terms of monitoring this, though, I mean, that's one thing that I really like about Kinesis is that you do get quite a few metrics where you can look and see how your shards are doing, how quickly they're being drained, how backed up they are, and stuff like that. What are some of those, I guess, the most important metrics that you want to keep your eyes on?

Anahit: Right. So of course, there are separate ones for writing to the stream and for reading from the stream. So I would say for writing, what is it, write provisioned throughput exceeded, is the metric that tells you that you exceeded the throughput of your stream basically. So that's the one that was pretty much eye-opening for us, because well, the thing is, I think, with metrics in general is that they are at best minute-based. So they are aggregate metrics, or aggregate values over one minute's time, right? And with Kinesis, as we have mentioned several times, all the limits are per second. So it's 1000 records per second or one megabyte per second. And that's the information you don't get from the metrics. So you don't see the per-second picture.

So there is a metric that tells you how many records come in and how much data comes in. And you might look at those and think, "Okay, the threshold is still far, far away, I for sure have no issues with the stream." And then you notice that there is this provisioned throughput exceeded exception metric that keeps popping up, and you figure out that, "Okay, apparently, bad things can happen even in this situation." Of course, it's because of, for example, the network issues that we discussed before, or a spike in traffic, because the records ...

Jeremy: Lots of traffic.

Anahit: Yep. The records arrive at your stream non-uniformly in a way, so it might be that one second, it's like 5000 records. And the next second, it's just like three records. And you can't see that in metrics, or even one metric. You have to observe the metrics that tell you what goes wrong in a way. That's the key, I guess. And same goes for reading from the stream, really. There is this read provisioned throughput exceeded, which is basically only for a shard iterator case. So the standard reading, consuming the stream. So when you exceed, for example, two megabytes per second, or you exceed this five requests per second, which we won't even go into. Read my blog post, you will know what I'm talking about.
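
The gap between per-minute metrics and per-second limits can be shown with a small sketch. It assumes a single shard with the 1000 records per second write limit mentioned above, and it ignores the one megabyte per second limit and how records spread across shards. The function name `summarize_minute` and the traffic numbers are made up for illustration.

```python
RECORDS_PER_SECOND_LIMIT = 1000  # per-shard write limit from the conversation


def summarize_minute(records_per_second, shards=1):
    """Compare the aggregate view of a period with the per-second view.

    records_per_second -- list of incoming record counts, one entry per second
    shards             -- number of shards (assumes perfectly even distribution)
    """
    limit = RECORDS_PER_SECOND_LIMIT * shards
    total = sum(records_per_second)
    # What an aggregate metric suggests: total traffic versus total capacity.
    capacity = limit * len(records_per_second)
    # What actually matters: individual seconds that went over the limit.
    throttled_seconds = [i for i, n in enumerate(records_per_second) if n > limit]
    rejected = sum(n - limit for n in records_per_second if n > limit)
    return {
        "total_records": total,
        "minute_utilization": total / capacity if capacity else 0.0,
        "throttled_seconds": throttled_seconds,
        "rejected_records": rejected,
    }


# One burst of 5000 records, then a trickle of three records per second.
traffic = [5000] + [3] * 59
summary = summarize_minute(traffic)
print(summary)
```

In this made-up minute the aggregate utilization is under ten percent, so the stream looks nearly idle, yet the first second alone exceeds the limit by 4000 records. That is why the throughput-exceeded metric is the one to watch rather than the incoming record counts.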

But you get those, and then, I think the most important one when it comes to reading from the stream is the iterator age, because that's the one that tells you the age of the records, meaning how long they have been in the stream. And apparently, if the age increases, it means that you can't consume them fast enough. So then you might have a problem, either with your consumer, for example, or you have too many consumers, and then you have to have the enhanced fan-out and things like that.
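
As a rough sketch of how one might act on iterator age, the check below flags a consumer as falling behind when the latest age is above a threshold or when the age has grown for several datapoints in a row. The function name, the 60-second threshold, and the window of three datapoints are all assumptions for illustration, not recommended values.

```python
def consumer_falling_behind(iterator_age_ms, max_age_ms=60_000, window=3):
    """Return True if the iterator age suggests the consumer cannot keep up.

    iterator_age_ms -- iterator age datapoints in milliseconds, oldest first
    max_age_ms      -- hypothetical absolute threshold for the latest datapoint
    window          -- number of consecutive increases treated as a trend
    """
    if not iterator_age_ms:
        return False
    # Records are already sitting in the stream longer than we tolerate.
    if iterator_age_ms[-1] > max_age_ms:
        return True
    # Otherwise look for a steady upward trend over the last `window` steps.
    recent = iterator_age_ms[-(window + 1):]
    if len(recent) < window + 1:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


print(consumer_falling_behind([0, 0, 100, 0]))        # brief blip, recovered
print(consumer_falling_behind([100, 200, 400, 800]))  # age keeps growing
```

A single spike that drains again is ignored, while a steadily growing age is flagged even before it crosses the absolute threshold.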

But there are basically like two, three metrics that you have to keep an eye on. And if you see any issues with those, then you have to dig deeper, maybe enable the enhanced metrics, which are not stream-level, but shard-level metrics. For each shard, you can have the same or similar information so you can diagnose it more precisely.

Jeremy: Right, yeah. And if only it was serverless, or I should say fully serverless, and did this automatically for us, that would be much better.

Anahit: Yes.

Jeremy: Well, so just like your blog post, this episode turned out to be quite lengthy. But I hope people got quite a bit of knowledge from this, and are not afraid of using Kinesis, because it's an amazing service. Yes, it has all of those caveats that we talked about, but it's still an amazing service. But if you've got a few more minutes, I'd love to just pick your brain for a second, because I think there are a lot of common misconceptions about building serverless applications, and again, whether Kinesis is serverless or not, we'll put that aside. But just all of these different services, even Lambda, and having to build in the retries, and know about either bisecting or using the custom checkpoints or doing some of these other things, there's a lot that goes into it. So what are some of the ... and maybe just even from your own perspective, when you're building serverless applications or using fully managed services? Like what are just some of those misconceptions that maybe people have?

Anahit: Yes, I've noticed those, well, a few of them actually, when working with serverless. And people usually have strong opinions about serverless. It goes both ways. But I think many people assume that it's either very easy, and you don't have to do anything, everything is done for you, or then it's way too complicated. And I think, again, Yan Cui had a nice blog post lately about the complexity of serverless, or perceived complexity of serverless. And what he was saying is that serverless is not complex, it just reveals the underlying complexity of the systems that we used to build before. So all those things that were built in and hidden from everybody's eyes, but they were still there. Now, they are more obvious with using all the different components, and you connect them to each other, and you have all that ecosystem living there, but ...

Jeremy: Which gives you more control over the individual components as well.

Anahit: ... it gives you more observability, it gives you more control and all these nice things why we love serverless. So I'm all for it. But on the other hand, I think it's a simplistic view to think that fully managed and serverless, it means that you basically just deploy your code, and you have to worry about nothing. Because as we discussed with you several times already, yeah, you will probably get away with that on the "Hello, World!" level, it will be pretty much okay. But then when you get to the real world and real world scale, you actually do need to know in quite some detail how each and every service that you are using, how they work, and how they fail. Because once again, they will fail at some point, and you basically ... you need to know how they fail and what can happen just to sleep at night.

Jeremy: Yeah. And I also think just this idea, again, the set it and forget it, for simple things, like you said, yes, but just ongoing management, right? I mean, optimizations and shards, refactoring code, and with the shards thing, monitoring that and saying, "Hey, we're starting to creep up to this next level, or maybe we're not processing fast enough, or maybe our shard iterator keeps pushing over a certain amount of time during certain times of the day."

Anahit: I'm getting anxious, now.

Jeremy: All right. You want to go back and look at all those metrics, right?

Anahit: Exactly! But that's exactly right, but maybe it will sound scary. Well, we'll put it that way, but on the other hand, the ... again, well, it reveals the complexity your systems do have anyway. And the good news here, I think, is that in the case of AWS, there are a lot of commonalities in how services work.

Jeremy: Yeah, true.

Anahit: And once again, I think understanding one service through and through will help you to understand all these issues with distributed systems and their errors and built-in retries and whatnot. So you don't really need to remember every single thing by heart, and it's not as overwhelming as we make it sound at the moment. It does require some work, but I think it's well worth it.

Jeremy: I totally agree. Well, Anahit, thank you so much for taking the time to talk with me and educate the masses about Kinesis. If people want to find out more about what you do or want to contact you, how do they do that?

Anahit: Well, first of all, they need to read the blog. It's long, but I hope it's worth it, and it has some nice pictures, so some benefits. Then they can reach me on LinkedIn, first name, last name. And Twitter, again, first name, last name. And yeah, I think that's about it.

Jeremy: Awesome. And then the blog at solita.fi. And then you've got a really good talk that you gave, I think it was at AWS Community Day, maybe Stockholm. So, then that.

Anahit: Oh my God, it's been over a year already. That was the last trip that I made before ... it's horrible.

Jeremy: Isn't that crazy? I know. It's been a year, it's been a year.

Anahit: It's been a year.

Jeremy: We just celebrated or, celebrated I guess ... there was just a year passed for ServerlessDays Nashville which was the last conference that I went to in person. So I am looking forward to doing that again and bumping into people and talking to people about this in the hallway because those are the best conversations. So-

Anahit: For sure.

Jeremy: ... anyways, I will take all of this stuff, your Twitter, LinkedIn, blog, the two blog posts that you wrote about this, as well as that video talk from Community Day in Stockholm. I will put all that into the show notes. Anahit, thank you again so much.

Anahit: Thank you so much, Jeremy. It was so much fun.