Episode #16: Serverless Workflows using Step Functions with Rowan Udell

September 30, 2019 • 42 minutes

Jeremy chats with Rowan Udell about the benefits of state machines, the core functionality and advanced features of AWS Step Functions, and some recommendations for building smarter serverless workflows.

About Rowan Udell

Rowan Udell is Cloud Practice Director at Versent, an AWS Premier Consulting Partner in the Asia Pacific region. Working with customer and internal Versent teams, he helps them deliver change at scale and speed using serverless and AWS native services. He co-authored the AWS Administration Cookbook and has published video courses on AWS.

Twitter: @elrowan
Blog: blog.rowanudell.com
AWS APN Ambassador: https://aws.amazon.com/partners/ambassadors/ambassador-apac/
AWS Administration Cookbook: https://www.packtpub.com/virtualization-and-cloud/aws-administration-cookbook (2nd edition coming out soon)

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly and you're listening to Serverless Chats. This week, I'm chatting with Rowan Udell. Hi, Rowan. Thanks for joining me.

Rowan: Hey, Jeremy. Thanks for having me.

Jeremy: So you are the technical director at Versent which is in Sydney, Australia. And you are also an AWS APN Ambassador. So why don't you tell the listeners a little bit about yourself, what Versent does, and actually, I'm kind of interested in this AWS APN ambassador thing, if you could tell us about that as well.

Rowan: Yeah, sure. So Versent is a premier consulting partner here in Australia and, you know, we work with a lot of enterprise customers, really helping them do cloud the right way. You know, if I was to describe us to a wider audience, we kind of want to see ourselves as the heart specialists for AWS. You know, when you have a heart problem, you go to see a specialist. You don't just go to any old doctor and we want to be that for AWS. In my role as technical director, I work with a lot of teams that are using Step Functions and other serverless technologies, building out applications on AWS. I'm part of the APN Ambassador Network, which is a new program that AWS has started up for consulting partners. APN stands for the AWS Partner Network, and what they've done is get together a group of like-minded partners in one room so that we can kind of give feedback to AWS but also help them get new features, services, technologies out into a wider audience, you know. And so a big part of what we do is things like coming on this podcast, but also doing blogs, doing meetups, speaking at events and things like that. And so they really kind of encourage us and enable us to get the word out there and try and make AWS easier to use for everybody, not just consulting partners.

Jeremy: Great. And what about your background? Where do you come from?

Rowan: Yeah, so, I mean, I've been working in IT for longer than I'd like to admit now, you know, mainly with AWS, especially over my last four years here at Versent, just working purely with AWS. Before that, I was leading developer teams, working at some startups, things like that. And, you know, when the cloud came along, I really kind of jumped on board because I was sick of administering servers in the first place. And that's probably another reason why you see me online talking a lot about serverless stuff.

Jeremy: Awesome. Alright, so I have had a number of episodes where we've gone down sort of the technical path of some subjects. Some of them we've talked more about the business things, but I know that you do a lot of work with Step Functions, both obviously in your role as a consultant, but also, I think, just in what you're doing out there working with teams and what you're seeing. So I want to talk about Step Functions today, AWS Step Functions. I think this is a really fascinating service that they offer. I think it's underutilized by a lot of people. I mean, it certainly has good use cases and use cases that may not be the best for it. But maybe, if people don't know, what are AWS Step Functions?

Rowan: Yes, so AWS Step Functions is the name of Amazon's state machine as a service offering that they have. State machines are sometimes called finite state machines, and this just makes them much more easily consumed and also integrated with your serverless applications.

Jeremy: All right, so let's start high level here. What are state machines?

Rowan: Yeah, so a lot of people get turned off by some of the terminology. But I mean, really, it's just a lingo or, you know, some vocabulary that's actually been around for quite a while. State machines are nothing new. A state machine is really just a mathematical way of modeling an application. Most people kind of understand what that means conceptually, but what it means in reality is you can describe your application as a mixture of your inputs, the states that your application can be in, and the transitions between those states. And what this forces an application designer or really a developer to do is to think really concisely and clearly about what their application does and what it could do next, and it kind of forces you to do that upfront, which I think is something that's really valuable and often overlooked.
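To make "inputs, states, and transitions" concrete, here is a minimal Python sketch of a finite state machine. The order workflow, its state names, and its event names are invented for illustration and do not come from the conversation.

```python
# Hypothetical order workflow: each (state, event) pair maps to exactly one next state.
TRANSITIONS = {
    ("pending", "pay"): "paid",
    ("pending", "cancel"): "cancelled",
    ("paid", "ship"): "shipped",
}


def next_state(state: str, event: str) -> str:
    """Return the state reached from `state` on `event`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None


def run(events, start="pending"):
    """Feed a sequence of events (the inputs) through the machine and return the final state."""
    state = start
    for event in events:
        state = next_state(state, event)
    return state


print(run(["pay", "ship"]))
```

Because every allowed transition is listed up front, anything not in the table is rejected rather than silently handled, which is the "think about it upfront" property described above.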

Jeremy: Yeah, and one of the things that I really like about state machines and specifically Step Functions in the serverless world is that ability to do function composition. Because I think that's one of the things that many people are confused about. Like how do I have function X talk to function Y, right? So state machines are the glue in between those, right?

Rowan: Yeah, definitely. And you bring up a really good point because we see a lot of people out there discussing on forums and in Slacks about like, “Should I have one function call another function directly?” and usually someone will jump in and say, “Oh no, you should never do that.” But obviously, then that's going to complicate things a little bit. And in some ways, if you're using Step Functions that problem goes away because you're able to link two functions together, you know, which allows each one of those to do one thing well, and you don't have to worry about calling them directly and coupling those two functions to each other. At the same time, you don't have to introduce things like SNS topics or SQS queues in between those functions. You know, in my opinion, it's kind of the best of both worlds.

Jeremy: Yeah, I mean, and that's where we're talking about orchestration versus choreography, right? So and that's one of the things that if you're using SNS or you’re using EventBridge or using some other communication channel or messaging bus, you can decouple the applications. Step Functions are a way of sort of creating coupling. But it's a different type of coupling, right? Because a function can run on its own outside of Step Functions or it can be part of several Step Functions, right? Different steps within that could reuse that same piece of logic. It's just a way of kind of gluing all that stuff together, like you said.

Rowan: Yeah, it is a form of coupling, but it's a loose coupling. You know, the function that is calling or being called doesn't know that it's being called by another function, to your point where it might be part of a complicated workflow or it might not. It really doesn't matter to the implementation function.

Jeremy: So what about the visualization of these things, too? That's another sort of important piece, right?

Rowan: Yeah, when it comes to the visualization, I think that's another thing that Step Functions provides that's really valuable, especially for developers getting started with state machines that maybe don't have a lot of experience with it, is it has a really nice way of rendering your state machines so that you can clearly understand how states are connected, what transitions are in play. And you know, this really makes things like troubleshooting and designing a lot easier. And it's this kind of thing, which, if you're trying to use state machines in your applications as I did, years ago, you had to do that yourself. There was no easy way. So generally what you found yourself doing was, okay, I'll go over here and I'll design my state machine in isolation. I might use a diagramming tool, and I'll draw the circles and the lines that connect them, and then I'll go and implement that in code. But there's a little bit of a leap there. Well, I might think I've implemented it in code, but maybe I didn't. Maybe I got something wrong, and that's going to make my life later on a lot harder, because what I think it is and what it actually is are two completely different things. You know, there's a quote I love, which is, at the end of the day, state machines are a way of modeling your application, and we often say all models are wrong, but some models are useful. Step Functions make it easier to get that model more right. I wouldn't say correct, but it gets it a lot closer.

Jeremy: Yeah, and those are certainly some of the benefits that you see — just that ability to troubleshoot faster, like you said, and sort of encapsulating that application logic. Are there any other benefits though that sort of come to mind?

Rowan: Yeah. Look, I think with Step Functions, it really forces the developers to break up their applications into discrete steps — things that most developers already do in their head, it forces them to articulate it and write it down. And that obviously makes it easier for them to communicate that with another developer, and that other developer might be them in six months time. And so, by forcing them to break these things down, you can see where there's a lot of parallels with serverless applications in general. We talked about having functions do one thing and one thing well, and at the end of day, I guess what this is forcing developers to do is make all of their implicit models that they have in their head and really make them explicit and define them and say, “Well, you can only do this thing after that thing” or something like that.

Jeremy: And so the other thing, too, is I think people look at Step Functions and they think that they're extremely complex. But really, they're not actually that complex. I mean, they're powerful, but from a complexity standpoint, they're quite easy to implement actually.

Rowan: Yeah, if you look at most Step Functions in the console and you can see that visualization of them, they are really quite simple. There's a built-in limitation to them that I think makes them quite powerful because it really forces you to do what you said you're going to do and not kind of make things up on the fly because it just won't work. And I think a lot of people are turned off by this perceived complexity, partly because of the vocabulary that's used. Like I said, traditionally, they've been very academic and they've been around for a long time. So people kind of look at that and go, “It's computer science. I don't have time for that,” or something like that. And really, it's not that bad. And like most things, if you start off simple, you have some simple use cases in mind, you can then iterate on that and build up to high levels of complexity, and there's been some nice features released that can really help you with that. At the end of the day, I wouldn't say Step Functions are required for every serverless application out there, but they really do suit themselves well to complicated workflows, things that need a higher degree of traceability or auditability. For some of those really important workloads, they can really help you and give you confidence in the system.

Jeremy: Right. Alright. So let's get into a little bit more detail here. So you mentioned inputs, states, transitions, that sort of stuff. So let's talk about sort of the states, tasks, activities, those sort of things as part of the, well, I guess what we would call maybe the core functionality of Step Functions. So why don't we start with that?

Rowan: Yeah, sure. Yeah.

Jeremy: So if you give us an overview of that, that'd be great.

Rowan: Yeah. So, look, most of what I'm going to tell you about today is defined in the States Language spec. I think it's states-language.net/spec.html. This is a document produced by AWS, but if you actually read it, there's nothing specific to AWS in there other than the name of it. And really, when you talk about the core functionality of Step Functions, you're going to talk about an execution of your state machine and an execution has an input, and that's just a JSON payload that comes into it. And then once you're inside that execution, your state machine is going to have a number of different states, and those states can transition to other states as you've defined in your state machine definition, which at the end of the day is just a JSON object. The most interesting part of the state machine is going to be those states that you can have and in Step Functions terms, we talk about these different kinds of states that you can have that you can define using the language. The main one is called a task state, and as the name suggests, this is where you go off and do something. And then there's a few other kinds of states that you use to kind of control how your Step Function flows. So you have a choice state that can direct you down to different following states.

Jeremy: And those choice states can all be done — those are all based off of the output from the previous state.

Rowan: Correct, correct, because every state has an input on its own that it uses to then do something with and again, this is kind of the attraction for me with state machines, is that you know what it's gonna do based on its inputs, and it forces you to kind of define that up front.

Jeremy: Right.

Rowan: Some of the other states that you can use, that you will have to use, are the end states as we talked about, and these are special because you can't go to another state after them. And you have two kinds, as you'll realize if you think it through: a fail state or a succeed state, and this is just about reporting the status of that execution back up to Step Functions so you can see it in the console, or wherever you're monitoring it.
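As a rough illustration of the pieces mentioned so far (task, choice, succeed, and fail states), here is a small state machine definition built as a Python dictionary and serialized to the JSON that the States Language expects. The workflow, the state names, and the Lambda ARN (placeholder region and account number) are assumptions made up for this sketch.

```python
import json

# Hypothetical stock-check workflow; names and the ARN are placeholders.
definition = {
    "Comment": "Illustrative stock-check workflow",
    "StartAt": "CheckStock",
    "States": {
        "CheckStock": {
            "Type": "Task",  # where the work gets done
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-stock",
            "Next": "InStock?",
        },
        "InStock?": {
            "Type": "Choice",  # branches on the output of the previous state
            "Choices": [
                {"Variable": "$.inStock", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "OutOfStock",
        },
        "Done": {"Type": "Succeed"},
        "OutOfStock": {
            "Type": "Fail",
            "Error": "OutOfStock",
            "Cause": "The requested item is unavailable",
        },
    },
}


def undefined_targets(defn):
    """Return state names that are referenced but never defined."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        targets.update(t for t in (state.get("Next"), state.get("Default")) if t)
        targets.update(choice["Next"] for choice in state.get("Choices", []))
    return targets - set(states)


print(json.dumps(definition, indent=2))
print("undefined targets:", undefined_targets(definition))
```

The `undefined_targets` helper is not part of any AWS tooling; it is only a quick local check that every transition points at a state that exists.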

Jeremy: Okay, makes sense.

Rowan: Another state that you can have is what's called a pass state. And this is a relatively simple state that you use to either inject certain values into your input and output payloads, or often you'll use it for debugging just to try and, you know, kind of explain, okay, why are we transitioning from this state to this state and what is that being done for? Another state that's often used is called the wait state, and this is where you might build in a delay into your system and again, because you're forced to declare everything, you actually have to tell whoever's looking at your execution, “Yep. I'm just waiting for something now.” You know, it's not kind of hidden away inside the code.

Jeremy: But actually, the wait state is incredibly powerful because there have been several people, I think Paul Swail and Yan Cui and a couple others, who have used that wait state as a way to set, like, a timer (a dynamic timer, basically), you know, so that if you want to execute, I don't know, some email or you’re scheduling emails or something like that, you could create all of these different wait states in order to schedule a job later in the future, and it's much more reliable than something like a DynamoDB TTL or something like that.

Rowan: Totally. Yeah, and you know, this is a really good example of how a state machine is made up of these relatively simple states, but when you combine them together in this way, you can come up with some really complicated workflows that are really powerful, yet still easy to understand.
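A sketch of the "dynamic timer" idea, under assumptions: a pass state injects a default value, a wait state pauses until a timestamp taken from the execution input, and a task then sends the email. The state names, the `sendAt` field, and the ARN are hypothetical; `TimestampPath`, `Result`, and `ResultPath` are fields from the States Language.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fragment of a state machine's "States" section.
states = {
    "AddDefaults": {
        "Type": "Pass",
        "Result": {"channel": "email"},  # injected into the payload
        "ResultPath": "$.defaults",
        "Next": "WaitUntilSendTime",
    },
    "WaitUntilSendTime": {
        "Type": "Wait",
        "TimestampPath": "$.sendAt",  # wait until the time given in the input
        "Next": "SendEmail",
    },
    "SendEmail": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:send-email",
        "End": True,
    },
}


def execution_input(to, delay, now=None):
    """Build the input for one scheduled email, with sendAt as an ISO 8601 UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    send_at = now + delay
    return {"to": to, "sendAt": send_at.strftime("%Y-%m-%dT%H:%M:%SZ")}


print(execution_input("someone@example.com", timedelta(hours=1)))
```

Each scheduled job would be its own execution, started with an input like the one `execution_input` builds.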

Jeremy: And I actually think that's one of — speaking of combining things — the other really cool thing is obviously parallel branches, right? So what can you do with that?

Rowan: Yeah, so look, this is another one that I find myself using a lot where, just as you would if, you know, you as a human were performing a task, you might have a couple of different things going at once and again, this is where you can really leverage some of the benefits of a serverless platform and the serverless approach, which makes it really easy to parallelize your workloads and really get some time savings and some efficiency out of it.

Jeremy: And what about this new thing “dynamic parallelism” that has just come out. Any thoughts on that?

Rowan: Yeah. Look it's a slightly more advanced feature. You know, it's been out for a couple of weeks now and we're using it. We've run into a few issues using it at scale, so it is still relatively new, but it does kind of fit nicely with how I think a lot of users really wanted to use Step Functions. I think I saw a tweet the other day saying how this was like the most requested feature for Step Functions. So it's really cool to see them adding that to the language and really kind of putting these additional advanced features in.
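For readers who have not seen it, dynamic parallelism is expressed with a map state, which runs the same sub-workflow once per element of an array in the input. The sketch below assumes a hypothetical image-resizing job; the field names (`ItemsPath`, `MaxConcurrency`, `Iterator`) are the ones used when the feature launched, and the ARN is a placeholder. The small `simulate_map` function is only a local analogy for how results come back as an array in input order.

```python
# Hypothetical map state: run "Resize" once for every entry in $.images.
resize_all = {
    "Type": "Map",
    "ItemsPath": "$.images",
    "MaxConcurrency": 10,  # cap on how many iterations run at once
    "Iterator": {
        "StartAt": "Resize",
        "States": {
            "Resize": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:resize-image",
                "End": True,
            }
        },
    },
    "ResultPath": "$.resized",
    "End": True,
}


def simulate_map(items, worker):
    """Local analogy only: apply `worker` to each item and collect results in input order."""
    return [worker(item) for item in items]


print(simulate_map(["a.png", "b.png"], lambda name: name.replace(".png", "_small.png")))
```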

Jeremy: Cool. Alright, so let's talk about tasks because I think of all these other things — choice states and pass states and all those other things that you can use — really where the work gets done is in these individual tasks, right? So probably, I mean, I even when I think of it, I think 99% of the tasks I use when I use Step Functions are Lambda functions, right? Basically, I'm calling some discrete piece of business logic and doing that. So that's obviously the most common. But there are a lot of other services you can integrate with, right?

Rowan: Yeah, definitely. You know, I think you're right. Lambda is by far the most used. Obviously you can do anything in there within a 15 minute time limit, but there's a lot of other ones now that are really useful and especially some of these more advanced callback patterns coming in, they're seeing more and more use. The ones I use a lot are things like the SNS task, which allows you to send off notifications, and this could be great for things like Slack integration, kind of letting people know whereabouts in the state machine the execution is. Another good example is using Fargate or ECS to do some longer-running container base tasks. And probably my favorite one, which has just been announced relatively recently, is actually calling out to other Step Functions. This is a really nice model where it allows you to kind of encapsulate some complicated workflows in another state machine and from the parent state machine, you don't need to worry about those details. Before this feature was released, you used to kind of have these nested state machines, and they did look pretty complicated when you kind of opened them up. Now you can kind of collapse that all down to just a single state and just say, “You know what? Go off and do that job. Don't worry about telling me the details. Just let me know when you're finished.”
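Two of the integrations mentioned here can be sketched as task states: publishing to SNS, and starting another state machine and waiting for it to finish. The resource strings are the service integration ARNs Step Functions uses for these calls; the topic ARN, state machine ARN, field names, and state names are placeholders invented for the example.

```python
# Hypothetical task states using service integrations instead of Lambda.
states = {
    "NotifySlack": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {
            "TopicArn": "arn:aws:sns:us-east-1:123456789012:workflow-updates",
            # The ".$" suffix means the value is read from the state's input.
            "Message.$": "$.statusMessage",
        },
        "ResultPath": None,  # serialized as JSON null: discard the result, keep the input
        "Next": "RunChildWorkflow",
    },
    "RunChildWorkflow": {
        "Type": "Task",
        # ".sync" makes the parent wait until the child execution completes.
        "Resource": "arn:aws:states:::states:startExecution.sync",
        "Parameters": {
            "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:child-workflow",
            "Input": {"orderId.$": "$.orderId"},
        },
        "End": True,
    },
}

for name, state in states.items():
    print(name, "->", state["Resource"])
```

From the parent's point of view the whole child workflow collapses into the single `RunChildWorkflow` state.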

Jeremy: Yeah, that's very powerful, and, of course, recursion in code, especially if you can write a recursive function, can be very, very powerful, sometimes very complex, complicated, difficult to understand, but when you make it work, I know I always get excited when I'm like, “Oh, my goodness, it actually works.”

Rowan: Yeah, not without its challenges.

Jeremy: Exactly. So what about some of the long-running activities? You mentioned the callback pattern stuff, but that's relatively new. So what are we talking about with just long-running activities?

Rowan: Yeah. So in the past, you used to have to use something that Step Functions referred to as “activities,” and this was a way to kind of farm out that long-running work to workers that were usually going to be based on an EC2 instance, or maybe even in the external system. And, you know that involved a lot of polling. You know, those workers would poll for work and eventually work would appear in that queue. And I think this pattern is less relevant now than it used to be. You know, it used to be the only way to do this. But now with this callback pattern where you send off the work and you supply a task token and you say, “Hey, when you're done, let me know that this task is finished and then I'll continue on the execution.” And that doesn't involve any kind of polling or any kind of long running work. You can really get it down to the point where your workers only work when they need to, not just in case.
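A minimal sketch of the callback pattern, under assumptions: the state machine puts a message on an SQS queue that includes the task token, and a worker later reports the outcome. The queue URL, message shape, and function names are hypothetical. `send_task_success` and `send_task_failure` are operations on the boto3 Step Functions client, which is passed in here (for example `boto3.client("stepfunctions")`) so the sketch has no AWS dependency of its own.

```python
import json

# Task state that pauses the execution until the token is returned.
wait_for_worker = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue",
        "MessageBody": {
            "taskToken.$": "$$.Task.Token",  # token comes from the context object
            "input.$": "$",
        },
    },
    "Next": "Continue",
}


def report_result(sfn_client, message_body, work):
    """Run `work` on the message input, then tell Step Functions how it went."""
    message = json.loads(message_body)
    token = message["taskToken"]
    try:
        result = work(message["input"])
    except Exception as exc:
        sfn_client.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
```

The worker only runs when a message arrives, so nothing has to poll Step Functions for work.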

Jeremy: Yeah. I mean, and that's the other thing, too. Ben Kehoe had a great post about the task tokens because that's just one of those things where it was so inefficient to be polling every 30 seconds or every 10 seconds. And if it was an important job, you might be polling every one second in order to check for something to be done. So anyway, I really do like that new callback pattern. It obviously is much more efficient. And what is it, like a year or something like that? How long can it wait?

Rowan: Yeah, so executions can sit there for up to a year, you know? And they don't cost you anything while you're doing that. Their pricing model doesn't care how long it's been running for. So yeah, really powerful for some of those really long-running workloads.

Jeremy: Yeah. I hope you don't have a task that takes a year to run, but if you did, you'd be okay. Maybe that sounds a little complex as we kind of talked through those things. I think, though, if you look at it and you go into the Step Functions console even and you just use the little visual builder and you can build some Step Functions yourself, I think these sort of make sense. But let's go to the advanced side of things, because one of the things we know, and I think I've said this 1000 times on this podcast already, but everything fails all the time, right? So something is going to break. It's not going to be your fault. It’s going to be a network issue. It's going to be that you can't connect to a third-party API. It's going to be SQS hiccups or something like that, and it doesn't submit the job, and then something fails. So one of the really, really cool things, especially about complex workflows that Step Functions handle, is the ability to do error handling and it’s sort of built-in for you. So let's talk about that a bit.

Rowan: Yeah. So the really cool thing about when you're configuring your tasks, you can configure things like, okay, how will this behave on a failure? And obviously the simplest thing is just fail and kind of cancel out the execution. But we're seeing more and more teams, if the response to a failure is something that's kind of predetermined, like oh, you know, I should try again. You can configure these things in your state machines really easily now, and it's a lot easier than if you had to write all that code yourself. Because if you've ever tried to do kind of elaborate error handling inside a Lambda function, it can get a little bit tricky as to what happens when and it often ends up being a lot larger than the actual code that's doing the work. And so this allows you to kind of take that out. And you can even use things like the choice state that we talked about before to say “OK, on a particular kind of failure, I'm going to go off and do this other stream of work and kind of address that problem.” And again, you don't have to put that in the Lambda task that's actually receiving the error. You can kind of hide that away from there, and that enables you to keep your functions nice and small.
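As a sketch of what configuring failure behavior looks like, the task below declares a retry policy and two catchers in its definition rather than in the function code. The error names `PaymentGatewayUnavailable` and `CardDeclined`, the state names, and the ARN are made up; `Retry`, `Catch`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`, and `States.ALL` come from the States Language. The helper simply works out the waits the retry policy implies.

```python
charge_card = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
    "Retry": [
        {
            "ErrorEquals": ["PaymentGatewayUnavailable"],  # hypothetical error name
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        # A known failure goes down its own path, with the error kept in the payload.
        {"ErrorEquals": ["CardDeclined"], "ResultPath": "$.error", "Next": "NotifyCustomer"},
        # Anything else falls through to a generic failure state.
        {"ErrorEquals": ["States.ALL"], "Next": "Failed"},
    ],
    "Next": "ShipOrder",
}


def retry_delays(retrier):
    """Seconds waited before each retry: the interval grows by BackoffRate each attempt."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


print(retry_delays(charge_card["Retry"][0]))
```

The Lambda function behind this state can stay small: it raises the error, and the definition decides whether to retry or reroute.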

Jeremy: Yes, so also, you have this idea of things like the saga pattern where you have, you know, maybe a job completes or some step completes successfully, and then it goes on to the next step. And then that step fails, right? Like you can't charge a credit card or something happens, and it has to go and reverse other states and that's all stuff that you can build in with Step Functions as well, and it gets complex and it gets complicated. But that is possible.

Rowan: Yeah, definitely. And as much as I always prefer a simpler solution, at the end of the day, sometimes you have to do those complicated ones because that's where the value is and, you know, being able to very clearly map out, okay, I've done these steps and it's failed, so now I know which steps I need to undo because I've just defined them in my state machine. So it becomes really clear and easy to troubleshoot. And again, if you have any failures around there, there’s some nice integrations in the console, for example, to see okay, if you have a Lambda task and it's failed, it will actually link you directly to the function that failed. It'll actually pull the exception out of the Lambda execution result and show that to you in your Step Functions console, and it'll even give you a link to the Lambda logs for that particular execution. So, as much as I don't want to be using the console regularly, if I am troubleshooting, it is a relatively streamlined process compared to, like you said in those more complicated workflows, trying to correlate across different Lambda functions exactly what went wrong, where and why. It could really help for those kinds of things.
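One way to picture the saga pattern in a state machine, as a sketch only: each forward step has a catcher that routes to the compensating steps for whatever has already succeeded. The three-step order flow, every state name, and the `arn()` helper that builds placeholder Lambda ARNs are invented for this illustration.

```python
def arn(name):
    """Placeholder Lambda ARN for a hypothetical function."""
    return f"arn:aws:lambda:us-east-1:123456789012:function:{name}"


CATCH_ALL = ["States.ALL"]

saga = {
    "StartAt": "ReserveStock",
    "States": {
        "ReserveStock": {"Type": "Task", "Resource": arn("reserve-stock"), "Next": "ChargeCard"},
        "ChargeCard": {
            "Type": "Task",
            "Resource": arn("charge-card"),
            # Charging failed: only the stock reservation needs undoing.
            "Catch": [{"ErrorEquals": CATCH_ALL, "ResultPath": "$.error", "Next": "ReleaseStock"}],
            "Next": "ShipOrder",
        },
        "ShipOrder": {
            "Type": "Task",
            "Resource": arn("ship-order"),
            # Shipping failed: refund first, then release the stock.
            "Catch": [{"ErrorEquals": CATCH_ALL, "ResultPath": "$.error", "Next": "RefundCard"}],
            "Next": "OrderComplete",
        },
        "RefundCard": {"Type": "Task", "Resource": arn("refund-card"), "Next": "ReleaseStock"},
        "ReleaseStock": {"Type": "Task", "Resource": arn("release-stock"), "Next": "OrderFailed"},
        "OrderComplete": {"Type": "Succeed"},
        "OrderFailed": {"Type": "Fail", "Error": "OrderFailed", "Cause": "Rolled back"},
    },
}


def path_from(defn, start):
    """Follow Next links from `start` until a terminal state, returning the names visited."""
    path, name = [], start
    while name is not None:
        path.append(name)
        name = defn["States"][name].get("Next")
    return path


print(path_from(saga, "RefundCard"))
```

Because the compensation path is part of the definition, the steps that need undoing after a failure are visible in the same diagram as the happy path.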

Jeremy: Yeah, definitely. Alright. So we talked about the parallel state a little bit, and we didn't really get into too much detail in terms of what you would do with parallelism. So what's an example of that? Because you usually think about Step Functions as this step, then that step, then that step, and you’re passing data through those different steps. But what's really cool about when you run parallel jobs is that one, obviously, you can use it for things like fan out and some more complex sort of workflows like that, but Step Functions also aggregate all those results for you, right?

Rowan: Yeah, definitely, and it can work really well if you have different jobs that can be performed in parallel, but they might take different amounts of time to finish, and you can bring all those results back and then make some decision based on the aggregate results of those. And you might even say, “Hey, I can accept some failures in this process and I don't have to fail the whole thing.” And when you combine that with this new dynamic parallelism, and with the ability to nest Step Functions, you’re going to end up in a situation where you can have a really simple, kind of overarching Step Function that can call out to lots of other Step Functions in parallel. And the potential is there for processing data, and things like that are going to be really, really cool once people get comfortable with these. You know, I didn't mention it before, but there's also integrations for tasks to call out to things like SageMaker and Glue and some of these really data-intensive services from AWS. And so being able to pull together these complicated ETL workflows is actually going to be quite simple to implement and maintain going forward, again, once people get comfortable with things like Step Functions.
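A sketch of a parallel state and of how its results come back. The two branches (a fraud check and a stock check), their names, and the ARNs are hypothetical. A parallel state's output is an array with one entry per branch, in the order the branches are declared, regardless of which finishes first; the `run_parallel` function is a local analogy for that behavior using threads, not how Step Functions is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor

run_checks = {
    "Type": "Parallel",
    "Branches": [
        {"StartAt": "FraudCheck", "States": {"FraudCheck": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fraud-check",
            "End": True}}},
        {"StartAt": "StockCheck", "States": {"StockCheck": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:stock-check",
            "End": True}}},
    ],
    "ResultPath": "$.checks",  # the array of branch outputs lands here
    "Next": "Decide",
}


def run_parallel(branches, state_input):
    """Local analogy: run every branch on the same input, return results in branch order."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch, state_input) for branch in branches]
        return [future.result() for future in futures]


def slow_fraud_check(order):
    time.sleep(0.05)  # finishes last, but its result still comes first
    return {"fraud": False}


def fast_stock_check(order):
    return {"inStock": order["qty"] <= 5}


print(run_parallel([slow_fraud_check, fast_stock_check], {"qty": 2}))
```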

Jeremy: Yeah, so then another sort of advanced thing, and maybe it's not that advanced, but I think it's pretty cool, is that when you are sending data into a particular Lambda function or into a task for whatever, you can actually manipulate the shape of that data. So let's talk a little bit about how that works.

Rowan: Yeah, sure. So this is another one of those features which, when you get started out with Step Functions, you totally don't need to worry about. You could just have input come into your state, then become some kind of output, whatever your Lambda function returns, and then that becomes the input for your next state. And that's a really simple model. And it works really well for most workflows. You do have to worry about the kind of total size. You don't want to put large objects in there. You know, that's where you might put a reference to something in DynamoDB or S3. But once you get to the stages where you may be doing things in a more complicated fashion, you can pick and choose what parts of the returned payload from your Lambda function or whatever service you're calling in your task you actually want to keep, and then you might only pass a subset of that on to the next state, whatever it needs to do its job. And that way, using this kind of JSONPath syntax, where you start at a dollar sign that represents the root of your JSON object, you can just say “Yep. I want this particular property that will get sent to the next state.” Or you can even do it on the input passing as well to say, “Well, I'm going to give you a large object. But, you know, since I'm running a whole lot of things in parallel, you don't need the whole object. I'm just gonna give you a subset,” and then you can do what you need to do on that. And as you said, you can then reaggregate that at the end of those particular states.
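To show how the input and output of a state can be shaped, here is a deliberately simplified local model of three States Language fields: `InputPath` selects what the task sees, `ResultPath` says where the task's result is placed in the original input, and `OutputPath` selects what is passed to the next state. The `select` helper only understands `$` and dotted keys such as `$.order.total`, and `ResultPath` is limited to a single top-level key; real Step Functions supports much more. The order payload is invented.

```python
import copy


def select(path, doc):
    """Resolve a simplified path: "$" is the whole document, "$.a.b" walks dotted keys."""
    if path == "$":
        return doc
    node = doc
    for key in path[2:].split("."):
        node = node[key]
    return node


def run_state(state, state_input, task):
    """Mimic InputPath -> task -> ResultPath -> OutputPath for one task state."""
    effective_input = select(state.get("InputPath", "$"), state_input)
    result = task(effective_input)

    result_path = state.get("ResultPath", "$")
    if result_path == "$":
        combined = result  # default: the result replaces the input entirely
    else:
        combined = copy.deepcopy(state_input)
        combined[result_path[2:]] = result  # simplified: top-level key only

    return select(state.get("OutputPath", "$"), combined)


state = {"InputPath": "$.order", "ResultPath": "$.payment", "OutputPath": "$"}
payload = {"order": {"id": 1, "total": 20}, "customer": {"email": "a@example.com"}}

# The task only ever sees the "order" part of the payload.
print(run_state(state, payload, lambda order: {"charged": order["total"]}))
```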

Jeremy: Yeah, and this is super important when you're integrating with some of the other services, right? So if you want to send something to DynamoDB, you really want to be able to control the shape of that, or you need to interface with Glue or something like that. Step Functions will make the API call for you, but you still are responsible for the shape of that data.

Rowan: Yeah, another good example is when you're doing those notifications via the SNS integration, I only want to pass particular subsets of my input object to, say, the message field or the body of that notification message, so they don't need to see all the internals. I can actually send them what is relevant just to them. And that's really nice, because again, all of that complexity is hidden from the service that you're actually calling. And it's really in the place that needs to be, which is the state machine, which knows about all these things. That is kind of the core of your application.

Jeremy: And the other thing, too, so combining the data afterwards, so when you execute maybe a couple of parallel tasks, maybe you shape the data differently in terms of what goes into each one of those tasks, when that data comes back, you not only have access to what those return, but you also have access to the entire state of the application, right?

Rowan: Yeah, yeah, it gets pretty complicated, you know, when trying to bring these bits and pieces all back together into a coherent message. But the reality is that complexity was already there. You just were kind of ignoring it and hoping for the best. This kind of forces you to think about it right at the beginning and go, “Okay, well, what will I do with the various returned values?” And really kind of makes you do it now rather than after it all goes horribly wrong and you have to, you know, pick the pieces up.

Jeremy: Definitely. All right. So let's move on to some recommendations here. So you work with Step Functions all the time. So what are some of your best recommendations for people using Step Functions?

Rowan: Yeah, sure. One of the things I recommend to the teams I work with that a using Step Functions is to take the time at the initial state of their execution and really setting up their payload, you know? So this might be where you pull in variables from parameter store or something from databases, and then use that in your payload for all of the states that follow in that particular Step Function. And the reason why you do this is that it stops all of those other states making their own calls to those configuration services, all those data services. Now you know you can't obviously put things like secrets in there, but you can put references to where these things are so that you kind of minimize the touch points or coupling for those states that follow afterwards and everything you need to know. It's kind of like dependency definitions. You know, you do very much of the start of your execution, and that way it's there for all the other states to use. The other thing I really like about that approach is that kind of mimics, you know what you do in things like functional programming where you say, “Okay, the behavior of my code is determined by the inputs.” And so you're really calling out these are my inputs, and I find that makes it easier to test the various states in your Step Function, you know? So if you have a Lambda function in there and it's always gonna be getting the same inputs and everything it needs is in those inputs rather than in things like, you know, environment variables or external data stores, then it makes testing those functions in isolation a lot easier. So, you know they're gonna work in the context of the state machine that it will eventually exist in. As I mentioned earlier, you know, I think it's really important to bubble up errors into the state machine as much as possible. So in the case of Lambda, this is where you know you don't want to put too much complex error handling into your Lambda. Obviously, some make sense. 
But at the end of the day, if the function has a problem, it should just let the state machine know that “Hey, I've got a problem.” And that's gonna make troubleshooting quicker and easier because you're gonna find the core problem in a short amount of time. The other thing, which I know I don't do enough of and I see people kind of learning the hard way, is setting sensible timeouts on your states, you know, particularly for these kind of callback-based tasks. If there's only a reasonable amount of time that it should take to run, and it could be hours, days, whatever, then you should call that out in the actual state definition. You know, what I generally see is that in development, it's fine because you're watching every single execution. And if something goes wrong, you see it and you fix it. Once you've deployed this into production and you're not looking at every single execution, or maybe you can't even see every single execution, that's where timeouts will really save you. You know, and kind of let people know that, “Hey, there's something going wrong here.” Another thing that's really cool about breaking your application up into these discrete states is, especially in the context of Lambda tasks, it allows you to set an IAM role just for that specific task. So rather than saying, “Hey, this is a role that my application needs to run,” you can say, “Well, this particular part of my application only requires these kinds of actions on these resources”, and it allows you to define that there rather than giving, you know, that kind of common set of permissions to your entire application. So this is really kind of the best possible least privilege scenario that I think you can get.
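
To make these recommendations a little more concrete, here is a minimal Python sketch that builds a state machine definition as a dictionary and prints it as JSON. It is only an illustration: the state names, the ARNs, the `table_name` setting, and the `handler` function are hypothetical and were not mentioned in the conversation. The first state loads configuration into the payload once, each task sets `TimeoutSeconds`, and the Lambda raises its error so that the state machine's `Catch` can deal with it.

```python
import json

# Hypothetical ARNs, used only for illustration.
LOAD_CONFIG_ARN = "arn:aws:lambda:us-east-1:123456789012:function:load-config"
PROCESS_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-order"

definition = {
    "StartAt": "LoadConfig",
    "States": {
        # First state: fetch configuration once and attach it to the payload.
        "LoadConfig": {
            "Type": "Task",
            "Resource": LOAD_CONFIG_ARN,
            "ResultPath": "$.config",
            "TimeoutSeconds": 30,
            "Next": "ProcessOrder",
        },
        # Later states read $.config instead of calling config services.
        "ProcessOrder": {
            "Type": "Task",
            "Resource": PROCESS_ARN,
            "TimeoutSeconds": 60,
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.error",
                    "Next": "ProcessingFailed",
                }
            ],
            "End": True,
        },
        "ProcessingFailed": {
            "Type": "Fail",
            "Error": "ProcessingFailed",
            "Cause": "Order processing raised an error",
        },
    },
}


def handler(event, context):
    """Hypothetical Lambda behind ProcessOrder; all it needs is in the input."""
    table_name = event["config"]["table_name"]  # set by LoadConfig, not env vars
    if "order_id" not in event:
        # No elaborate local handling: raise and let the state machine catch it.
        raise ValueError("order_id missing from input")
    return {"order_id": event["order_id"], "table": table_name}


print(json.dumps(definition, indent=2))
```

Because `handler` depends only on its input, it can be called directly in a unit test with a hand-written event, which is the testing benefit described above.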

Jeremy: Yeah, I'm a huge fan of least privilege. That's my mantra. That's what I live by.

Rowan: Yeah, look, and it's definitely something that, you know is the responsibility of developers these days because they're the ones that are writing the code. And so anything I think, which makes it easier for them, is a good thing.

Jeremy: So what about metrics, though? Getting metrics out, right? Because observability is one of those things where it's possible to do it. But CloudWatch is a good place, you think, to get those metrics?

Rowan: Yeah, definitely. Like most AWS services, you get a whole bunch of default metrics that will come out of there in terms of number of executions, how long they took, the various states they were in. And so I think that's a really good place to start with your monitoring of Step Functions. You know, obviously, CloudWatch dashboards have their limitations. But in lieu of any other kind of monitoring solution, it's a really good way to get some basic visibility, 'cause that'll let you identify any kind of anomalies. You know, if most state machines you have run in a matter of minutes, and you've got one that's going for days, you should probably have a look at that.
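
One way to act on those default metrics is a CloudWatch alarm on execution duration. The sketch below only builds the keyword arguments in the shape of CloudWatch's `PutMetricAlarm` request; the function name and alarm name are hypothetical, and it assumes the `ExecutionTime` metric in the `AWS/States` namespace, which is reported in milliseconds.

```python
def execution_time_alarm(state_machine_arn, max_seconds):
    """Build PutMetricAlarm-style arguments for unusually long executions."""
    return {
        "AlarmName": "long-running-execution",  # hypothetical name
        "Namespace": "AWS/States",
        "MetricName": "ExecutionTime",  # assumed to be in milliseconds
        "Dimensions": [
            {"Name": "StateMachineArn", "Value": state_machine_arn}
        ],
        "Statistic": "Maximum",
        "Period": 300,  # evaluate in five-minute windows
        "EvaluationPeriods": 1,
        "Threshold": max_seconds * 1000,  # convert seconds to milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
    }


# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**execution_time_alarm(arn, 600))
```

As far as the metric's definition suggests, the duration is only recorded once an execution closes, so an alarm like this complements state-level timeouts rather than replacing them.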

Jeremy: Definitely. So now that EventBridge is out too, you have the ability to trigger Step Functions from events. I mean, you could technically do it with CloudWatch events as well. So is that something you recommend using that as a way to invoke them, as opposed to maybe invoking them directly from a Lambda function or something like that?

Rowan: Yeah, definitely. You know, I think there's a lot of kind of scheduled tasks that will suit themselves really well to being defined in Step Functions and then executed on a regular basis. Whether that's using scheduled events or, you know, I see using EventBridge as a really good way to decouple the request for an execution from the actual triggering of that execution. So the thing that wants it could just talk to EventBridge and say, “Yep, I need this job done”, and then that can trigger the Step Function, which can then later on kind of return its status, you know, via the various other kinds of service integrations. And you've really decoupled the request from the performing of the work, and that's gonna work better in a serverless and asynchronous workflow in the long run.
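
The decoupling described here could look roughly like the following sketch. The requester only publishes an event, and a rule with a matching pattern (whose target would be the state machine) starts the execution. The source name `myapp.jobs`, the detail type `JobRequested`, and the helper function are hypothetical; the entry follows the shape of EventBridge's `PutEvents` request.

```python
import json

# Hypothetical event source and detail type for job requests.
SOURCE = "myapp.jobs"
DETAIL_TYPE = "JobRequested"

# Event pattern for a rule whose target would be the state machine.
RULE_PATTERN = {"source": [SOURCE], "detail-type": [DETAIL_TYPE]}


def job_request_entry(job_name, payload):
    """Build one PutEvents-style entry asking for a job to be done."""
    return {
        "Source": SOURCE,
        "DetailType": DETAIL_TYPE,
        "Detail": json.dumps({"job": job_name, "payload": payload}),
        "EventBusName": "default",
    }


entry = job_request_entry("nightly-report", {"date": "2019-09-30"})
print(json.dumps(entry, indent=2))

# With boto3 installed and credentials configured, the requester would send:
#   boto3.client("events").put_events(Entries=[entry])
```

The requester never needs the state machine's ARN or permission to start executions, which is the point of putting the event bus in between.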

Jeremy: Now, what are your thoughts on just this idea of service boundaries and where Step Functions might fit in? Because for me, I always look at it to say if I'm building a user service and I'm building a billing service, those are two separate things. I may have workflows within each of those services that require several steps, and I need to make sure they all complete. And I would definitely use Step Functions within those services, but what about coordinating across multiple services, right? So would you have a Step Function that says, “All right, I'm going to run a task that processes the inventory for this product, and then it's gonna go to another task that charges the credit card, or whatever those different steps are”? What are your thoughts about using Step Functions across these service boundaries?

Rowan: Yeah. Look, I think it's definitely something that can make sense. But the caveat is that it needs to be done in as simple a way as possible. And so that's where this new nested Step Functions approach could do that kind of processing. I think the key thing that anyone developing a serverless application needs to remember is you do need to plan for failure, even when using things like Step Functions. You know? Definitely. As you said earlier, everything fails all the time, so as long as your system can handle retries across those multiple systems, yeah, I think it can definitely work. And I really hope that as developers get more used to these tools, they'll be able to find new and interesting ways to apply them in, hopefully, ways that make everyone's lives better.
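
As a sketch of what a cross-service step with planned-for failure might look like, the task below starts a nested state machine owned by another service and retries it with exponential backoff. The billing state machine ARN, the state names, and the `retry_delays` helper are hypothetical; the helper just spells out the wait times implied by the `Retry` settings.

```python
import json

# Parent-workflow task that runs a (hypothetical) billing state machine
# as a nested execution and waits for it to finish.
charge_card_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution.sync",
    "Parameters": {
        "StateMachineArn": (
            "arn:aws:states:us-east-1:123456789012:stateMachine:billing-charge"
        ),
        "Input": {"order_id.$": "$.order_id"},
    },
    # Plan for failure: retry the nested execution with exponential backoff.
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    # If retries are exhausted, hand over to a compensating state.
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReleaseInventory"}],
    "Next": "ConfirmOrder",
}


def retry_delays(retrier):
    """Seconds waited before each retry attempt for one Retry entry."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


print(json.dumps(charge_card_state, indent=2))
print(retry_delays(charge_card_state["Retry"][0]))
```

Retrying across a service boundary only works safely if the nested workflow can tolerate being started more than once for the same order, so the billing side would need to be idempotent.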

Jeremy: Yeah, definitely. Um, so just maybe one more question about long running tasks. So you don't recommend using activities anymore? Are those even still available? I haven't used them in a while, and I actually haven't tried the callback stuff yet, so what are your recommendations on that?

Rowan: Yeah. Look, activities aren't gonna go away. You know, AWS doesn't get rid of services very often, they just kind of stop mentioning them in their updates. Look, there might still be occasions where activities make sense, especially if you already have those long-lived workers, and you really just want to control them and orchestrate them. That's where I can definitely see it making sense. Perhaps if you also have a worker that is not as well integrated with AWS and can't necessarily return that task token itself, although it's gonna have to be talking to AWS to get the jobs in the first place. But, you know, there may be some limitations there if your worker can't handle that. Other than that, you know, I think that waitForTaskToken, the callback pattern, is really just a much simpler and more elegant approach that's hard to beat.
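
For readers who have not tried the callback pattern, here is a minimal sketch under some assumptions. The task state sends a message containing the task token to a hypothetical SQS queue and pauses, with a timeout as recommended earlier. A worker then reports back through a Step Functions client; `sfn_client` is assumed to behave like `boto3.client("stepfunctions")`, and the queue URL, field names, and `complete_task` helper are made up for illustration.

```python
import json

# Task state that pauses until someone returns the task token.
wait_for_approval_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        # Hypothetical queue URL.
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
        "MessageBody": {
            "request.$": "$.request",
            "task_token.$": "$$.Task.Token",  # token comes from the context object
        },
    },
    "TimeoutSeconds": 86400,  # give up after a day instead of waiting forever
    "End": True,
}


def complete_task(sfn_client, message_body, do_work):
    """Run do_work on the request, then report the outcome using the token."""
    body = json.loads(message_body)
    token = body["task_token"]
    try:
        result = do_work(body["request"])
    except Exception as exc:
        # Bubble the failure up to the state machine rather than hiding it.
        sfn_client.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
```

Unlike an activity worker, nothing here polls Step Functions for work: the worker receives the token with the message and only calls back when it is done.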

Jeremy: Yeah, all right. So I actually have one more question, and that's on pricing, because I think pricing is something that turns a lot of people off to Step Functions because they're not cheap, right? So I mean, it's something like two cents, or two and a half cents, per 1,000 transitions. If you have nested Step Functions and you have all of these different workflows running through them, that can add up pretty quickly. What are your thoughts on pricing?

Rowan: Yeah, look, so it's, you know, if you look at the raw numbers, it's definitely not the cheapest. But what I found in practice is that it is pretty forgiving. You do have to be aware of it. And, you know, if you had a really high volume workload, maybe you deliberately try and not put Step Functions in the mix there and instead maybe use it at a higher level to, say, coordinate some of that batch processing rather than having a state or transition for every single processing activity that you do. I know for myself, you know, I definitely have learned the hard way. I misconfigured how my Step Function was being triggered and came in the next day and found that it had been triggering itself consistently for the last 20 hours and had run a couple of thousand executions. And luckily, you know, the free tier is pretty generous. So I think it cost me a couple of dollars. But, you know, I think you're probably gonna have a lot more issues with costs around services like API Gateway and some of those things than you are with Step Functions, unless you're putting it in your kind of high volume processing. And maybe don't do that.
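
The arithmetic behind this is easy to sketch. The figures below are assumptions rather than numbers confirmed in the conversation: $0.025 per 1,000 state transitions (the "two and a half cents" mentioned above) and a free tier of 4,000 transitions per month. The per-execution transition counts are invented for the example.

```python
from decimal import Decimal

# Assumed figures, in USD; check current pricing before relying on them.
PRICE_PER_1000_TRANSITIONS = Decimal("0.025")
FREE_TRANSITIONS_PER_MONTH = 4000


def monthly_cost(executions, transitions_per_execution):
    """Estimated monthly charge for state transitions after the free tier."""
    total = executions * transitions_per_execution
    billable = max(0, total - FREE_TRANSITIONS_PER_MONTH)
    return PRICE_PER_1000_TRANSITIONS * billable / 1000


# A runaway trigger: 2,000 executions of a 40-transition workflow.
print(monthly_cost(2000, 40))

# High-volume processing: 10 million events with 5 transitions each.
print(monthly_cost(10_000_000, 5))
```

Under these assumptions the runaway trigger costs about two dollars, while pushing every event of a high-volume stream through its own execution costs over a thousand dollars a month, which is why coordinating batches at a higher level is the cheaper design.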

Jeremy: Yeah, I mean, I think that's the biggest key is sort of like, if you're using it to process like clickstreams, that's gonna get expensive very, very fast.

Rowan: Yeah, definitely. You know, and there are other tools that are probably more ideally suited to that. I see Step Functions fitting into a much more kind of generic kind of higher level workflow rather than something quite as low level as clicks.

Jeremy: Yes, definitely. All right, well, so, I guess let's wrap this up. What would be your advice? You know, for people that have not used Step Functions yet? What's the best way to get started with Step Functions?

Rowan: Yes, I think going to the console, there's a lot of sample projects there. In particular, they have the callback pattern listed there. You know, I think most people, if they sit and think about that, they can probably come up with one or two workflows that they’d like to have automated. And you just have a try at representing that as a state machine. You know, the JSON language that's used to describe them in the language spec is really simple. You do have to get used to a few of the terms. You know, some of these various states and tasks that we've talked about here today, but it gets familiar pretty quickly, you know? So don't be turned off by the language.

Jeremy: Awesome. All right, well, listen, Rowan, thank you so much for being here, sharing all your knowledge with the serverless community. How can listeners find out more about you?

Rowan: Yeah, I do most of my work, probably through my blog, which is just rowanudell.com. I'm on Twitter occasionally as well. Spend most of my time talking about AWS stuff there. Other than that, you know, I've written a book on AWS, The AWS Administration Cookbook, which is actually about to get its second edition. I'm not writing the second edition, but you know that should be out in the coming weeks. Yeah, that's me.

Jeremy: Awesome. All right. Well I will make sure that we get all of that into the show notes. Thank you so much.

Rowan: Thanks, Jeremy.