Episode #39: Big Data and Serverless with Lynn Langit
In this episode, Jeremy chats with Lynn Langit about why big data is outgrowing traditional systems, how bioinformatics and genomics are generating the biggest data scale ever seen, and why serverless and the cloud are making it easy for researchers to process this data faster and more economically.
Lynn Langit is a Cloud Architect who codes. She's a Cloud and Big Data Architect, AWS Community Hero, Google Cloud Developer Expert, and Microsoft Azure Insider. She has a wealth of cloud training courses on Lynda.com. Lynn is currently working on Cloud-based bioinformatics projects.
- Twitter: @LynnLangit
- Site: LynnLangit.com
- Courses: https://www.linkedin.com/learning/instructors/lynn-langit
- GCP for Bioinformatics: https://github.com/lynnlangit/gcp-for-bioinformatics
- Genome Engineering Applications: Early Adopters of the Cloud by Jeff Barr
- Scaling Custom Machine Learning on AWS
- Scaling Custom Machine Learning on AWS — Part 2 EMR
- Scaling Custom Machine Learning on AWS — Part 3 Kubernetes
- Shopping with DNA
- Learn | Build | Teach
Jeremy: Hi everyone I'm Jeremy Daly and you're listening to Serverless Chats. This week, I'm chatting with Lynn Langit. Hi Lynn. Thanks for joining me.
Lynn: Hi. Thanks for inviting me.
Jeremy: So you refer to yourself as a coding cloud architect. You're also an author and an instructor. So why don't you tell the listeners a little bit about yourself and what you've been up to lately?
Lynn: Sure. I run my own consulting company. I've done so for eight years now and I work on various projects on the cloud. Most recently I've been doing most of my work on GCP because that's what my customers are interested in. But I've done production work on AWS and Azure. And I've actually done some POCs now on Alibaba Cloud. So one of the characteristics of me and my team is that we work on whichever clouds best serve our customers, which makes work really fun. In terms of the work that we do it really depends on what the customer needs because I have this ability to work in multi-cloud. Sometimes it's me working with C levels or senior technical people helping them to make technology choices, so based on their particular vertical. But at other times I'll hire a team of subcontractors for a particular project and we might build a POC. We might actually build all the way to MVP for a customer.
Lynn: And then occasionally I take projects where I build all the way out. The longest one I've had over the past few years is I did a project for 14 months where we went from design all the way out to product. And I worked every single day I was embedded with the developer team. So I do everything from design to coding to testing. It's a fun life.
Jeremy: It sounds like it. Well, so listen, I have been following you for a very long time and I'm a huge fan of the work that you've done. I've watched some of your videos on LinkedIn Learning and just been following along with some of this other stuff that you've done. And really like you said, a lot of what you have done has been around big data and recently you've been getting into, or you have gotten into, big data and serverless. And that's really what I'd love to talk to you about today because I just find big data to be absolutely fascinating and just the volume of data that we are collecting nowadays is absolutely insane. It's overwhelming.
And I don't know if traditional systems or if especially smaller teams working on some of these specialty products have the capability or the resources to keep up with the amount of data that's coming in based off of sort of some of these traditional methods to do that. So we can get into all of that. And I have a feeling this discussion will go all over the place, which is awesome. But maybe we could start just by sort of level setting the audience and just explaining what big data is or I think maybe what you mean by big data.
Lynn: I can have a really simple explanation. I'll say the explanation and I'll tell you why. So the explanation is data of a size that doesn't function effectively in your SQL Server or your Oracle Server or your data warehouse, so your traditional systems. And the reason I say this is because that is my professional background. I've been doing this for about 20 years now and for the first five or so maybe seven, I was working in those systems. I've actually written three books on SQL Server data warehousing. I worked for Microsoft as a developer evangelist back in 2007 to 2011. And the consulting practice that I built initially was around optimization of relational database systems.
So I was literally working on systems and figuring out, oh, this could be optimized. Let's optimize it. Oh, whoops, we have too much data now, what do we do? So when I left Microsoft in 2011 to launch my consultancy, I left because I was so fascinated by what was coming beyond these systems. One impetus was the launching of Hadoop as an open source project. And literally when I left Microsoft, I went to New Jersey and I took a class with Hadoop developers, which was really throwing me in the deep end because I had come out of the Windows ecosystem. Of course the class was on Linux, in Java, all coding. And I learned a lot that week.
Jeremy: I can imagine, yeah. So that's maybe my question there. So big data is this volume of data, this immense amount of data that's coming in that I think as you put it, that sort of these traditional systems like a SQL Server or even an Oracle can't handle or at least can't handle at a scale that would make the processing easy. So you mentioned Hadoop and there's other things like Redshift is now a popular choice for sort of data warehousing. And then you've got Snowflake and Tableau and some of these other things I think that are ... products out there that are trying to find a way to analyze this data. But what is the problem with these traditional systems when it comes to this massive amount of data?
Lynn: Well, it goes to the CAP theorem, which is consistency, availability and partition tolerance. This is sort of classic database ... what are the capabilities of a database? And it's really kind of common sense. A database can have two of the three but not all three. So you can have basically the ability for transactions, which is relational databases, or you can have the ability to add partitions, to really simplify it. Because if you think about it, when you're adding partitions, you're adding redundancy. It's a trade off. And so are you adding partitions for scalability? And so when adding partitions makes a relational database too slow, then what do you do? So what you then do is you partition the data in the database into SQL and NoSQL.
And again, I did a whole bunch of work back in 2011, 2012, 2013. I worked with MongoDB, I worked with Redis. And one of the sort of key, I don't know, books I guess, would be Seven Databases in Seven Weeks. It's still a very valid book even though it's many years old. It tells how you do that progression and really turned the light on for me, because prior to that point it was, oh, just scale out your SQL Server, scale out your Oracle Server, which still would work but these NoSQL databases were providing much more economical alternatives. And of course I'm always trying to provide the best value to my customer. So if it wasn't a great value to buy more licenses for SQL Server or for Oracle, and you'd rather get a Mongo cluster up or a Redis cluster up, you could partition your data if that was possible, because there's a cost to partitioning your data and writing your application.
So I just found those trade offs really, really fascinating. And of course during that time, cloud was launched, led by AWS. Microsoft had an offering, but they didn't really understand the market until a little bit later. So Amazon had an offering and they first started, it was really interesting. They started by just lift and shift with RDS at a PaaS level taking SQL Server and actually making it run effectively in the cloud. That was how I got started, because my customers wanted to lift and shift and maybe go to an enterprise edition and run it on cloud scale servers.
And ironically, because they're kind of co-located, some former Microsoft employees who were kind of frustrated at that time with what Microsoft was doing with SQL Server in the cloud went over to Amazon. Notoriously, right after I left, they presented at SQL PASS Summit on Amazon SQL Server RDS. And PASS Summit was kind of a Microsoft-centered event and the Amazon people came over. And because of that, I kind of to this day have a pretty good relationship with the Amazon data services teams.
Jeremy: So then this idea of moving away from those systems, so we have, as you mentioned, NoSQL. So we have DynamoDB in the AWS ecosystem, you have Mongo, you have Cassandra, some of these other things. And all of them, maybe less so I guess with DynamoDB, had some scaling problems sort of built into them. But this is something where I think you started looking at serverless tools to try to handle that big data. Now with things like data lakes and S3, or something like maybe BigQuery or Cosmos DB, what tools are you moving to now to handle that scale of data?
Lynn: Well, change comes slowly and change is usually induced by some sort of pain. And so the pain in my case and my customers' case was through IoT data, because IoT data increased the amount of data exponentially because it's event-based data. So I had some customers, some of the big, big, like the biggest appliance manufacturer in the United States. Customers I can't name, but you can guess who they are. And this was maybe eight years ago, so it was still a while ago. They wanted to IoT enable their devices.
And again, to be very clear, the majority of the enterprise applications that I would work with would be SQL plus NoSQL because they would have a need for transaction. And again, that's really important because I saw those startups go just directly to NoSQL and then they would call me and they would try to tune their transactional consistency of their Mongo and it would be clustered and it would be a mess. And then we just pull that out and put it in MySQL. Just the whole space was super interesting. So meanwhile the cloud vendors are evolving and Amazon of course comes with DynamoDB. And I have to tell you that initially I was super resistant. I was like, how do you even query that? I actually did some time tests and blogged about ... this is like seven, eight years ago.
You write a SQL query, everybody knows how to do that. You write a Dynamo query, it takes 15 minutes because you have to research the query and how much is that in your dev time and dah, dah, dah, dah, dah, dah, dah, dah. So there was resistance, including from me. The service on the cloud that really stunned me and still does is BigQuery because BigQuery offered SQL querying, which I think is extremely important: when you're evaluating different kinds of database solutions, look at what the ramp up time is to understand how to get data in, how to take data out. And the more different the query languages are, the more errors you're going to have too, and this is your data. So I've seen a lot of bad things where developers overestimated their abilities. And because the query languages were really idiosyncratic or esoteric for the NoSQL databases, there were all kinds of problems.
But BigQuery's idea of, okay, you get around the scaling problem by using files and you then just use SQL and you just pay for the time. I mean, I literally, I got goosebumps. I was stunned when BigQuery came out. I was stunned. I really got it from the beginning. And I've written about it, I've used it. I would often add it for customers as sort of an incremental step rather than NoSQL. I called it kind of NoSQL going all the way back to SQL. Right?
Lynn: But to this day, there's still a lot of people that you show them BigQuery and they just don't believe it because you go to the query window as you know and you say ... Well, now it's more common since Amazon has Athena and I don't know, Azure has something like that. But even two, three years ago I was at a serverless conference and I was doing a talk on serverless SQL, and at the end, I do live queries on BigQuery on terabytes and I explain how it just costs pennies and as you obviously know, there's no service set up, there's no scaling, there's no clustering, there's no whatever. And people just wouldn't believe it.
Jeremy: Right, yeah.
Lynn: I was like, really. So I think it's really interesting that BigQuery, the biggest problem for a long time was that once you gave it to customers, what would happen is they would spend too much money, which means they'd love the product. And it was so interesting interacting with Google and saying, "Okay, I know you use this internally and so you guys aren't ... you're really used to cost controls and stuff, but customers need to understand what they're going to be spending. So you need to put cost estimates and cost caps and all that." And frankly that's just been coming in the last couple of years for BigQuery. And I think that that really hindered adoption, really a lot.
And it's interesting to see this pressure between cloud vendors because Google, on some of their other services like their VMs, will put their pricing right in the console, and now that's pushing Amazon. So when you're setting it up you can click and size it, and if you add more CPUs or whatever, you can see how much it will be. And I noticed, because I just did a refresh to one of my Amazon data courses for LinkedIn Learning, that Amazon is now putting this in the console. And I think that's really great. I love that.
Jeremy: Yeah. I mean, one of the things that I think is a good criticism or is a common criticism of services that are pay as you go is this idea that you don't really know how much you're going to spend until you start using it. And then you get all these benefits of not paying for maintaining some level of scale so that you have that availability. But then if you do sort of start using it heavily, then those prices go up. And as someone who is deep into the AWS ecosystem, I've used Athena quite a bit and when it first came out and I first started using it, I was very nervous thinking to myself like, wow, if you're searching through a terabyte of data and it's going to cost you $5, how many times do you run these queries and so forth.
But one of the things that I found with that, and I think it's very similar with BigQuery, is that these are all very, very optimized the way that they search. So you can limit them by date and you can limit them different ways. So you're not necessarily searching through terabytes and terabytes of data every time, especially if your queries are optimized in the right way.
Lynn: Yes. But you have to know how to do that.
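The point about limiting a query by date can be put into rough numbers. The following is a minimal Python sketch, not Athena's or BigQuery's actual billing logic: it assumes the flat $5 per terabyte scanned that Jeremy quotes, decimal terabytes, and a hypothetical table split into 365 daily partitions of 10 GB each. The function names and the table layout are illustrative assumptions.

```python
from datetime import date, timedelta
from typing import Dict

PRICE_PER_TB_USD = 5.0   # figure quoted in the conversation; real pricing may differ
BYTES_PER_TB = 10**12    # assumption: decimal terabytes


def scan_cost_usd(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of a query that is billed purely on bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def bytes_in_range(partitions: Dict[date, int], start: date, end: date) -> int:
    """Bytes read when only partitions dated within [start, end] are scanned."""
    return sum(size for day, size in partitions.items() if start <= day <= end)


# Hypothetical table: one 10 GB partition per day for a year.
first_day = date(2019, 1, 1)
partitions = {first_day + timedelta(days=i): 10 * 10**9 for i in range(365)}

# Unfiltered query: every partition is scanned.
full_scan = scan_cost_usd(sum(partitions.values()))

# Date-limited query: only one week of partitions is scanned.
one_week = scan_cost_usd(bytes_in_range(partitions, date(2019, 3, 1), date(2019, 3, 7)))

print(f"full scan: ${full_scan:.2f}, one week: ${one_week:.2f}")
```

Under these assumptions the unfiltered query reads 3.65 TB while the date-limited one reads 70 GB, which is why a missing date filter on a much larger table can turn into a surprise bill.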
Lynn: So what I have seen now that I'm working with customers with enormous amounts of data, this actually happened and it wasn't related to me showing them BigQuery, but this actually happened at one of these customers. They got very excited and they got a BigQuery bill of $83,000 in one day.
Lynn: Now this customer, because they're one of the biggest users of GCP in the world, they have the relationship where you can kind of have ... this kind of thing can be forgiven occasionally. But the greater point, and this is very germane to serverless services in general, is that there is a sliding scale. So let's take Amazon. Lambda basically costs you nothing until you're Spotify level, but what does cost you? Well, Dynamo costs you.
Lynn: Well, Athena costs you. And so it's an entirely new pricing model that I find my customers are just utterly confused by. And so part of my advice when customers are moving into the serverless platforms is that they have to repurpose some of their team members to work with cost estimates and cost management. And if they don't do that from the beginning, then they're not going to get the level of value from the serverless services that could be had. That is, it doesn't have to be a full time role, but it has to be a role. I can't tell you the number of times that I've had these dev teams say, oh, well we don't see the bill. They call us if we go over a certain amount. I said, "That's just irresponsible."
Lynn: In the old box software on-prem days you had licensing specialists, so what are these people doing now? In fact, again, I don't mean to be constantly plugging my courses here, but I try to make courses around needs in the industry. So I made a course called AWS Cost Control. And I often recommend it to students that reach out to me who maybe come from a finance background or come from some other background that are moving into cloud computing because I say this is a skill area that I find not covered. I think it really actually hurts the adoption of serverless actually because you sort of get the, oh, I got the shock Dynamo bill or I got the shock BigQuery bill. And again, this is just part of using the serverless services properly.
Jeremy: Right. Yeah. And actually you mentioned the cost, knowing how to use BigQuery or Athena correctly to optimize for those costs. The same thing is true with DynamoDB. I mean, I think a lot of people use DynamoDB in a way that makes it very, very expensive as opposed to storing data in different ways and having the indexes optimized the right way and doing queries instead of scans and other ways that they can optimize that. All right. So let's talk a little bit about sort of where some of this big data is going because I think we've got some serverless solutions and I guess we could get into some of that a little bit more. But I'm really interested in this sort of rapid increase in data, you mentioned IoT, but you're seeing a really, really big growth in data in a different sort of industry now. Can you explain that a bit?
Lynn: Sure. About three years ago now I had a couple of things in my life that changed my professional life. The first was my then 17 year old daughter went to Stanford Summer Program and she was at that time interested in bioinformatics. So she got enrolled in regular Stanford level classes and she was doing some bioinformatics. And I was thinking, oh, big data I'm really interested. I've heard this is a lot of data. And she came home and she said, "Mom, these people don't use the cloud." And I said, "What?" And I said, "Aren't you using Docker?" And she, "No, no, we have to make tiny data sets on our laptop or we have to SSH into the mainframe." And I was like, "Are you joking?" This is Stanford, so I was shocked. So that was the first thing.
The second thing was a very close friend of mine got breast cancer and because of some other work I had done in the bioinformatics community in San Diego, I knew there were lots of immunotherapies that were becoming available and she was not able at that time to get her tumor sequenced or even participate. And she had a very, I would almost call it dehumanizing course of treatment. And she recovered, she completely recovered, but it was just unnecessarily horrific. And at the same time, the third thing, Google came to me and said, "As a Google partner, we have this interesting data challenge with the bioinformatics company or group. They are really interested in using GCP and starting to move their on-premise methods for research into the cloud. Can you work with them to use Docker and some of the services?"
I said, "Yes, I can." And so I did and I had so much to learn and I found that the group, it's called the Galaxy Project. It's a consortium, mostly based in Europe, that does genomic analysis workflows for research that will discover immunotherapies. They had a conference in Australia and I had some personal interest in Australia at the time so I called Google and I said, "Can I go?" So I went from, this is interesting, to presenting at an international conference and being the only non-PhD there, which was, oh my gosh, it was intimidating, but I did get it to work because I do have the dev ops sort of chops and team. So I ran a live dockered-up little Galaxy cluster on GCP, which was kind of a cool feat because at that time there were no GCP data centers in Australia.
So props to GCP, I was running it out of Singapore and I did it live. I tried to go into the world of the bioinformatics people. I participated in a five day training on the tool. And I told them why I was there. I just was honest. One of the things that I do on all the crazy stuff I did in my career when I go kind of out there at the end of the ladder or whatever, I always say default to honest. So I just tell people, I tell a story of my friend Terry, and I tell them why I'm here. And I tell them, I'm a new student in bioinformatics and I'm going to watch the Illumina sequencing videos at night and I'm going to try to make a contribution to what we're doing, but I'm a new student.
Jeremy: Well, I think it's amazing, especially the fact that you had this opportunity to use your powers for good. And so I want to get into exactly the scale of bioinformatics because this is something that I didn't even realize was this big. Until you and I started talking earlier, I was sort of like, hey, I know there's a lot of big data like genomics, that kind of stuff, that's big data. But I did some research on this and I found this article about a system that was trying to do RNA sequencing and there was something like 640 million reads that they had to do. And it used to take I think 29 hours for this to work. And this is where serverless comes in because they took Lambda functions and they were able to optimize this down to where it used to take 29 hours, now it takes 18 minutes and it costs them $2.82 to do that.
Lynn: Yes. Speaking of serverless and genomic scale data, which is really a super set of big data, continuing my story of Australia, this is how I moved into serverless genomic solutions or serverless solutions for genomic analysis. At that conference where I presented, there were some researchers from the equivalent of the National Science Foundation of Australia. It's called CSIRO, the Commonwealth Scientific and Industrial Research Organisation. So interesting, they had this burstable search problem, genomic scale, to find where the edit points in a genomic sample would be. And the edit points would be for CRISPR-Cas9 editing so that you can then try to develop immunotherapies. So you want to have a fast feedback loop so you can go, okay, I could cut here, I could cut here. And the reason you have to have that is because when we think of DNA, when we non-biologists think of it, we think that it's like 3 billion letters all in line, like a big road, like a straight road.
But it is not, DNA is coiled and curled and clumped. So just logically, if it's all smooshed together, it's hard to find the precise cut point. Now there's more scientific properties than just smooshed together. But that's basically it, there are a set of properties that help to determine optimal cut points. So these guys in Australia, they went to an AWS Summit in Sydney and they saw the sort of classic serverless pattern for burstable websites. So they saw S3 with Dynamo and they saw Lambda with API Gateway and they said, you know what, we're running out of room on our shared on-prem cluster, let's just try this, let's just try to build something. And they did. And it was published by Jeff Barr as one of the first all serverless genomic applications. And the application is called GT-Scan2 and it's on Jeff Barr's blog and we can put the link in here, and it's just a classic architecture. And they built it really fast. And it was up and running.
So I met them at the conference in Melbourne and they were based in Sydney, and I was going to Sydney for something else. And I said, "Can I just come and talk to you about this?" And they're very smart researchers. They researched me and they said, "Yes, you can come and help us with a challenge we have on this. You're going to get here for free, right?" This was a couple years ago now, like three years ago. I was kind of in the beginning of my journey to genomics and serverless. So we sat down and they said, "This runs, but sometimes it bottlenecks." And I said, "Oh, what are you doing about your logs?" Because of course with serverless applications, it's all about the logs, right?
Lynn: And they said, "Well, we're not really familiar," because they had never really done serverless before. And at that time Amazon had just released X-Ray, like literally the month before. And so I looked like the complete hero because I said, "Let's take a look using X-Ray. Let's instrument this." And bam, in one day we found that one Lambda was causing 80% of the problem. They rewrote it, fixed the bottleneck and we published that too.
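The pattern behind numbers like these is a fan-out: split the reads into independent chunks, process the chunks in parallel, and combine the partial results. The sketch below illustrates that shape in Python only. It is not the implementation from the article Jeremy mentions; a local thread pool stands in for Lambda invocations, and `process_chunk` is a hypothetical placeholder (counting GC-rich reads) rather than real RNA sequencing analysis. The last lines work out the speedup implied by the quoted 29 hours and 18 minutes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def chunk(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def process_chunk(reads: List[str]) -> int:
    """Placeholder per-chunk work: count reads that are more than half G or C."""
    return sum(1 for r in reads if r.count("G") + r.count("C") > len(r) / 2)


def fan_out(reads: List[str], chunk_size: int, max_workers: int = 8) -> int:
    """Process chunks in parallel and add up the partial counts.

    The thread pool is only a local stand-in for invoking one function per chunk.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(process_chunk, chunk(reads, chunk_size)))


# Toy input standing in for hundreds of millions of reads.
reads = ["GGCC", "ATAT", "GCGA", "TTTT"] * 250
gc_rich = fan_out(reads, chunk_size=100)

# Speedup implied by the figures quoted above: 29 hours down to 18 minutes.
speedup = (29 * 60) / 18
print(f"GC-rich reads: {gc_rich}, speedup: about {speedup:.0f}x")
```

Because each chunk is processed without reference to any other, the wall-clock time is governed by how many chunks can run at once, which is the property that pay-per-invocation functions exploit.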
Jeremy: So that's one of the things though I think that's really interesting about, again, moving from this on-prem, you said they're running out of space in their on-prem solution. And I think that's the problem with all on-prem solutions is eventually either you're going to keep buying more hardware, buy more hardware, or you're going to run out of space. Or you're going to run out of compute power, which is another thing, which is why you see some of these examples of Hadoop jobs running for 500 hours to try to process some of this stuff. So that's another limitation I think that that serverless overcomes when it comes to data at this scale is for small teams, especially research teams that don't have huge budgets, that can't run massive clusters by themselves. And even if they could would still have to wait hours and hours and hours and hours to get feedback on the work that they're doing. So how does serverless and the cost reduction help sort of the smaller research teams?
Lynn: Well, again, continuing on working with this team because they were a really small team at that time. They've subsequently got more people and stuff. They said, "Okay, great. Good job, coding cloud architect. We have another challenge for you." It was a big challenge given my skill level at the time. And it was one of those 500 hour problems. They had written a customized machine learning library for analysis of genomic variants, or differences in the disease sample versus the reference. And it's called VariantSpark and it's written in Scala to run on their internal Hadoop Spark cluster. And again, they had to wait for time on the cluster and then it literally did take up to 500 hours. This is a 500 hour problem. And they said, can't you do something on the cloud?
Okay, here's the tricky part. The computation was machine learning and stateful. And they were really new to cloud and they had a really small team. At that time they had no dev ops people. So exactly the thing you're talking about. So one of the things that was really interesting because it was very much a process and it ended up ... because I didn't do it full time. And we solved it. We got it down to 10 minutes with serverless. But the process is the key. I actually wrote this up on medium.com there's a series of three articles. So the first step in the process was they hadn't even used Docker, like at all. And so what we did first is we didn't even look at VariantSpark because it was too complex and machine learning and probabilistic. We took a bioinformatics tool that is deterministic.
It really doesn't matter. It's called BLAST, which is Basic Local Alignment Search Tool or something like that. It basically just is the first level of analysis and it's a single executable. And we just put that in a Docker container so they could have one input, a process and one output and see how that scales, and then do that on-prem and then do that on the cloud because you have to have this ladder to learn. This is, again, a really common situation I find with customers moving to serverless that have experience in sort of traditional enterprise: VMs or on-prem clusters or whatever. It's really hard to go directly, really, really hard even for new implementations. So then the next level we did is we did EMR just almost as a lift and shift so that they could learn best practices like CloudFormation rather than clicking on the console. You know what I mean?
And then what was interesting, this was a Spark-based library. Conveniently the Spark team, the open source group, made Kubernetes a potential controller rather than YARN. So I said, "Oh, here we go. We can now make a data lake." So the progression was we went from on-prem 500 hours to EMR to learn cloud skills basically. And that is something that still a lot of the researchers use because it's simpler, right? Serverless isn't always the right answer. If you just are going to just kick off a CloudFormation template and run a small size test job on EMR, that's more familiar. You just click and you're done and it takes maybe an hour. And the huge scale which runs on the data lake ... So the difference in the serverless aspect of the solution we built was not in the compute layer, it was in the data layer.
We got rid of HDFS, we got rid of ... we put all of the data in S3, and then the compute layer was Docker. We dockerized Spark for EKS. And I think we were actually one of the first groups in the world to do it. And we published on it. It's all on GitHub. And Amazon actually supported part of that research because of what we were doing. So thanks to Amazon for that. And so what the CSIRO teams will do is they'll use the Kubernetes for the really huge scale and when people are more comfortable. So this idea of having like a menu for this particular group was vertical where they sort of start with something that's more familiar and then they move to the serverless level. It worked in this particular case.
Jeremy: Yeah. And now, so that's interesting because you mentioned and obviously you're big into education, you mentioned sort of learning how to do some of this stuff in that progression. So obviously there's a lot of tools and there's a lot of things you need to do with serverless that aren't quite as straightforward maybe as the, I guess, the traditional sort of like you said, the EMR approach for example. So maybe in your opinion, what's the correct level of abstraction then for some of these teams to work with serverless? Because I think right now it's very much so pick and choose all the little things you want, configure them, turn all the knobs, tweak all the dials and things like that. But is there another level of abstraction above that, that might make it easier for some of these smaller teams to start with serverless?
Lynn: Yeah, it's interesting. So I've subsequently started working with another bioinformatics client and quite a large one. It's the Broad Institute at MIT and Harvard. And they are in some ways kind of like a bioinformatics incubator. They have 5,000 researchers that have various labs and they have massive on-premise compute resources because they're well funded. But because of the volume of data, are starting to exceed those resources. So they have had a multi year cloud enablement project, which includes working directly with services, whether it's AWS, GCP or whatever. But they have been collaborating with Alphabet, which is spinoff off from Google. It's actually the Verily Group.
And they have created a higher level abstraction, almost a SaaS level, which is called a tara.bio. So it's a website basically. And what it implements is not only this higher level GUI based abstraction, but within the Broad and in collaborating with other researchers worldwide. All the stuff is open source. They have, for example, created a configuration language called WDL which is Workflow Definition Language, which is designed as a higher level of distraction over like a cloud formation or a Terraform because it allows configuration of the execution environment, so the VMs or whatever. But it also enables configuration of the bioinformatics tools. So another trend that's happening in the bioinformatics industry, and I saw this in other industries like ad tech and finance previous, is as the volumes of data go up, there start to be ... instead of just using scripts or code to manipulate data, there starts to be a tool repositories.
So in the case of the Broad, they have this tool that ... they've made several tools, but their big tool is called the GATK, the Genome Analysis Toolkit. And it's a jar file that has over a hundred common tools. So for example one of the things that you might do in the analysis is you might look for duplicates in the reads and so they have a duplicates feature that you can then configure. So then working at this level, you're more configuring. So if we take it back to sort of working generally with serverless versus not serverless, the amount of configuration code is growing and growing because with serverless you have more parts and pieces. And when you go a higher level up, the configuration code is also extremely important. So it becomes, I think, almost as balanced maybe to the point where you're at the level of Terra, you're not really even writing any application code, it's all configuration code.
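To make the idea of looking for duplicates in the reads concrete, here is a minimal Python sketch. It is not the GATK's actual algorithm or interface. It assumes a toy rule (reads aligned to the same chromosome, position, and strand count as duplicates, and the highest-quality one is kept), and the `Read` type, its fields, and `mark_duplicates` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified aligned read (not a real GATK type)."""
    name: str
    chrom: str
    pos: int       # leftmost aligned position
    strand: str    # "+" or "-"
    quality: int   # assumed summary quality score


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Toy rule (an assumption): reads sharing (chrom, pos, strand) are
    duplicates of each other; the highest-quality read in each group
    is left unflagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.quality)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


reads = [
    Read("r1", "chr1", 100, "+", 30),
    Read("r2", "chr1", 100, "+", 35),
    Read("r3", "chr1", 100, "-", 20),
    Read("r4", "chr2", 500, "+", 40),
]
print(sorted(mark_duplicates(reads)))
```

In a real pipeline this step would be a configured tool invocation rather than hand-written code, which is the point being made about configuration replacing application code.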
Lynn: So again, this is like a personal question: what is code? What is being technical? And I really strongly believe that DevOps and configuration code is code and needs to be checked in and needs to be source controlled and needs to be reviewed and all that kind of stuff. And I think, again, this is a big problem in serverless in general, that there's still this lingering sort of bias that if it's not Java or C++ or something like that, it's not code, that YAML is not code. Well, it's our podcast so we get to say, I think YAML is code.
Jeremy: I totally agree with you. And actually I think YAML in some cases and some of these configurations are more difficult than writing code in some cases. So I definitely agree with you on that. So I think all that's fascinating and I find this approach to big data, moving to these things where like you said the cost can be dramatically reduced and whether the compute is on something like Kubernetes or even if like this other example with the RNA sequencing is using Lambda or some other functions as a service, just this idea of being able to store this massive amount of data and being able to process this massive amount of data and use it for something that isn't ad tech.
I mean, it's great when we can serve up a nice personalized ad for somebody, but it's better if we can map the coronavirus and find some sort of a vaccine for it or something like that. So maybe this is a question for you, because you had mentioned, three years ago you really had nothing to do with bioinformatics or genomics and you kind of got into this. Now you're working with the Broad Institute there and that's really fascinating. So is this going to be a big growth sector, you think? I mean, people are saying, "Hey, I want to do good, I want to work in serverless, I want to work in big data." This is sort of the next place for people to go, right?
Lynn: I think so. I mean, there's a couple of things. First of all, even if you're not motivated by the ethical concerns, it's the most data that I've ever worked with. I mean, financial it's sort of similar. But just to make an example, and this is public information because the Broad publishes a lot of customer stories. They are currently putting in on average 17 terabytes per day into the Google Cloud. We all like hard problems, we're builders and so this is a really fun, hard problem. And the patterns used to come for big data out of FinTech and ad tech and I worked in those areas and I'm trying to apply some of those patterns. But I do think that the most interesting place to work for a big data professional is human health right now because of sequencing.
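For a rough sense of scale, the 17 terabytes per day figure can be turned into yearly volume and sustained throughput with a few lines of Python. This is only back-of-the-envelope arithmetic: it assumes decimal terabytes and a perfectly even upload rate, neither of which is stated in the conversation.

```python
TB_PER_DAY = 17                  # average quoted in the episode
BYTES_PER_TB = 10**12            # assumption: decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60

# Yearly volume if the daily average held for 365 days.
tb_per_year = TB_PER_DAY * 365
pb_per_year = tb_per_year / 1000

# Sustained rate if uploads were spread evenly over the day.
mb_per_second = TB_PER_DAY * BYTES_PER_TB / SECONDS_PER_DAY / 10**6

print(f"{tb_per_year} TB/year (about {pb_per_year:.1f} PB/year)")
print(f"about {mb_per_second:.0f} MB/s sustained")
```

Under those assumptions the daily average works out to roughly 6.2 petabytes per year, or a little under 200 MB every second around the clock.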
One of the really cool things at the Broad, I had done a presentation there, which again, I'm still intimidated to do on the work that I did in Australia because they were interested in that reference example on AWS.
Lynn: And one of the researchers said, "Have you seen some of our sequencing facilities?" And I said, "I really haven't." And so she was kind enough to arrange a three hour tour of one of their principal sequencing facilities, which used to be a beer plant, which cracked me up. Because in the big freezer they have millions of human DNA samples now, which was the beer storage. But the most impactful thing to me throughout this tour was, because she'd worked there for five, six years. She said, "We used to take really long periods of time to just sequence parts of the genome, the chromosomes or the exomes." But now they have rooms full of Illumina sequencers, they have Illumina people onsite and they are just like constantly just processing, processing.
And this has been within the last three to five years. And now we're even getting testing, like handheld sequencers. And there's really just interesting aspects. When I was in London last year, there was a new store, first one in the world called DNANudge. And what they do in the store is you spit in a tube and within one hour you get the results back and it's a sequence. And I wrote again, a Medium article about this, but it's a very small subset. Again, you have 3 billion letters. So what they do in that store, it's almost like a Regex on the genome. They go to certain spots and they say, okay, this is ... one example is caffeine metabolism, which I happen to have high caffeine metabolism.
So they go to one spot. It's not going to be all caffeine metabolism, but just a known spot. Do you have high, medium or low? So it's very much a subset. And then what they've done, it's a really interesting concept, is they give you a wristband and it's like a Fitbit, so it tracks your activity and you store the results in there. It's a little tiny chip. And then they partnered with a major grocery store. And based on the eight characteristics, when you go shopping, you can actually see red, green, or yellow for those characteristics and your activity.
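The "Regex on the genome" idea, going to a few known spots and reading off high, medium, or low, can be sketched in Python as a simple position lookup. Everything specific here is made up for illustration: the marker positions, the allele-to-level mapping, the second trait, and the sample sequence are hypothetical and do not reflect real genetics or how DNANudge actually works.

```python
# Hypothetical marker table: 0-based position -> (trait, base -> level).
# Positions and mappings are invented for illustration only.
MARKERS = {
    4: ("caffeine metabolism", {"A": "high", "C": "medium", "G": "low"}),
    9: ("example trait", {"T": "high", "C": "low"}),
}


def read_markers(genome, markers=MARKERS):
    """Look only at the known positions and classify each trait."""
    results = {}
    for pos, (trait, levels) in markers.items():
        if pos >= len(genome):
            results[trait] = "not covered"   # sequence too short
            continue
        results[trait] = levels.get(genome[pos], "unknown")
    return results


sample = "GGTCAATTGCAA"   # toy sequence, not real data
print(read_markers(sample))
```

The design point is that the lookup touches only a handful of positions out of the whole sequence, which is why a result can come back far faster than sequencing all 3 billion letters.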
Jeremy: Oh wow. That's amazing.
Lynn: It's super interesting. So this idea of different kinds of sequencing, not sequencing the whole genome, just sequencing parts of the genome. And very timely. The Broad has actually published on this. They're actually working on an improved coronavirus test. Not the exact same principle, but kind of that idea so that the results can come back faster and be more accurate as the sequences are verified. Because again, one of the things that's been really fascinating to me about the coronavirus is, again, I obviously haven't been working with genomics for a long time, but the fact that the Chinese sequenced the virus immediately, they released the information and there are open source repositories. One of them is Galaxy; people that I worked with actually published a workflow on GitHub and they've published a paper. I mean this is the dream. And of course it's all running in the cloud. And as they get new data coming in, they update the paper. So to enable citizen science, people working together faster because of the cloud.
Jeremy: Yeah. Again, it's fascinating to me and I think the impact just from a health standpoint and ... I don't know. I mean, this is going to bring us to the point where we are able to find better cures for cancer or to test drugs faster or to see how the changes in the environment ... I mean, that's something I'm interested in as well, what are the environmental impacts or the long-term environmental impacts of some of these things and being able to test changes in DNA and that kind of stuff based on environmental factors and all that stuff, I think it blows my mind. So certainly congrats for working on this stuff. And I think this is where we can maybe change the conversation to you being recognized as an AWS data hero because this is true hero stuff in my opinion that you're doing with data.
So again, thank you for the work that you do. But you're not only a data hero, you're also a Google Developer Expert. And I think you were maybe the first one that was announced as a Google Developer Expert?
Jeremy: Which is pretty fascinating. So I'm going to make this typical thing I think that most men with daughters say. But I'm going to put it out there anyways and just say this, I have two young daughters. I have an 11 year old and a 13 year old daughter. One is really, really great at math. The other one is very, very interested in biology. And I really, really hope that they stick with those STEM type subjects and that they continue to go and pursue those things. And that hopefully this country and this world will allow them to continue down those paths with as few barriers as possible. But I think that that has been a major thing that has challenged women in the STEM education or STEM jobs and things like that. And you're one of those people who I think is truly inspiring to my daughters for example, because you just broke through all that and now you're doing stuff that anybody can look up to.
Lynn: Well, thank you. Well, it's all about teams. And one of the ways that I'm able to accomplish providing value to my customers is by working with like minded people from all over the world. And one of the successes that I've had is this idea of learn, build, teach. And again, I wrote about it on Medium because I always have to share.
Lynn: And I encourage it in the people I work with. For example, I just had this young woman, I'll tell you this story because it just makes me so happy. She was one of my students from LinkedIn Learning and they write to me and they always ask me to mentor them. And I actually don't personally find mentoring valuable for me or the person. Now, I know for a lot of people it is. That's great. But what I do is I consider having people work as interns. And the way that it works is they work with me remotely for a couple of one hour sessions, like three or four. And then if we both decide we want to continue, I'll actually hire them as subcontractors. And for this particular person, she came out of the finance industry, she's a grad student. And she had a dream to be a cloud architect and this kind of stuff. And so we worked together for about six months. And she's just left my internship because she got a full time gig with Amazon.
Lynn: And I told her, I said, "Okay, you tell me that the way that I work has been helpful to you. Now it's your turn, you learn, build, now it's your turn to teach." So I try to get that idea going with people. Because I think that it's not about an age or about a degree or whatever, it's about what you have done. And I think that that applies even to younger people. When I've worked with some people even in high school that have done hackathons, if they built something: learn, build, teach. Because what you're doing then is you're giving back, but you're also establishing your competence. Because another severe problem we have in our industry is bias in tech interviewing. I mean, I'll just say, because I think it's important for people to hear. I have failed many tech interviews, many. And even with my experience, it's very disheartening because I don't have a CS degree, I'm self-taught.
And the interviews at the big companies test for one very small set of skills. They test for did you go to Stanford and take algorithms, and I didn't. And yes you can spend hundreds of hours and memorize that just for the interview and that might be okay for some people. But I just feel that the industry needs to grow up because we have all these very competent people that are technical and I define technical as somebody who has a curiosity and an ... I also want to say when you're driving a car you would like to open the engine and look inside, you're technical.
Jeremy: Very good point.
Lynn: Yeah, you're technical and I don't interview and I don't conduct interviews. Again, I wrote another Medium article about this. I was talking to a founder in Berlin recently and he was like, "How do we get people, everybody's failing our interview process." I said, "That's because your interview process is crap."
Lynn: I said, "You just go to hackathons and you work with people." It's all about working together. And then you understand what does the person do when they get stuck? Do they Google the problem? Do they ask for help? Do they try to compute it in their head for 27 minutes and not talk to you? Who is going to make the most contributions to your team?
Jeremy: Yeah. No, and I think it's a confidence thing as well. I tend to find that people who know a lot of things tend to think they don't know a lot of things. And people who don't know a lot of things, they know everything, right?
Jeremy: Or at least they think they do.
Jeremy: And that's actually one of the problems that I've found even when I ... because I've interviewed for several companies or I've interviewed people for those companies. And I mean, even just talking to people to be guests on this podcast. I mean, I think I find this where you have a lot of women who ... and it's not that they don't have the confidence themselves just, I think they think that maybe they're not enough of a voice to lend to the community.
And again, I can find men who have no idea what they're talking about that are willing to talk about anything. And I don't mean to criticize. I mean, all the guests on the show have been great, but there are a lot of people that I would love to hear from. And I think there are a lot of people that the community and the tech community needs to hear from. And the culture and whatever else it is, hundreds or thousands of years of history, has suppressed some of those voices. And that I think is a tragedy that needs to be corrected. So when you have more people like you speaking out then it can inspire other people and like you said, this idea of saying, look, it doesn't matter what your background is, it doesn't matter what college you went to, it doesn't matter if you went to college at all, do you have the attitude? Do you have that drive? Are you willing to put in those hours and learn that stuff?
And then you get to that point where if you share those things, yes, sometimes you get criticism. But I mean, I always found whenever I share something, I try really, really hard to make sure it's right. I do more research. I dig in deeper. And then when you start getting criticism or congratulations or whatever it is, you get feedback and feedback is always good. It makes you a better person. So I'll say this again, thank you so much for everything that you do and I really appreciate you being on the show and sharing all this stuff on big data. So if anybody wants to get in contact with you, how would they do that?
Lynn: Twitter is probably the best way. I'm really changing kind of my professional profile. For many years I was an international speaker and I'm stopping travel for personal reasons and just because it's a good time to stop travel. So I won't be doing any talks other than near where I live in Minneapolis, Minnesota now, but I'm going to be writing a lot. So yeah. And then, of course, I have 30 courses on LinkedIn Learning, so if you want to listen to me there, I have lots of courses and GitHub. Again, learn, build, teach. Whenever possible, when I learn something, when I'm building something for a customer, as long as I can make a generic version or whatever, I will put it on GitHub.
In particular, I have a course on GitHub called GCP for Bioinformatics where I took my LinkedIn Learning course, GCP Essentials, which is an introductory course and I just converted all the examples to bioinformatic data and bioinformatic examples. I would love to get more collaborators on that. I would love more input on that. It's GitHub so you can do pull requests or whatever. And the course is about 50 markdown pages. It's like a quick graph basically: what is the service? Why would you use it? How do you use it? It focuses on the console and then quick short screencasts, like five-minute screencasts. So the idea is if you're a researcher, you can learn how to use the cloud with some guardrails of understanding cost and everything just by going to this repository. And I want it to be useful so I've gotten some feedback, but I would love to make that course even better.
Jeremy: Awesome. And then lynnlangit.com as well, right?
Jeremy: Okay. Awesome. And that's L-Y-N-N-L-A-N-G-I-T.com.
Lynn: That's right.
Jeremy: Awesome. Well, thanks again. I will make sure that we get all this stuff into the show notes.
Lynn: Thank you so much.
THIS EPISODE IS SPONSORED BY: Stackery