SDS PODCAST EPISODE 215 WITH BRIAN DOWE
Transcript of SDS Podcast Episode 215 with full-stack web developer Brian Dowe.
Kirill Eremenko: This is episode number 215 with a full-stack web
developer Brian Dowe.
Kirill Eremenko: Welcome to the SuperDataScience Podcast. My name
is Kirill Eremenko, Data Science Coach and lifestyle
entrepreneur, and each week we bring you inspiring
people and ideas to help you build your successful
career in data science. Thanks for being here today
and now let's make the complex simple.
Kirill Eremenko: Welcome back to the SuperDataScience Podcast, ladies
and gentlemen, very excited to have you on the show
today. Today we've got an aspiring data scientist and
full-stack web developer Brian Dowe joining us, and I
literally just got off the call with Brian a few hours ago
and what I can say about this episode is it's very
inspiring, especially if you are a web developer yourself
or a developer of any kind yourself. You will find a lot
of useful tips and insights in this episode.
Kirill Eremenko: If you're not a developer, you will also find a lot of
useful tips. But I personally found Brian's
example of how he structured his career very
insightful and something to admire and kind of like
dissect, and that's exactly what we did in the podcast.
So you'll find out ... like we touched on three main
things. So first of all we talked about what it's like to
go from a developer to data scientist or in fact how to
integrate data science in your career if you are a
developer, if you work on web development or any kind
of development, how to integrate data science in your
career. In fact, I think anybody looking to integrate
data science will find a lot of these tips useful.
Kirill Eremenko: Then we talked about models. We talked about
developing models, deploying models in business and
maintaining models. In fact, we're going to dissect the
whole case study of a recent model that Brian has
worked on using the Apriori algorithm, and you will
hear some back and forth between us about the
development, the deployment, and the maintenance
life cycle. We'll actually come up with some ideas on
the podcast, which you might find quite interesting in
terms of brainstorming and how to think about
modeling. And in general you'll get a lot of takeaways
about modeling.
Kirill Eremenko: And finally, we talked about getting into the space of
data science: what it's like to learn data
science, what the challenges in the learning curve are
that Brian has been facing. He's been learning data
science since the start of this year, so for almost a year,
plus or minus a couple of months. And you'll also hear
about the tools, some of the tool recommendations
that Brian has for you if you're just starting out into
data science.
Kirill Eremenko: So there we go. This podcast is quite interesting in terms
of the three pillars that we discussed. You'll get a lot of
value from it, and without further ado, I bring to you
Brian Dowe, full-stack developer and aspiring data
scientist.
Kirill Eremenko: Welcome to the SuperDataScience Podcast, ladies and
gentlemen, today we've got a very exciting guest on the
show. Brian Dowe calling in from San Mateo,
California. Brian, how are you going today?
Brian Dowe: I'm doing good Kirill, how are you?
Kirill Eremenko: I'm doing very well. Thank you very much. It was such
a pleasure to catch up at DataScienceGo mate, it was
exciting to hear your story, I can't wait to share it with
our audience today. But first off, how did you feel at
the event and how do you feel after it?
Brian Dowe: It was really an incredible experience. There were so
many fantastic presenters. There were a lot of people
that I sort of networked with and talked to outside of
the sessions as well, and got a lot just all around out
of everyone I interacted with. It was an amazing
experience, and I learned so much and it's given me a
lot of great jumping off points moving forward.
Kirill Eremenko: Yeah, thanks. That's really great to hear. And just
before the podcast, you mentioned you've already
made this tilt or shift in your
trajectory: you started doing stuff on Kaggle after
DataScienceGo, you started the Andrew Ng machine
learning course. Quite a few things have happened for
you. What would you say has been the biggest shift
after attending DataScienceGo 2018?
Brian Dowe: I think that before I attended I had had some
experience with applying machine learning models in
sort of an easier setting where the dataset is pre-
prepared for you and it's pretty clean, and you just get
to go right to the modeling. And so one thing I wanted
to do following up from DataScienceGo is push my
knowledge a little bit further. So I actually went
through and did like the derivations and the calculus
for gradient descent and that made a huge difference
for me in just understanding what's going on under
the hood.
Brian Dowe: And so even if ... it can be more convenient to use
libraries like scikit-learn or TensorFlow to do projects,
it's still really helpful to understand what's going on
behind the scenes so that you can work with those
tools more efficiently. So that was really ... yeah, that
was really huge for me.
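For readers who want a feel for the "under the hood" work Brian describes, here is a minimal, illustrative sketch of batch gradient descent for simple linear regression. The data and learning rate are invented for the example; this is not Brian's actual derivation or code.

```python
# Minimal batch gradient descent for simple linear regression.
# Model: y_hat = w * x + b, loss: mean squared error (MSE).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b = 0.0, 0.0
lr = 0.05  # learning rate
n = len(xs)

for _ in range(2000):
    # Partial derivatives of the MSE with respect to w and b.
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2))  # converges to 2.0 1.0
```

Libraries like scikit-learn hide exactly this loop behind a `.fit()` call, which is why working the derivatives out once by hand, as Brian did, makes those tools much less opaque.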
Kirill Eremenko: Awesome.
Brian Dowe: So with Kaggle, I've been going through just some of
the datasets that they have posted and trying to make
some like predictions even with just like basic models,
but that's given me some good practice in data pre-
processing and organizing a sort of not clean dataset
into something that can be fed into a model and that's
been some really valuable experience for me.
Kirill Eremenko: Fantastic. Great to hear. Amazing. One more thing on
DataScienceGo, I was curious about this, what was
your favorite talk? Who was the speaker who gave
your top talk?
Brian Dowe: I think one of my favorite ones was the one by Gabriela de
Queiroz from IBM. Her talk on deploying, I think it
was deploying machine learning models in five
minutes or something like that-
Kirill Eremenko: Deep learning. I think it was deep learning models in
five minutes.
Brian Dowe: Yeah, exactly. And she showed us the Model Asset
Exchange that you could use to sort of find a bunch of
prebuilt and pre-trained models and start deploying
them very quickly. And I thought that was really
interesting and walked away from that with some
items on my to do list to go through a lot of resources.
I really enjoyed her talk a great deal.
Kirill Eremenko: Fantastic. Well that's very exciting. Well Brian, very
cool to have you on the show and one of the reasons is
because I am excited, super excited about your career
path. I think you have a very inspiring journey that
you've created for yourself, and I would love to share it
with our listeners. And what I mean here for our
listeners is that what you need to know about Brian is
that he is a full-stack developer, web developer, and
we'll talk more about that in a second. And Brian sees
the value of data science, sees the value of machine
learning and deep learning and is actively applying
that in his career.
Kirill Eremenko: And I know that across our audience, across the
listeners of our podcast, a very large percentage of you
guys, I don't know, maybe 30, 40 percent, that's my
rough estimate, it might be even more, are people who
are developers who are also looking to get into data
science or are already in data science, have
transitioned into data science or have seen the value of
data science. And I think it is ... this is going to be an
inspiring story for you guys to model in your careers.
But even if you're not a developer, the steps that Brian
has taken to integrate data science in his career
without fully jumping straight into it and quitting his
job and just going data science, data science, is quite
inspiring. So I think that's going to be cool. So I'm very
excited to dig into it. How are you feeling about this
Brian?
Brian Dowe: I'm really excited. Yeah. I'm excited to dig into this as
well.
Kirill Eremenko: Awesome. Some good self reflection opportunity for
you, I guess.
Brian Dowe: Yeah.
Kirill Eremenko: Awesome. Okay. So tell us a bit about you Brian. You
are a web developer, a full-stack web developer. If
somebody off the street were to ask you, Brian, what is
it that you do? What does a full-stack web developer do?
How would you answer that question?
Brian Dowe: Sure. So generally when people ask me that because
it's come up a decent amount in conversation, I'll say
that full-stack is a combination of front end and back
end technologies. And in my specific case, I work for
a company called Education.com, it's a web
application. And so for web applications, front end
usually has to do with building templates. So all the
HTML and CSS, the parts of the site that users see
visually, and also some JavaScript for interactive
components. And the backend side is like the database
and APIs that interact with your database and grab
data to display to the end user. So the front end would
be like what you see when you look at a webpage, and
the back end would be like what happens behind the
scenes when you click a button to submit a form, like
where's that data going? What is it doing? Most of that
is handled by back end technology.
Kirill Eremenko: Gotcha. How long have you been with education.com?
Brian Dowe: It's been about nine months now, so not very long. I
interned for four months, and I've been full-time for
about roughly the last like five and a half months.
Kirill Eremenko: Okay. Gotcha. Tell us what attracted you to data
science. Like why are you on this podcast, how did you
get into this, hear about data science, and what
next steps did you take from there?
Brian Dowe: I actually first learned about data science, Kirill,
through your course Machine Learning A-Z. And I
started doing it maybe about two or three months after
I had started working for education.com. And prior to
that I had built some applications, just to sort of teach
myself, that had like very simple data
components, like maybe just a user database and
maybe like the ability to make blog posts or write
reviews, etc.
Brian Dowe: And when I stumbled across machine learning A to Z, I
had heard of the field, I didn't really know too much
about it. And I remember watching your introductory
video where it kinda goes over a lot of the applications
and use cases of machine learning. And I thought,
oh, this is really cool, this provides a really interesting
way to look at data, gain insights from it and improve
the actions that you can take on the basis of that. And
so that was really interesting to me, and I think as I
progressed through the course, I found it to be more
accessible and something that I could jump into even
though I didn't have too much experience. And yeah, it
just kind of progressed from there. And my interest in
it has only grown over time.
Kirill Eremenko: Interesting. Let's rewind a little bit. Tell me this, how
did you stumble across the Machine Learning A-Z
course? What were you searching online for? Obviously
there was some kind of a need that you were trying to
fulfill when you saw it. Like people don't normally
stumble upon Machine Learning A-Z unless they're
actually looking for something related to data science.
Like what was the initial trigger for that to happen?
Brian Dowe: I had been using Udemy for a long time before that for
projects that were [inaudible 00:12:25] specifically to
data science. So I mentioned before how I was trying to
get started by building a simple application to develop
my development skills basically. I had taken some
courses at that time on building a clone of Yelp with
Ruby on Rails, and then I found one on building a price
alerts app with Python and Flask. And so throughout
the course of my learning, my Udemy feed or the
courses that popped to the top had to do a lot with app
development and with programming in general and
with Python also, specifically over time. And then I
think it was through that your course just sort of
popped up on my list.
Kirill Eremenko: Okay. Like you got into machine learning and data
science because machine learning and data science got
you in there in the first place. That's what it sounds
like.
Brian Dowe: Yeah. Sort of, yeah.
Kirill Eremenko: The circle has closed, right? We've gone full circle, this
is so cool, right? Like you're studying exactly the stuff
that has influenced your career to study that stuff.
This is like Inception level type of thinking, right? Have
you ever thought of it that way?
Brian Dowe: I haven't, but that's really interesting now that you put
it that way.
Kirill Eremenko: Oh wow. That's so cool. That's so cool. One of the
craziest stories. Okay. So you got into machine
learning and you started taking the course and at the
same time, how were you able to apply this at work?
Like you're a full-stack developer, and this is where
the interesting stuff starts to come in because I know
your story a little bit already. How were you able to
take that into ... as a developer, if I'm a full-stack
developer, I might be a bit like shy, or it might
not even come to mind that I can bring this stuff,
machine learning, to work. It doesn't apply. It's
completely unrelated to my role. So how did you go
about that?
Brian Dowe: That's a really interesting question and I definitely did
feel that like not knowing exactly where to start, where
the right opportunity was and it happened in a very
roundabout and sort of like by chance way. Our
company, we try to hold at least one hack week every
year. And for those who might not be familiar with
what that is, it's generally where everyone can sort of
come up with their own projects that they want to
build outside of the development pipeline and then
people can team up and then just try to build whatever
they want and then we all present to each other at the
end of the week. And by the time hack week came
about, which was about like May of this year, I had
been studying machine learning maybe for like two to
three months.
Brian Dowe: I had talked ... like briefly mentioned it to some people
at work in passing. And then when hack week came
about, one of my colleagues approached me and he
wanted to build a recommendation system using
machine learning. And that was basically the extent of
the background for it. And, yeah. So I remember
sitting down with him sort of trying to brainstorm
what to do ... before I get too much deeper into that,
this hack week was really the catalyst for a lot of the
professional application of machine learning in my
workplace, just because it provided an
opportunity to just build a project that you're
passionate about without any other restrictions. I
think looking back, that really empowered me to start
bringing up ideas and trying to make things work
outside of that context and just do it as part of our
normal pipeline of projects.
Brian Dowe: But if I could, looking back, make a recommendation
to someone in a similar position, it would be to not
hesitate to just dive in and start trying to find
problems to solve. Because I think a lot of full-stack
developers have access to a wide breadth of data
because you have to work with it when you build your
applications. And so I think just sort of diving into the
data that your company has and then trying to figure
out, okay, what's a way that I could use this? Like
what's something that I could predict based on certain
features or what value can I extract from this that
could be presented to the end user in some way.
Brian Dowe: I think my recommendation would be to just dive in,
start thinking about problems to solve, and when you
have something or an idea of how to go, odds are that
the other developers you work with will be intrigued
and curious about it if you have an idea that
could potentially bring value to the company. I think
like trying to empower yourself and just dive in and
start looking for problems to solve is one of the best
ways to get started.
Kirill Eremenko: I totally agree with that. And I just wanted to mention
two points here. First one is that, you also obviously
need to be careful as a developer. It depends on
company to company, but often you do have potential
access to data, but for instance, in Facebook, if you
use it for the wrong reasons, you'll probably get fired.
So kind of like, be sensible about that and maybe
confirm with your boss if you can use that data. Or
maybe, if that's a bit too early to do,
create some dummy datasets that are similar or
have similar columns in terms of structure to the
company dataset, but like play around or download
some datasets in your free time to play around with
before you actually play around with real data from your
customer base, especially if it's sensitive data. That's
one comment I just wanted to make.
Kirill Eremenko: But in general, indeed, like even if your company
doesn't have these hack weeks or hackathons, which I
think are very useful, but they usually take
place in large organizations that want to inspire
innovation. So if your company doesn't have that yet,
maybe you can talk to the management to start
introducing it. But even if that's not the case, you can
still play around with data or dummy data or similar
data and see how you could potentially bring value to
the business. And you can still bring those results and
those suggestions to your management or to the
company leadership and present it to them. Every
company in this day and age wants to be data driven
or model driven. They will jump on top of it if you can
add value to the bottom line of the company. It is a
very rare instance that management turns down those
ideas, especially when it's concerning technology and
data science, where the investment isn't that high but
the return on investment can be massive. Very rarely
do you get cases where management turns down those
innovations.
Kirill Eremenko: So as long as you're proactive, don't let the absence of
a hack week or a hackathon get in your way. Or if, for
instance, your hack week is 11 months away, it
happens once a year. If it's 11 months away, don't
wait, you can already do it now. Those are just some of
my thoughts on that. Brian, you were talking about
how you were preparing for this hack week, and a colleague
of yours came up to you and then you got into this project.
What was this whole idea? You said you were going to
dive deeper into that topic.
Brian Dowe: Sure. So the project was a recommendation system for
our content. So just maybe to give a little bit of
background information on what we do in
education.com, we're a platform for parents and
teachers to come and find resources and teaching tools
to help their kids. And so when I say like a
recommendation system for our content, I mean we
have just a bunch of worksheets and games and
lesson plans and all of this stuff that parents and
teachers can come and use.
Brian Dowe: So what we wanted to do is create a way or improve
the way that we recommend other resources that you
can use based on what you're viewing. So if you like go
to our website and click on a worksheet, you'll see at
the bottom, here are five other worksheets that are
related to this one that we would recommend for you.
And so, yeah, so just to give a little bit of background
on that.
Brian Dowe: But anyway, so a colleague approached me, his name
is Yon Burke, a brilliant engineer. So he approached
me wanting to build this recommender and as we were
sitting down and sort of talking about some
different angles that we could go about this with,
like what data would we use, like how I'd rearrange it,
like what model would be needed? He brought up
something, he said, "What if we're looking at this
problem the wrong way? What if this problem is 'people
who downloaded this also downloaded this'?" Or like,
what if that's the way that we're going to look at it.
Brian Dowe: And as soon as he said that, a light bulb in my brain
went off and I thought, I know that I learned of an
algorithm in Machine Learning A-Z that's perfect for
solving this exact type of problem. So I went through
and I looked through all my notes and it turned out
that that algorithm was the Apriori algorithm. I know
that I made that connection because the example given
was using that algorithm for a grocery store to try and
figure out where to place their products in relation to
each other based on what products most users or
most customers would buy together. And so I
remember it just ... like a light bulb moment went off
and I made that connection, and then I knew that was
the place where we had to start based on the way that
we had scoped the problem now.
Brian Dowe: So what I did was I went back and I looked at the
slides from Machine Learning A-Z from that section,
and in those slides, Kirill, you went over the equations
for support, confidence and lift for that algorithm. And
then using those equations, I scripted, or I just
converted that into code basically in Python, and then
set up the structure to loop through all of our data
and gather the information that we needed.
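For reference, the three quantities Brian mentions can be computed directly from download histories with a few lines of Python. This is a hedged sketch, not Education.com's actual code; the transaction data and item names are invented:

```python
# Support, confidence and lift, the core metrics of the Apriori algorithm.
# Each "transaction" is the set of worksheets one user downloaded.
transactions = [
    {"mult_2digit", "div_2digit"},
    {"mult_2digit", "div_2digit", "fractions"},
    {"mult_2digit", "fractions"},
    {"div_2digit"},
    {"mult_2digit", "div_2digit"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(both) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline popularity."""
    return confidence(antecedent, consequent) / support(consequent)

a, c = {"mult_2digit"}, {"div_2digit"}
print(round(support(a | c), 4))    # 0.6    (3 of 5 users downloaded both)
print(round(confidence(a, c), 4))  # 0.75   (0.6 / 0.8)
print(round(lift(a, c), 4))        # 0.9375 (0.75 / 0.8)
```

A lift above 1 means the pairing occurs more often than chance would predict; in a recommender like the one Brian describes, you would rank candidate items for each worksheet by metrics like these.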
Brian Dowe: And Yon, the engineer I was working on this with, did
the pre-processing portion and then got me the
dataset. And then we started testing and trying to
figure out if it was going to work, if it was gonna give
back good predictions, because one of the attributes of
this algorithm that made it a little challenging to figure
out whether it was going to produce good results is
that we didn't get the immediate feedback on whether
it's a correct or incorrect estimate, for example, like
you would with supervised machine learning
algorithms.
Brian Dowe: So we had to look at some of the results by hand and
figure out, okay, is this producing the results that we
wanted. And what we were specifically looking for was
to have sort of more loosely defined recommendations
than we currently had for a given worksheet, because
let's say you looked up like a two digit multiplication
worksheet on our site, what we had before, it would
have just given you like three or five [inaudible
00:24:28] multiplication worksheets.
Brian Dowe: And we were thinking, okay, like how useful is that
really for someone who is coming to our site and
looking for another jumping off point. Like if they find
a worksheet, are they just going to want to find more
worksheets that are exactly the same? Probably not.
What might be more useful is like maybe they're
looking for like two digit division, something that's
related to what they're on, but not exactly the same.
And so we were hoping that by using user download
histories, instead of just like content tagging or like
what the exact subject was, that we would get some
better results that would actually be more beneficial to
users.
Brian Dowe: And as we went through, that turned out to be the
case and we saw that we were getting the
recommendations that we were looking for. And then
after we presented at the end of the week, it started
picking up steam and enthusiasm and then we
decided to run it on the site as an A/B test. And after
running it for a while, it turned out that it won against
our former recommendation system, and then we
pushed it to production and it became our first
machine learning model that was deployed on the live
site. So that was a huge win and a really exciting first
step in implementing machine learning. But the first of
many is what I'm very much hoping for, and I'm striving
to make the case that this is just the first step on a
longer journey to integrate more machine learning
into our operations. And so yeah, it's very exciting
stuff.
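A hedged sketch of how an A/B test winner like this might be checked for statistical significance, using a standard two-proportion z-test. The conversion numbers below are invented for illustration, not Education.com's actual results:

```python
# Two-proportion z-test: did the new recommender beat the old one?
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z-statistic and two-sided p-value for two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Invented numbers: clicks on recommendations out of page views,
# old system (a) versus new Apriori-based system (b).
z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
print(round(z, 2), round(p, 4))
```

With these made-up numbers the p-value comes out around 0.01, i.e. the lift would be unlikely to be noise; in practice libraries like statsmodels provide this test ready-made.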
Kirill Eremenko: Wow, that's awesome. Congratulations. That's a huge
project and most importantly that it actually got
implemented into the business and added value, that's
just really cool to hear. And just to recap, so let me
know if I'm getting this right. So before, the
recommender system was recommending more content
to your users based on the tagging. So for instance,
you have these multiplication sheets and maybe
tutorials, maybe some videos, some other content;
you have all sorts of content and they're all tagged, like
this is multiplication, this is for first grade, this is for
fourth grade, this is math, this is this other topic and
so on. And based on the tagging, the similarity of
tags, it would recommend something.
Kirill Eremenko: Whereas your new recommender system, which works
through the Apriori algorithm, would look at not just
the tagging, but at what users are
actually downloading. So if people on average
download X, what did they on average download after
that, or what did most of them download after that? So
you're looking at the behavior of your users and based
on that you're saying, okay, well even though it's not
tagged identically, looks like that's what people want,
that's what people are after, and we're gonna use that
as a suggestion. And so that Apriori algorithm
approach worked better than your previous
recommender method. Is that about right?
Brian Dowe: Yeah, that's essentially the gist of it.
Kirill Eremenko: Gotcha. Sounds a lot like Amazon. Like you go on
Amazon and you buy something and then out of the
blue they recommend something else, it might not be
related at all, but that's because most people kind of
like after they bought that one thing, they usually go
searching for that other thing; the people that are
similar to you, I guess.
Brian Dowe: Yeah.
Kirill Eremenko: Gotcha. And then after that you did a qualitative
assessment, not a quantitative assessment, as you
mentioned, because you didn't have, like with
supervised learning algorithms, you couldn't say yes,
no, correct or incorrect. In this case, you did a
qualitative assessment where you just went in and you
tested out a few of those recommendations to see if
the recommendation actually made sense to
you. Like the example you gave, like a three by three
multiplication table instead of getting a four by four
one, you get a division table, which kinda like makes
sense. Is that what you did next?
Brian Dowe: Yeah, that's essentially what we did. We were looking
for things that were like not exactly the same but still
would likely be within the same subject area, that
might be closely related, for example, yeah, like that
example that I gave hoping that users would be maybe
studying those two subjects at the same time and they
could find things that would be more directly related to
like where they would go next from a given subject or
something that they might be also practicing at the
same time.
Kirill Eremenko: And so how long has this system been in place now?
This recommender algorithm?
Brian Dowe: It's been a couple months now that it's been up. The
hack week was in May, and after that there were some
additional steps that needed to be taken to set it
up and running with our whole dataset, and that took
some time. But yeah, for a couple months now it's
been up and running and live.
Kirill Eremenko: How are the results? Is the management
seeing some positive impacts? Are
they happy with what this change
has brought?
Brian Dowe: Yeah, it has brought a positive change, but we're
definitely also looking to see how we can improve it
because this was just sort of like a first run. Like it did
win the A/B test, but we were still looking for ways to
improve and to provide more valuable
recommendations to our users. And I think, yeah, that
this was a good first step to at least identify resources
that are being used and consumed most often. But I
think there's so many places that this could go and
there's so much that we could still do to improve our
recommendation capabilities and that's something
that we're diving into right now.
Kirill Eremenko: That's so cool. And I can just imagine the boost of
morale that you got from that, like when your
algorithm got implemented, how did that feel?
Brian Dowe: It was incredible. I think one big takeaway that I had
from that is that when I was first studying machine
learning, sort of leading up to that, and I think still a
little bit following from the implementation of that
algorithm, my focus was, I wanted to learn all these
like cool, cutting edge tools like neural nets and like
GANs and stuff like that, like some of the more
advanced modeling tools. But the process of actually
sitting down, looking at what data we had available
and trying to choose a model that fits best with that
sort of showed me that you have to start from the
problem and then try and find the model that best fits
the solution, rather than just picking a cool model that
you have in mind that you really want to work with,
like a CNN, and then searching for a problem that fits
that solution.
Brian Dowe: I think that can be like a great way to learn about a
new technology, like if you're focused on learning about
like CNNs and you look for problems that specifically fit
that use case, that's great for learning. But in practice,
you don't always get to choose the problems that come
to you, sometimes there's just a problem at hand that
needs to be solved and you need to explore your toolkit
and just choose whatever is best fitted for that
solution. Like sometimes I think I've seen cases, or at
least vaguely read about cases, where a bunch of
models were tried and a simple linear regression or a
simple logistic regression was the best outcome. And
sometimes that's the case.
Brian Dowe: Sometimes the model that you initially had in mind,
that you were focused on, isn't the best tool for the
problem, and something that maybe is a little bit
simpler and more tightly scoped is. And so I think that's something
that I thought about a lot is it doesn't have to be like
the craziest algorithm out there to make a really big
impact. And so that's something that I think about
going forward is just starting from the problem first
and then searching for the tool that solves that after
the fact.
Kirill Eremenko: Very, very wise words. Totally agree with that. It's very
easy to get carried away with machine learning and
all of these new shiny cool things. And that's awesome
to learn them and they're very good to inspire you to
learn and to grow and to find like these different
nonstandard applications, but at the same time,
sometimes less is more. You just go for what solves a
problem and does that efficiently. And in your case it
was the Apriori, which is awesome to hear.
Kirill Eremenko: Another thing I wanted to ask you on this topic was
about the integration of the algorithm into your
company's services. I think this is a very cool area that a
lot of the time is missed by data scientists: that you not
only have to develop an algorithm, but you have to
also operationalize it. You have to make it work in the
company so you have to somehow put it into
production, it has to integrate with the website and all
those things. So I think that'll be really cool to talk
about that. Are you able to share some details with us
on how you went about it and what challenges you
faced along the way?
Brian Dowe: Yeah, I can share some details about that. A lot of
the bulk of the work to actually set this up on
production was handled by my colleague, who has like
a lot more familiarity with the way that these systems
work. A big challenge was just being able to run all of
this data, because we have so much data in user
download histories from like all of our users and all of
the things they downloaded. It gets pretty big pretty
quickly, and so knowledge of cloud computing
services became really helpful in this case. And that's
what we were able to use to do this. And so
one thing, a challenge that came along with that,
is trying to decide when and how often the model would
need to be run. Because I think that's a question
that is really important, is like, do you need to make
actual calculations with this model on the fly, or can
you run it like every so often and use the results or
store the results from that to display to the end user?
Brian Dowe: And so we ended up going with the second approach of
just running our algorithm every so often to gather
this information: for a given worksheet or game that a
user downloaded or played, figuring out the top
worksheets that are associated with it, which is the
main task the algorithm was accomplishing. We
decided that it would be much more efficient for us to
just run this every so often and store the results, so
that when a user visits the worksheet, we can just pull
it directly from our database without having to
actually run the calculation on the fly.
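A minimal sketch of the batch approach Brian describes: count which items are downloaded together, then store each item's top associations so the website can read them from a database instead of computing on the fly. The data, names, and top-N cutoff here are invented for illustration, not Education.com's actual code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical download histories: one set of worksheet IDs per user.
download_histories = [
    {"fractions_1", "fractions_2", "decimals_1"},
    {"fractions_1", "fractions_2"},
    {"fractions_1", "decimals_1"},
]

# Count how often each pair of worksheets is downloaded together.
pair_counts = Counter()
for history in download_histories:
    for a, b in combinations(sorted(history), 2):
        pair_counts[(a, b)] += 1

# For each worksheet, keep its top co-downloaded worksheets; in
# production these lists would be written to a database and served
# directly when a user visits a worksheet page.
TOP_N = 2
recommendations = {}
for (a, b), count in pair_counts.items():
    recommendations.setdefault(a, []).append((count, b))
    recommendations.setdefault(b, []).append((count, a))
for item, scored in recommendations.items():
    scored.sort(reverse=True)
    recommendations[item] = [name for _, name in scored[:TOP_N]]
```

A real Apriori implementation would also prune by minimum support and confidence; this sketch only shows the precompute-and-store pattern.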
Brian Dowe: And so I think that's something that can vary from use
case to use case and model to model, depending on
how you're trying to apply it. Maybe you will need to
actually run a prediction or a calculation on the fly
when a user clicks a button, for example. Whether or
not that's a feasible place to implement your model, I
think, depends on the weight of the calculation that it
needs to make. If it needed to make a calculation that
required a lot of processing time but the user needs
the result immediately, then that might not be a
feasible way to go. Versus if it's something that might
not change drastically from day to day or week to
week, then you could just run your algorithm every so
often and store the results within a structure that you
already have set up to store data, for example.
Brian Dowe: So I think it depends on the use case, it depends on
what you're trying to do and how often the results of
the model need to be updated. And I think yeah, there
are definitely ways to think strategically to integrate
systems like this without running massive programs
like every time a user clicks a button for example.
Kirill Eremenko: Okay, gotcha. Very, very cool advice as well that
there are different types of models, and I hope people
are taking notice of this: sometimes you might need
to run them on the fly and get results all the time,
sometimes it's enough to run them occasionally. And
especially big companies with lots of data have time
slots for running things on their servers. Like I was at
a company after leaving Deloitte where they would run
things at night, and the whole night was split into
periods, usually 15-minute chunks or maybe even
shorter, where you had to apply to get server time.
Now more and more things are going to the cloud, but
still a lot of companies do their calculations on
premise.
Kirill Eremenko: And so, in order to run all these models, do all the
calculations and update all the data, especially in
financial services, for example, where you've got to
turn over all the things that happened during the day,
make sure everything is accurate and reconcile a lot
of stuff, all of that happens at night, and you want to
be very mindful of how you're using server time in
your company. But even if it's in the cloud, every time
you rerun the model it's still going to be a cost, it's
still going to take some kind of budget. So being
conscious about whether you need it right away or not
is quite an important thing. Now I'm going to ask you
a question to take us into a bit of a different space:
have you taken the Data Science A to Z course?
Brian Dowe: No, I don't think I have actually.
Kirill Eremenko: Okay. So the reason I ask is because in the Data
Science A to Z course we talk about model
maintenance. And I just wanted to see ... obviously
this has been a big success and breakthrough for both
you and the company in terms of developing and
deploying this Apriori model. What are your thoughts
on how you're going to maintain it? The reason I ask
is because I've seen situations where models
deteriorate. In fact, I've seen a situation where a
model used to be very effective, it would bring like 80
percent accuracy, but over a period of 18 months it
deteriorated to a level where the accuracy was less
than 50 percent, like 42 percent or something.
Meaning it would have been more efficient for the
company to just flip a coin and base the
recommendations on that rather than on a model. So
I'm just curious, what are your thoughts on how to
maintain this model?
Brian Dowe: Yeah, that's a really interesting question. I think to a
certain extent this can depend on the domain that
you're in. What I mean by that is that different
domains have a different pace to how their data shifts
structurally over time. For us in the education space
at Education.com, that means we follow the cycle of a
school year. So for example, for a given parent or
teacher, based on the time of year and when they
signed up, maybe they sign up in August and then
from August through June they're using a bunch of
second grade resources. And then, for a parent, maybe
their child moves on to the next grade and they're
going to be consuming a bunch of third grade
resources.
Brian Dowe: So, if we were to just continually update the model
based on new data coming in and not deal with
eliminating old data, it would undoubtedly begin to
grow stale over time, in the sense that you'd now be
making predictions based on users who had been with
the platform for several years. And so you might not
get the best recommendation for a second grade
worksheet, because you'd be pooling download
histories from a bunch of different grades as well. So
for us, something that's already implemented as part
of the algorithm is to keep it seasonal and to try to
look only at data that's relevant to the time period
that we're in relative to the school year.
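The seasonal windowing Brian describes could be sketched like this, restricting training data to the current school year. The August-through-June boundary, the function names, and the records are all hypothetical simplifications for illustration.

```python
from datetime import date

def school_year_window(today):
    """Return the start and end dates of the school year containing
    `today`, assuming (as a simplification) an August-June year."""
    if today.month >= 8:  # Aug-Dec: the school year began this August
        return date(today.year, 8, 1), date(today.year + 1, 6, 30)
    return date(today.year - 1, 8, 1), date(today.year, 6, 30)

def seasonal_downloads(downloads, today):
    """Keep only downloads from the current school year, so the model
    isn't trained on stale data from earlier grades."""
    start, end = school_year_window(today)
    return [d for d in downloads if start <= d["date"] <= end]

downloads = [
    {"worksheet": "grade2_math", "date": date(2018, 9, 10)},
    {"worksheet": "grade1_math", "date": date(2017, 3, 5)},  # stale
]
current = seasonal_downloads(downloads, date(2018, 11, 1))
```

The same idea generalizes: swap `school_year_window` for whatever seasonal boundary fits the domain.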
Brian Dowe: And so I think thinking about things like that ... for
us right now, grade is the main way that we look at
this, but I think it could be interesting to dive in and
see how the subjects covered might change over time.
For now, grade and school year are the main data we
have to go on, because there are a lot of different
curriculums out there and our site is used very
widely, so it's a challenge to find a way to update your
algorithm in a way that meets the needs of all the
users on your platform. School year and grade just
happen to be one dimension that's pretty standardized
across our users, but trying to push that further and
make it even more relevant to what a given user might
be looking for is definitely a challenge that we're
going to have to rise up to and meet over time. A
really challenging problem.
Kirill Eremenko: Okay. Gotcha. That's a very cool consideration about
the seasonality. If I may, I'd like to give you another
suggestion, is that okay?
Brian Dowe: Sure. Yeah.
Kirill Eremenko: So one thing I was thinking about is that with your
model, you could measure how well it makes those
recommendations. You could say, all right, we made
all these recommendations; in what percentage of
cases was the recommended content actually
consumed by the user to whom it was recommended,
or consumed within a relevant timeframe? Maybe they
didn't consume it right away, maybe they had to think
about it, but within a week or a month they did
indeed consume the content. Because if you have the
right data points in place, you can actually collect
that information: we recommended content Z, and
within the week they either did or didn't use content
Z. With that kind of yes/no approach you can see
that, at this stage, your model has, I don't know,
maybe a 75 percent accuracy rate, meaning 75
percent of the content you recommend is indeed
consumed by the users.
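The acceptance-rate metric Kirill proposes might be sketched like this; the event structure and the one-week window are hypothetical choices for illustration.

```python
from datetime import date, timedelta

def acceptance_rate(events, window_days=7):
    """Fraction of recommendations the user actually consumed within
    `window_days` of being shown the recommendation. Each event is
    a pair (recommended_on, consumed_on_or_None)."""
    if not events:
        return 0.0
    hits = 0
    for recommended_on, consumed_on in events:
        deadline = recommended_on + timedelta(days=window_days)
        if consumed_on is not None and recommended_on <= consumed_on <= deadline:
            hits += 1
    return hits / len(events)

events = [
    (date(2018, 9, 1), date(2018, 9, 3)),   # consumed within a week
    (date(2018, 9, 1), None),               # never consumed
    (date(2018, 9, 1), date(2018, 10, 1)),  # consumed, but too late
    (date(2018, 9, 2), date(2018, 9, 2)),   # consumed the same day
]
rate = acceptance_rate(events)
```

Logged month by month, this single number is what the monitoring discussion below turns on.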
Kirill Eremenko: And then I would set up a system that would track
that, even autonomously, and say, all right, how is
our model going month to month? Maybe you'll see
some seasonality in that, which you could possibly
explain, but maybe with time you will see, oh, that's
interesting, before it was 75 percent, now it's dropped
to 74, now it's 69, now it's down to 65. If you see a
consistent downward trend, that means something's
going on with your model. And possibly that could be
a shift, not just a seasonal shift, but a shift in the
demographics of your population. Maybe a change in
some kind of legislation around what students have to
learn or shouldn't learn. Maybe a change in the
content available on your platform or outside it, or
some influence from competitors or advertising
agencies drawing people to other stuff, things that
your model couldn't possibly have taken into account
at the time you created it, because that event or that
environment was not there at the time.
Kirill Eremenko: And so I feel that dynamic, constant tracking through
time is important, because it allows you to fish out
these changes that you cannot otherwise see, right?
You need a flag, some kind of indicator, that
something is going on in the industry, in the platform,
in the business or in the audience that we need to
address, so that we can either retrain the model,
rebuild it, or so on. Because if you only find out once
it's already off track, it can actually be a bit too late.
What do you think about that?
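One simple way to turn this monthly tracking into an automatic flag is to watch for a sustained decline in the metric. The streak length and the example series below are hypothetical; a real monitoring rule would likely be statistical rather than a fixed threshold.

```python
def drifting(monthly_rates, streak=3):
    """Flag possible model drift when the tracked metric declines for
    `streak` consecutive months, a simple stand-in for a real
    monitoring rule."""
    drops = 0
    for prev, curr in zip(monthly_rates, monthly_rates[1:]):
        drops = drops + 1 if curr < prev else 0
        if drops >= streak:
            return True
    return False

# A seasonal wobble should not raise the flag...
seasonal = drifting([0.75, 0.76, 0.74, 0.75])
# ...but a steady slide like 75% -> 74% -> 69% -> 65% should.
trending_down = drifting([0.75, 0.74, 0.69, 0.65])
```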
Brian Dowe: Thank you so much for all those insights. I think those
are some really good ideas. For me, this was a big
first step into this space, and it's really helpful to
hear about other ways to push this further from
someone such as yourself who has more experience in
the field. So yeah, I think I'm going to walk away from
this with a lot of things to think about and a lot of
interesting ideas to bring up to my colleagues
tomorrow.
Kirill Eremenko: That's awesome. Well, thanks for sharing. I think this
has been a good case study for our listeners in terms
of the whole process: not just creating an algorithm or
a model that solves a problem, which is awesome, but
actually deploying it and then considering how to
maintain it. All three are extremely important in the
whole lifecycle of a data science project, and in the
work of a data scientist.
Kirill Eremenko: All right. Well, now let's shift gears a little bit. You
mentioned learning data science, you mentioned how
you got interested in the field, how the Udemy
recommender system actually brought machine
learning to you, and then you got into it. Tell us a bit
about the learning curve: how difficult is it to learn
data science for somebody who's coming from a
developer background?
Brian Dowe: I think it depends a little bit on what you want to do
with it, and specifically how deep you want to dive
into, for example, the underlying mathematics that
drives a lot of these concepts. To that end, I will say I
think it's easier to get to a point where you can apply
these models quickly, as opposed to, say, trying to
build a certain model yourself from scratch; that
would take a bit more time.
Brian Dowe: But I will say, coming back a little bit to Machine
Learning A to Z, I think that course did a really good
job of empowering you to get started applying
algorithms quickly. After the data pre-processing
section, it basically gets right into, here's how you can
apply a linear regression model, here's how you can
apply a support vector machine. And you see from
doing a bunch of models like that, that the actual code
to apply the model, if you're using a library like
scikit-learn, is not that extensive.
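To illustrate how little code this takes, here is a sketch with scikit-learn on a tiny made-up dataset (the numbers are invented; this is not code from the course or from Education.com):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Tiny made-up dataset where y is exactly 2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Applying a model is just fit/predict...
model = LinearRegression().fit(X, y)
prediction = model.predict(np.array([[5.0]]))[0]

# ...and swapping in a different model reuses the same interface.
svr = SVR(kernel="linear").fit(X, y)
svr_prediction = svr.predict(np.array([[5.0]]))[0]
```

The uniform fit/predict interface is exactly what makes it "really easy to swap out different models", as Brian says later about scikit-learn.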
Brian Dowe: And so I think you can get to the point where you can
at least start applying models and working with them
very quickly. Once you enter that phase, transitioning
into applying these models in the real world can be a
bit steeper of a learning curve, because of the data
pre-processing component. In a lot of courses, you'll
see data arranged in a fairly neat format relative to
how you would actually find it in the real world. And I
would even go so far as to categorize pre-processing
as a separate skill from modeling, one that you need
to bring together with modeling in order to apply data
science.
Brian Dowe: And you could spend a ton of time, as I've been trying
to do more recently with Kaggle datasets, just looking
at a random dataset that you find and thinking, okay,
how can I arrange this and get it ready to be fed into a
model? Do I have all the features I need? Are they
scaled or converted appropriately into what they need
to be, like one-hot encoding categorical data to make
sure it can be fed into a model? You can spend a lot of
time just diving into the data pre-processing aspect.
And I think there can be a learning curve there, but if
you just push yourself to start tackling these
problems and dive in, just look at a random dataset
and start playing around with it, looking up what you
need as you go, like how to manipulate a dataset with
a given tool like Pandas, for example.
Brian Dowe: And if you do that and just practice with different
datasets in different contexts, you can gain a lot of
skills fairly quickly, at least enough not to be
paralyzed by the experience of feeding a dataset into a
model right away and getting a bunch of errors back.
That can be very discouraging to someone who is just
starting out and doesn't really know what to do, when
previously they've just run a command that worked
perfectly because they had a dataset that was
pre-prepared.
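The two pre-processing steps Brian names, one-hot encoding categoricals and scaling numerics, might look like this in Pandas (the columns and values are made up for illustration):

```python
import pandas as pd

# A small made-up dataset mixing categorical and numeric features.
df = pd.DataFrame({
    "grade": ["second", "third", "second"],
    "minutes_spent": [10.0, 30.0, 20.0],
})

# One-hot encode the categorical column so a model can consume it.
encoded = pd.get_dummies(df, columns=["grade"])

# Standardize the numeric column to zero mean and unit variance.
col = encoded["minutes_spent"]
encoded["minutes_spent"] = (col - col.mean()) / col.std()
```

In a real pipeline, scikit-learn's transformers would usually handle the scaling so the same parameters can be reused at prediction time; the manual version above just shows what the transformation does.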
Brian Dowe: So, to synthesize a little bit, I think it can be easier to
start learning about how the models work and to
apply them on clean datasets, and a bit harder to get
to the point where you can really use them effectively
in practice. But it's still very much attainable, very
much feasible, if you push yourself, apply yourself
and dive in, and start tackling problems. Even though
it can be a bit nerve-wracking at first, it's a really
great learning opportunity, especially when you're
struggling with all these challenges and pushing
yourself to figure them out. I think that's when some
of the deepest learning happens.
Kirill Eremenko: Totally agree. Through those challenges. And what
would you say ... you mentioned some of the difficult
things, especially about data preparation, and I totally
agree on that. What are some of the things that
actually keep you going? Obviously your successful
implementation of the Apriori and how it was used in
the business was something really cool that definitely
gave you a massive push. Was there anything else
along your journey, milestones you noticed that give
you the inspiration to keep going? That could be
helpful for listeners who are considering taking this
approach.
Brian Dowe: Yeah, getting the Apriori live was definitely a big one,
to actually see that it was possible to take something
through to production. I think for me, I sort of felt
frustrated by the fact that I knew how to apply a lot of
models without diving into the deeper mathematics.
And I really wanted to understand, okay, what does
gradient descent actually mean? I've seen it covered, I
understand the general idea of what it's doing, but
how does it actually work?
Brian Dowe: And something that I did very recently is I went
through and, by hand, wrote out all of the calculus to
derive gradient descent for a linear regression
problem. And then I did the same thing to derive the
back-propagation algorithm for neural networks.
When I did both of those things and realized the
connection between the two, and saw that the
back-propagation algorithm is an application of
gradient descent to neural nets in the same way that
gradient descent works on a linear regression, or at
least with the same conceptual grounding, that was a
really big breakthrough in understanding for me. It's
like, what does it actually mean to train a model?
Okay, you feed it data and it learns how to predict
based on that data, but what does that learning
actually mean?
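For reference, the derivation Brian describes for simple linear regression can be sketched as follows, using mean squared error; the notation here is ours, not from the episode.

```latex
% Model \hat{y} = w x + b, mean squared error over m examples:
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( w x_i + b - y_i \right)^2

% Partial derivatives of the cost with respect to each parameter:
\frac{\partial J}{\partial w}
  = \frac{1}{m} \sum_{i=1}^{m} \left( w x_i + b - y_i \right) x_i
\qquad
\frac{\partial J}{\partial b}
  = \frac{1}{m} \sum_{i=1}^{m} \left( w x_i + b - y_i \right)

% Gradient descent update with learning rate \alpha:
w \leftarrow w - \alpha \frac{\partial J}{\partial w},
\qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
```

Back-propagation repeats the same chain-rule computation layer by layer through a network, which is the connection Brian mentions.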
Brian Dowe: And gradient descent is, I think, at the core of how
many algorithms learn. For me, that was huge, and so
many things clicked and came together and made
more sense after I did that. I would definitely
recommend it, maybe not as a first starter activity,
but after you've played around with some models and
have a general idea of what a model is trying to
accomplish and what training is trying to do, diving in
and trying to understand what's going on under the
hood can really boost your confidence. It can be sort
of disorienting to look at one line of code and know
that that line of code is the entire training of a
machine learning algorithm. It can be hard to look at
that and understand what's really going on if you
haven't, for example, done some of these derivations
or deep dives into the underlying mechanics of how
everything works.
Brian Dowe: And so, I would definitely recommend doing that at
least once or a couple of times to help yourself
understand those things. It gives you a lot more
understanding when applying the models in practice;
even if you end up using the single line of code for
training, you'll understand in a deeper way what's
going on there.
Kirill Eremenko: Gotcha. Very, very interesting advice, thank you for
sharing that. I think everybody has their own ways to
get inspiration for pushing themselves forward and
learning more, and that's definitely a valid suggestion
for our listeners as well. One more thing I wanted to
ask you: what kind of tools would you recommend
starting with for somebody who's just getting into this
space, or considering getting into it?
Brian Dowe: For me, Python was definitely my language of choice. I
know that R is also very big in the space, but I had
already done some work in Python, so I was a little bit
familiar with it, and I do think it's a good place to
start. Scikit-learn, which I've mentioned a couple of
times, is, I think, a really good library to start with as
well, because there are a lot of different machine
learning models that it accounts for. So it can be
really easy to swap out different models, try different
things and see what predicts better.
Brian Dowe: Pandas is really important in Python for working with
your dataset. A lot of the time I've seen it used in
tutorials just for importing the dataset, but there is a
lot more you can do with it. You can do joins in a
similar way to how you would in SQL. I think SQL is
also really important to know on a practical level;
that's something I'm working to build my skills in,
because a lot of the time your data can be stored in
different databases, maybe something in a SQL
database, maybe something in a NoSQL database like
MongoDB, and you need to know how to run queries
in any given situation where your data might be.
Brian Dowe: So being able to pull data from different sources and
arrange it in a way that's comprehensible is definitely
valuable. SQL is definitely something you'd see a lot
in practice, and Mongo or other NoSQL databases are
picking up a lot of steam as well. So understand basic
queries to pull your data. Once you get your data into,
in my case for example, a Python script, Pandas can
be really useful to manipulate the data in any way
that you couldn't or didn't with the previous methods
before it gets into your script.
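The SQL-style joins Brian mentions look like this in Pandas; the two tables here are invented for illustration, standing in for data pulled from separate databases.

```python
import pandas as pd

# Two made-up tables, like ones pulled from different sources.
users = pd.DataFrame({"user_id": [1, 2],
                      "role": ["parent", "teacher"]})
downloads = pd.DataFrame({"user_id": [1, 1, 2],
                          "worksheet": ["w1", "w2", "w3"]})

# pd.merge plays the role of a SQL inner join:
#   SELECT * FROM downloads JOIN users USING (user_id);
joined = downloads.merge(users, on="user_id", how="inner")
```

The `how` parameter also covers left, right, and outer joins, mirroring the corresponding SQL join types.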
Brian Dowe: Scikit-learn is great for modeling, and NumPy is sort
of a foundational tool in Python that ends up working
really well with a lot of the other tools I've mentioned,
so that is very valuable. If you are looking to get into
the spectrum of neural nets, Keras is really useful.
That's something I've done a decent amount of work
with in my spare time. You can create a neural net in
just a few lines of code and swap out different
techniques, like different loss functions or activation
functions, to change the structure that you're working
with really easily. So it's a great tool for people who
are starting out with neural networks, to try out
different things and see how different structures of
your network can achieve different results. So yeah,
those are probably the tools I've worked with the most
while trying to teach myself, and I'm still very much a
novice with a lot of these things, but it's been a very
useful way for me to get started and dive in.
Kirill Eremenko: Gotcha. Well, thanks so much, that's some great
advice and a great list of tools. And Brian, we've come
to the end of the podcast in terms of time. Thank you
so much for coming on the show and sharing all these
insights. I think it was really cool; we discussed your
journey from software development to machine
learning, the different stages of building, deploying
and maintaining a model, and, as we just talked
about, getting into the space of data science, how
easy it is and what challenges people face. Before we
wrap up, I did want to ask you, what is the best way
for our listeners to get in touch with you, to follow
your career and maybe learn more things from you?
Brian Dowe: The best way to get in touch with me would be
LinkedIn, just linkedin.com/in/Brian-Dowe. If you
message me on there, I see and respond to all my
messages on LinkedIn.
Kirill Eremenko: Awesome. All right, cool. Well, that's where you guys
can find Brian. And before we finish up today, one
last question: what is your favorite book that you can
recommend to our listeners to empower their careers?
Brian Dowe: One thing I've already talked about a lot is Machine
Learning A to Z, and I just want to ... well, first off,
thank you so much, Kirill, and also to Hadelin, for
putting that out there. It was the course that got me
started on this journey and I can't thank you enough
for that. But specifically why I'd recommend it: it
gives you a lot of tools to work from. You learn not
only how to code a lot of algorithms, but in Kirill's
intuition tutorials he talks a lot about good use cases
for each algorithm, and he gives an overview of how
the algorithm works and what it's doing under the
hood.
Brian Dowe: In my case, being able to identify the Apriori
algorithm as the best tool to solve the problem I was
working on was directly connected to Kirill's
explanation of what that algorithm is used for. And so
I think even if you are just starting out, what it gives
you is a roadmap of different tools you can use,
different algorithms you can work with, and the use
cases they apply to. It empowers you to be able to at
least look at a problem and think, oh, I can think of
some algorithms that would fit this. Even if it's as
simple as, oh, this is a regression problem versus a
classification problem, it's like, okay, now I've
narrowed down the list of things I need to look into
that could be a potential solution. And I was very
pleased with how quickly I was able to do that after
taking the course. So, especially if you're just starting
out in the field, I would highly recommend it.
Kirill Eremenko: Thank you so much, that's so nice to hear. And you
actually gave me an idea, because usually for this
question I expect a book, and you recommended a
course, and it got me thinking that we should release
this course as a book. Machine Learning A to Z as a
book, I think that would be pretty cool for people to
get their hands on as well, like a supplement. Thanks
for the idea; I'll talk to Hadelin about it.
Brian Dowe: Sounds good.
Kirill Eremenko: Yeah. Well, thanks so much for coming on the show.
It's been a huge pleasure and I'm sure a lot of listeners
got so much value out of it. Once again, thank you so
much.
Brian Dowe: Yeah, thank you so much for having me.
Kirill Eremenko: So there you have it, ladies and gentlemen, I hope
you enjoyed this podcast. We discussed quite a lot of
different topics, and I'm sure you'll agree that Brian is
creating a very inspiring career path for himself: how
he's leveraging the skills he has to be better at data
science, how he's leveraging data science to help with
the work that he's doing, how he's bringing data
science and machine learning into the company where
he's working and helping them create better products
and services, serve their customer base even better
and derive value out of their data even more
efficiently.
Kirill Eremenko: My personal favorite part of this podcast was the
discussion we had about modeling: the development,
deployment and maintenance stages of any model's
lifecycle. And personally, I really enjoyed the whole
brainstorming part we had about maintenance, how
you need to think about it and how you can actually
go about maintaining a model. I think we came up
with some interesting ideas. And just in general, you
can see how discussing things with your colleagues
and peers can be very helpful, both in terms of
sharing ideas with others and in terms of coming up
with new ideas in the process. I think that's what we
saw in this podcast.
Kirill Eremenko: On that note, you can find all of the links for this
episode at www.superdatascience.com/215. That's
also where you'll find the URL for Brian's LinkedIn, so
connect with him there, follow his career, see what he
gets up to, and maybe send him a message if you have
any questions about modeling, or about integrating
data science and machine learning into your career in
a similar way to what he did.
Kirill Eremenko: And there we go. Hope you enjoyed this podcast. If you
did, make sure to leave us a review on iTunes. That
would be very helpful for us to spread the word and
get more people involved so that they know that there
are valuable insights that they can pick up from this
podcast and from our guests. Apart from that, thank
you so much for being here today and sharing this
hour with us. I can't wait to see you back here next
time. And until then, happy analyzing.