SDS PODCAST EPISODE 215 WITH BRIAN DOWE
Transcript of SDS Podcast Episode 215 with full-stack web developer Brian Dowe.
Kirill Eremenko: This is episode number 215 with a full-stack web
developer Brian Dowe.
Kirill Eremenko: Welcome to the SuperDataScience Podcast. My name
is Kirill Eremenko, Data Science Coach and lifestyle
entrepreneur, and each week we bring you inspiring
people and ideas to help you build your successful
career in data science. Thanks for being here today
and now let's make the complex simple.
Kirill Eremenko: Welcome back to the SuperDataScience Podcast, ladies
and gentlemen, very excited to have you on the show
today. Today we've got an aspiring data scientist and
full-stack web developer Brian Dowe joining us, and I
literally just got off the call with Brian a few hours ago
and what I can say about this episode is it's very
inspiring, especially if you are a web developer yourself
or a developer of any kind yourself. You will find a lot
of useful tips and insights in this episode.
Kirill Eremenko: If you're not a developer, you will also find a lot of
useful tips. But I personally found Brian's
example of how he structured his career very
insightful and something to admire and kind of like
dissect, and that's exactly what we did in the podcast.
So you'll find out ... like we touched on three main
things. So first of all we talked about what it's like to
go from a developer to data scientist or in fact how to
integrate data science in your career if you are a
developer, if you work on web development or any kind
of development, how to integrate data science in your
career. In fact, I think anybody looking to integrate
data science will find a lot of these tips useful.
Kirill Eremenko: Then we talked about models. We talked about
developing models, deploying models in business and
maintaining models. In fact, we're going to dissect the
whole case study of a recent model that Brian has
worked on using the Apriori algorithm, and you will
hear some back and forth between us about the
development, the deployment, and the maintenance
life cycle. We'll actually come up with some ideas on
the podcast, which you might find quite interesting in
terms of brainstorming and how to think about
modeling. And in general you'll get a lot of takeaways
about modeling.
Kirill Eremenko: And finally, we talked about getting into the space of
data science: what it's like to learn data
science, what the challenges in the learning curve are
that Brian has been facing. He's been learning data
science since the start of this year, so for almost a year,
plus or minus a couple of months. And you'll also hear
about the tools, some of the tool recommendations
that Brian has for you if you're just starting out into
data science.
Kirill Eremenko: So there we go. This podcast is quite interesting in terms
of the three pillars that we discussed. You'll get a lot of
value from it, and without further ado, I bring to you
Brian Dowe, full-stack developer and aspiring data
scientist.
Kirill Eremenko: Welcome to the SuperDataScience Podcast, ladies and
gentlemen, today we've got a very exciting guest on the
show. Brian Dowe calling in from San Mateo,
California. Brian, how are you going today?
Brian Dowe: I'm doing good Kirill, how are you?
Kirill Eremenko: I'm doing very well. Thank you very much. It was such
a pleasure to catch up at DataScienceGo mate, it was
exciting to hear your story, I can't wait to share it with
our audience today. But first off, how did you feel at
the event and how do you feel after it?
Brian Dowe: It was really an incredible experience. There were so
many fantastic presenters. There were a lot of people
that I sort of networked with and talked to outside of
the sessions as well, and got a lot just all around out
of everyone I interacted with. It was an amazing
experience, and I learned so much and it's given me a
lot of great jumping off points moving forward.
Kirill Eremenko: Yeah, thanks. That's really great to hear. And just
before the podcast, you mentioned you've already
made this tilt or shift in your
trajectory: you started doing stuff on Kaggle after
DataScienceGo, you started the Andrew Ng machine
learning course. Quite a few things have happened for
you. What would you say has been the biggest shift
after attending DataScienceGo 2018?
Brian Dowe: I think that before I attended I had had some
experience with applying machine learning models in
sort of an easier setting where the dataset is pre-
prepared for you and it's pretty clean, and you just get
to go right to the modeling. And so one thing I wanted
to do following up from DataScienceGo is push my
knowledge a little bit further. So I actually went
through and did like the derivations and the calculus
for gradient descent and that made a huge difference
for me in just understanding what's going on under
the hood.
Brian Dowe: And so even if ... it can be more convenient to use
libraries like scikit-learn or TensorFlow to do projects,
it's still really helpful to understand what's going on
behind the scenes so that you can work with those
tools more efficiently. So that was really ... yeah, that
was really huge for me.
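For readers who want a feel for the "under the hood" work Brian describes, here is a minimal, illustrative sketch of batch gradient descent for simple linear regression. The data and learning rate are invented for the example; this is not Brian's actual derivation or code.

```python
# Minimal batch gradient descent for simple linear regression.
# Model: y_hat = w * x + b, loss: mean squared error (MSE).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b = 0.0, 0.0
lr = 0.05  # learning rate
n = len(xs)

for _ in range(2000):
    # Partial derivatives of the MSE with respect to w and b.
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2))  # converges to 2.0 1.0
```

Libraries like scikit-learn hide exactly this loop behind a `.fit()` call, which is why working the derivatives out once by hand, as Brian did, makes those tools much less opaque.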
Kirill Eremenko: Awesome.
Brian Dowe: So with Kaggle, I've been going through just some of
the datasets that they have posted and trying to make
some like predictions even with just like basic models,
but that's given me some good practice in data pre-
processing and organizing a sort of not clean dataset
into something that can be fed into a model and that's
been some really valuable experience for me.
Kirill Eremenko: Fantastic. Great to hear. Amazing. One more thing on
DataScienceGo, I was curious about this, what was
your favorite talk? Who was the speaker who gave
your top talk?
Brian Dowe: I think one of my favorite ones was the one by Gabriela de
Queiroz from IBM. Her talk on deploying, I think it
was deploying machine learning models in five
minutes or something like that-
Kirill Eremenko: Deep learning. I think it was deep learning models in
five minutes.
Brian Dowe: Yeah, exactly. And she showed us the Model Asset
Exchange that you could use to sort of find a bunch of
prebuilt and pre-trained models and start deploying
them very quickly. And I thought that was really
interesting and walked away from that with some
items on my to do list to go through a lot of resources.
I really enjoyed her talk a great deal.
Kirill Eremenko: Fantastic. Well that's very exciting. Well Brian, very
cool to have you on the show and one of the reasons is
because I am excited, super excited about your career
path. I think you have a very inspiring journey that
you've created for yourself, and I would love to share it
with our listeners. And what I mean here for our
listeners is that what you need to know about Brian is
that he is a full-stack developer, web developer, and
we'll talk more about that in a second. And Brian sees
the value of data science, sees the value of machine
learning and deep learning and is actively applying
that in his career.
Kirill Eremenko: And I know that across our audience, across the
listeners of our podcast, a very large percentage of you
guys, I don't know, maybe 30, 40 percent, that's my
rough estimate, it might be even more, are people who
are developers who are also looking to get into data
science or are already in data science, have
transitioned into data science or have seen the value of
data science. And I think it is ... this is going to be an
inspiring story for you guys to model in your careers.
But even if you're not a developer, the steps that Brian
has taken to integrate data science in his career
without fully jumping straight into it and quitting his
job and just going data science, data science, is quite
inspiring. So I think that's going to be cool. So I'm very
excited to dig into it. How are you feeling about this
Brian?
Brian Dowe: I'm really excited. Yeah. I'm excited to dig into this as
well.
Kirill Eremenko: Awesome. Some good self reflection opportunity for
you, I guess.
Brian Dowe: Yeah.
Kirill Eremenko: Awesome. Okay. So tell us a bit about you Brian. You
are a web developer, a full-stack web developer. If
somebody off the street were to ask you, Brian, what is
it that you do? What does a full-stack web developer do?
How would you answer that question?
Brian Dowe: Sure. So generally when people ask me that because
it's come up a decent amount in conversation, I'll say
that full-stack is a combination of front end and back
end technologies. And in my specific case, I work for
a company called Education.com, it's a web
application. And so for web applications, front end
usually has to do with building templates. So all the
HTML and CSS, the parts of the site that users see
visually, and also some JavaScript for interactive
components. And the backend side is like the database
and APIs that interact with your database and grab
data to display to the end user. So the front end would
be like what you see when you look at a webpage, and
the back end would be like what happens behind the
scenes when you click a button to submit a form, like
where's that data going? What is it doing? Most of that
is handled by back end technology.
Kirill Eremenko: Gotcha. How long have you been with education.com?
Brian Dowe: It's been about nine months now, so not very long. I
interned for four months, and I've been full-time for
about roughly the last like five and a half months.
Kirill Eremenko: Okay. Gotcha. Tell us what attracted you to data
science. Like why are you on this podcast, how did you
get into this, hear about data science, and what
next steps did you take from there?
Brian Dowe: I actually first learned about data science, Kirill,
through your course Machine Learning A-Z. And I
started doing it maybe about two or three months after
I had started working for education.com. And prior to
that I had built some applications, just to sort of teach
myself, that had like very simple data
components, like maybe just a user database and
maybe like the ability to make blog posts or write
reviews, etc.
Brian Dowe: And when I stumbled across machine learning A to Z, I
had heard of the field, I didn't really know too much
about it. And I remember watching your introductory
video where it kinda goes over a lot of the applications
and use cases of machine learning. And I thought,
oh, this is really cool, this provides a really interesting
way to look at data, gain insights from it and improve
the actions that you can take on the basis of that. And
so that was really interesting to me, and I think as I
progressed through the course, I found it to be more
accessible and something that I could jump into even
though I didn't have too much experience. And yeah, it
just kind of progressed from there. And my interest in
it has only grown over time.
Kirill Eremenko: Interesting. Let's rewind a little bit. Tell me this, how
did you stumble across the Machine Learning A-Z
course? What were you searching online for? Obviously
there was some kind of a need that you were trying to
fulfill when you saw it. Like people don't normally
stumble upon Machine Learning A-Z unless they're
actually looking for something related to data science.
Like what was the initial trigger for that to happen?
Brian Dowe: I had been using Udemy for a long time before that for
projects that were [inaudible 00:12:25] specifically to
data science. So I mentioned before how I was trying to
get started by building a simple application to develop
my development skills basically. I had taken some
courses at that time on building a clone of Yelp with
Ruby on Rails, and then I found one on building a price
alerts app with Python and Flask. And so throughout
the course of my learning, my Udemy feed or the
courses that popped to the top had to do a lot with app
development and with programming in general and
with Python also, specifically over time. And then I
think it was through that your course just sort of
popped up on my list.
Kirill Eremenko: Okay. Like you got into machine learning and data
science because machine learning and data science got
you in there in the first place. That's what it sounds
like.
Brian Dowe: Yeah. Sort of, yeah.
Kirill Eremenko: The circle has closed, right? We've gone full circle, this
is so cool, right? Like you're studying exactly the stuff
that has influenced your career to study that stuff.
This is like Inception level type of thinking, right? Have
you ever thought of it that way?
Brian Dowe: I haven't, but that's really interesting now that you put
it that way.
Kirill Eremenko: Oh wow. That's so cool. That's so cool. One of the
craziest stories. Okay. So you got into machine
learning and you started taking the course and at the
same time, how were you able to apply this at work?
Like you're a full-stack developer, and this is where
the interesting stuff starts to come in because I know
your story a little bit already. How were you able to
take that into ... as a developer, if I'm a full-stack
developer, I might be a bit like shy, or it might
not even come to mind that I can bring this stuff,
machine learning, to work. It doesn't apply. It's
completely unrelated to my role. So how did you go
about that?
Brian Dowe: That's a really interesting question and I definitely did
feel that like not knowing exactly where to start, where
the right opportunity was and it happened in a very
roundabout and sort of like by chance way. Our
company, we try to hold at least one hack week every
year. And for those who might not be familiar with
what that is, it's generally where everyone can sort of
come up with their own projects that they want to
build outside of the development pipeline and then
people can team up and then just try to build whatever
they want and then we all present to each other at the
end of the week. And by the time hack week came
about, which was about like May of this year, I had
been studying machine learning maybe for like two to
three months.
Brian Dowe: I had talked ... like briefly mentioned it to some people
at work in passing. And then when hack week came
about, one of my colleagues approached me and he
wanted to build a recommendation system using
machine learning. And that was basically the extent of
the background for it. And, yeah. So I remember
sitting down with him sort of trying to brainstorm
what to do ... before I get too much deeper into that,
this hack week was really the catalyst for a lot of the
professional application of machine learning in my
workplace, just because it provided an
opportunity to just build a project that you're
passionate about without any other restrictions. I
think looking back, that really empowered me to start
bringing up ideas and trying to make things work
outside of that context and just do it as part of our
normal pipeline of projects.
Brian Dowe: But if I could, looking back, make a recommendation
to someone in a similar position, it would be to not
hesitate to just dive in and start trying to find
problems to solve. Because I think a lot of full-stack
developers have access to a wide breadth of data
because you have to work with it when you build your
applications. And so I think just sort of diving into the
data that your company has and then trying to figure
out, okay, what's a way that I could use this? Like
what's something that I could predict based on certain
features or what value can I extract from this that
could be presented to the end user in some way.
Brian Dowe: I think my recommendation would be to just dive in,
start thinking about problems to solve, and when you
have something or an idea of how to go, odds are that
the other developers you work with will be intrigued
and curious about it if you have an idea that
could potentially bring value to the company. I think
like trying to empower yourself and just dive in and
start looking for problems to solve is one of the best
ways to get started.
Kirill Eremenko: I totally agree with that. And I just wanted to mention
two points here. First one is that, you also obviously
need to be careful as a developer. It depends on
company to company, but often you do have potential
access to data, but for instance, in Facebook, if you
use it for the wrong reasons, you'll probably get fired.
So kind of like, be sensible about that and maybe
confirm with your boss if you can use that data. Or
maybe, if that's a bit too early to do,
create some dummy datasets that are similar or
have similar columns in terms of structure to the
company dataset, but like play around or download
some datasets in your free time to play around with
before you actually play around with real data from your
customer base, especially if it's sensitive data. That's
one comment I just wanted to make.
Kirill Eremenko: But in general, indeed, like even if your company
doesn't have these hack weeks or hackathons, which I
think are very useful, but they usually take
place in large organizations that want to inspire
innovation. So if your company doesn't have that yet,
maybe you can talk to the management to start
introducing it. But even if that's not the case, you can
still play around with data or dummy data or similar
data and see how you could potentially bring value to
the business. And you can still bring those results and
those suggestions to your management or to the
company leadership and present it to them. Every
company in this day and age wants to be data driven
or model driven. They will jump on top of it if you can
add value to the bottom line of the company. It is a
very rare instance that management turns down those
ideas, especially when it's concerning technology and
data science, where the investment isn't that high but
the return on investment can be massive. Very rarely
do you get cases where management turns down those
innovations.
Kirill Eremenko: So as long as you're proactive, don't let the absence of
a hack week or a hackathon get in your way. Or if, for
instance, your hack week is 11 months away, it
happens once a year. If it's 11 months away, don't
wait, you can already do it now. Those are just some of
my thoughts on that. Brian, you were talking about
how you were preparing for this hack week, and a colleague
of yours came up to you and then you got into this project.
What was this whole idea? You said you were going to
dive deeper into that topic.
Brian Dowe: Sure. So the project was a recommendation system for
our content. So just maybe to give a little bit of
background information on what we do in
education.com, we're a platform for parents and
teachers to come and find resources and teaching tools
to help their kids. And so when I say like a
recommendation system for our content, I mean we
have just a bunch of worksheets and games and
lesson plans and all of this stuff that parents and
teachers can come and use.
Brian Dowe: So what we wanted to do is create a way or improve
the way that we recommend other resources that you
can use based on what you're viewing. So if you like go
to our website and click on a worksheet, you'll see at
the bottom, here are five other worksheets that are
related to this one that we would recommend for you.
And so, yeah, so just to give a little bit of background
on that.
Brian Dowe: But anyway, so a colleague approached me, his name
is Yon Burke, a brilliant engineer. So he approached
me wanting to build this recommender and as we were
sitting down and sort of talking about some
different angles that we could go about this with,
like what data would we use, like how I'd rearrange it,
like what model would be needed? He brought up
something, he said, "What if we're looking at this
problem the wrong way? What if this problem is 'people
who downloaded this also downloaded this'?" Or like,
what if that's the way that we're going to look at it.
Brian Dowe: And as soon as he said that, a light bulb in my brain
went off and I thought, I know that I learned of an
algorithm in Machine Learning A-Z that's perfect for
solving this exact type of problem. So I went through
and I looked through all my notes and it turned out
that that algorithm was the Apriori algorithm. I know
that I made that connection because the example given
was using that algorithm for a grocery store to try and
figure out where to place their products in relation to
each other based on what products most users or
most customers would buy together. And so I
remember it just ... like a light bulb moment went off
and I made that connection, and then I knew that was
the place where we had to start based on the way that
we had scoped the problem now.
Brian Dowe: So what I did was I went back and I looked at the
slides from Machine Learning A-Z from that section,
and in those slides, Kirill, you went over the equations
for support, confidence and lift for that algorithm. And
then using those equations, I scripted, or I just
converted that into code basically in Python, and then
set up the structure to loop through all of our data
and gather the information that we needed.
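For reference, the three quantities Brian mentions can be computed directly from download histories with a few lines of Python. This is a hedged sketch, not Education.com's actual code; the transaction data and item names are invented:

```python
# Support, confidence and lift, the core metrics of the Apriori algorithm.
# Each "transaction" is the set of worksheets one user downloaded.
transactions = [
    {"mult_2digit", "div_2digit"},
    {"mult_2digit", "div_2digit", "fractions"},
    {"mult_2digit", "fractions"},
    {"div_2digit"},
    {"mult_2digit", "div_2digit"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(both) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline popularity."""
    return confidence(antecedent, consequent) / support(consequent)

a, c = {"mult_2digit"}, {"div_2digit"}
print(round(support(a | c), 4))    # 0.6    (3 of 5 users downloaded both)
print(round(confidence(a, c), 4))  # 0.75   (0.6 / 0.8)
print(round(lift(a, c), 4))        # 0.9375 (0.75 / 0.8)
```

A lift above 1 means the pairing occurs more often than chance would predict; in a recommender like the one Brian describes, you would rank candidate items for each worksheet by metrics like these.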
Brian Dowe: And Yon, the engineer I was working on this with, did
the pre-processing portion and then got me the
dataset. And then we started testing and trying to
figure out if it was going to work, if it was gonna give
back good predictions, because one of the attributes of
this algorithm that made it a little challenging to figure
out whether it was going to produce good results is
that we didn't get the immediate feedback on whether
it's a correct or incorrect estimate, for example, like
you would with supervised machine learning
algorithms.
Brian Dowe: So we had to look at some of the results by hand and
figure out, okay, is this producing the results that we
wanted. And what we were specifically looking for was
to have sort of more loosely defined recommendations
than we currently had for a given worksheet, because
let's say you looked up like a two digit multiplication
worksheet on our site, what we had before, it would
have just given you like three or five [inaudible
00:24:28] multiplication worksheets.
Brian Dowe: And we were thinking, okay, like how useful is that
really for someone who is coming to our site and
looking for another jumping off point. Like if they find
a worksheet, are they just going to want to find more
worksheets that are exactly the same? Probably not.
What might be more useful is like maybe they're
looking for like two digit division, something that's
related to what they're on, but not exactly the same.
And so we were hoping that by using user download
histories, instead of just like content tagging or like
what the exact subject was, that we would get some
better results that would actually be more beneficial to
users.
Brian Dowe: And as we went through, that turned out to be the
case and we saw that we were getting the
recommendations that we were looking for. And then
after we presented at the end of the week, it started
picking up steam and enthusiasm and then we
decided to run it on the site as an A/B test. And after
running it for a while, it turned out that it won against
our former recommendation system, and then we
pushed it to production and it became our first
machine learning model that was deployed on the live
site. So that was a huge win and a really exciting first
step in implementing machine learning. But the first of
many is what I'm very much hoping for, and I'm striving
to make the case that this is just the first step on a
longer journey to integrate more machine learning
into our operations. And so yeah, it's very exciting
stuff.
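A hedged sketch of how an A/B test winner like this might be checked for statistical significance, using a standard two-proportion z-test. The conversion numbers below are invented for illustration, not Education.com's actual results:

```python
# Two-proportion z-test: did the new recommender beat the old one?
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z-statistic and two-sided p-value for two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Invented numbers: clicks on recommendations out of page views,
# old system (a) versus new Apriori-based system (b).
z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
print(round(z, 2), round(p, 4))
```

With these made-up numbers the p-value comes out around 0.01, i.e. the lift would be unlikely to be noise; in practice libraries like statsmodels provide this test ready-made.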
Kirill Eremenko: Wow, that's awesome. Congratulations. That's a huge
project and most importantly that it actually got
implemented into the business and added value, that's
just really cool to hear. And just to recap, so let me
know if I'm getting this right. So before, the
recommender system was recommending more content
to your users based on the tagging. So for instance,
you have these multiplication sheets and maybe
tutorials, maybe some videos, some other content;
you have all sorts of content and they're all tagged, like
this is multiplication, this is for first grade, this is for
fourth grade, this is math, this is this other topic and
so on. And based on the tagging, the similarity of
tags, it would recommend something.
Kirill Eremenko: Whereas your new recommender system, which works
through the Apriori algorithm, would look at not just
the tagging, but at what users are
actually downloading. So if people on average
download X, what did they on average download after
that, or what did most of them download after that? So
you're looking at the behavior of your users and based
on that you're saying, okay, well even though it's not
tagged identically, looks like that's what people want,
that's what people are after, and we're gonna use that
as a suggestion. And so that Apriori algorithm
approach worked better than your previous
recommender method. Is that about right?
Brian Dowe: Yeah, that's essentially the gist of it.
Kirill Eremenko: Gotcha. Sounds a lot like Amazon. Like you go on
Amazon and you buy something and then out of the
blue they recommend something else, it might not be
related at all, but that's because most people kind of
like after they bought that one thing, they usually go
searching for that other thing; the people that are
similar to you, I guess.
Brian Dowe: Yeah.
Kirill Eremenko: Gotcha. And then after that you did a qualitative
assessment, not a quantitative assessment, as you
mentioned, because you didn't have, like with
supervised learning algorithms, you couldn't say yes,
no, correct or incorrect. In this case, you did a
qualitative assessment where you just went in and you
tested out a few of those recommendations to see if
the recommendation actually made sense to
you. Like the example you gave, like a three by three
multiplication table instead of getting a four by four
one, you get a division table, which kinda like makes
sense. Is that what you did next?
Brian Dowe: Yeah, that's essentially what we did. We were looking
for things that were like not exactly the same but still
would likely be within the same subject area, that
might be closely related, for example, yeah, like that
example that I gave hoping that users would be maybe
studying those two subjects at the same time and they
could find things that would be more directly related to
like where they would go next from a given subject or
something that they might be also practicing at the
same time.
Kirill Eremenko: And so how long has this system been in place now?
This recommender algorithm?
Brian Dowe: It's been a couple months now that it's been up. The
hack week was in May, and after that there were some
additional steps that needed to be taken to set it
up and running with our whole dataset, and that took
some time. But yeah, for a couple months now it's
been up and running and live.
Kirill Eremenko: How are the results? Is the management
seeing some positive impacts? Are
they happy with what this change
has brought?
Brian Dowe: Yeah, it has brought a positive change, but we're
definitely also looking to see how we can improve it
because this was just sort of like a first run. Like it did
win the A/B test, but we were still looking for ways to
improve and to provide more valuable
recommendations to our users. And I think, yeah, that
this was a good first step to at least identify resources
that are being used and consumed most often. But I
think there's so many places that this could go and
there's so much that we could still do to improve our
recommendation capabilities and that's something
that we're diving into right now.
Kirill Eremenko: That's so cool. And I can just imagine the boost of
morale that you got from that, like when your
algorithm got implemented, how did that feel?
Brian Dowe: It was incredible. I think one big takeaway that I had
from that is that when I was first studying machine
learning, sort of leading up to that, and I think still a
little bit following from the implementation of that
algorithm, my focus was, I wanted to learn all these
like cool, cutting edge tools like neural nets and like
GANs and stuff like that, like some of the more
advanced modeling tools. But the process of actually
sitting down, looking at what data we had available
and trying to choose a model that fits best with that
sort of showed me that you have to start from the
problem and then try and find the model that best fits
the solution, rather than just picking a cool model that
you have in mind that you really want to work with,
like a CNN, and then searching for a problem that fits
that solution.
Brian Dowe: I think that can be like a great way to learn about a
new technology, like if you're focused on learning about
like CNNs and you look for problems that specifically fit
that use case, that's great for learning. But in practice,
you don't always get to choose the problems that come
to you, sometimes there's just a problem at hand that
needs to be solved and you need to explore your toolkit
and just choose whatever is best fitted for that
solution. Like sometimes I think I've seen cases, or at
least vaguely read about cases, where a bunch of
models were tried and a simple linear regression or a
simple logistic regression was the best outcome. And
sometimes that's the case.
Brian Dowe: Sometimes the model that you initially had in mind,
that you were focused on, isn't the best tool for the
problem, and something that maybe is a little bit
simpler and more tightly scoped is. And so I think that's something
that I thought about a lot is it doesn't have to be like
the craziest algorithm out there to make a really big
impact. And so that's something that I think about
going forward is just starting from the problem first
and then searching for the tool that solves that after
the fact.
Kirill Eremenko: Very, very wise words. Totally agree with that. It's very
easy to get carried away with machine learning and
all of these new shiny cool things. And that's awesome
to learn them and they're very good to inspire you to
learn and to grow and to find like these different
nonstandard applications, but at the same time,
sometimes less is more. You just go for what solves a
problem and does that efficiently. And in your case it
was the Apriori, which is awesome to hear.
Kirill Eremenko: Another thing I wanted to ask you on this topic was
about the integration of the algorithm into your
company's services. I think this is a very cool area that a
lot of the time is missed by data scientists: that you not
only have to develop an algorithm, but you have to
also operationalize it. You have to make it work in the
company so you have to somehow put it into
production, it has to integrate with the website and all
those things. So I think that'll be really cool to talk
about that. Are you able to share some details with us
on how you went about it and what challenges you
faced along the way?
Brian Dowe: Yeah, I can share some details about that. A lot of
the bulk of the work to actually set this up on
production was handled by my colleague, who has like
a lot more familiarity with the way that these systems
work. A big challenge was just being able to run all of
this data, because we have so much data in user
download histories from like all of our users and all of
the things they downloaded. It gets pretty big pretty
quickly, and so knowledge of cloud computing
services became really helpful in this case. And that's
what we were able to use to do this. And so
one thing, a challenge that came along with that,
is trying to decide when and how often the model would
need to be run. Because I think that's a question
that is really important, is like, do you need to make
actual calculations with this model on the fly, or can
you run it like every so often and use the results or
store the results from that to display to the end user?
Brian Dowe: And so we ended up going with the second approach of
just running our algorithm every so often to gather
this information: for a given worksheet or game that a
user downloaded or played, figuring out the top
worksheets that are associated with it, which is the
main task the algorithm was accomplishing. We
decided that it would be much more efficient for us to
just run this every so often and store the results, so
that when a user visits the worksheet, we can just pull
it directly from our database without having to
actually run the calculation on the fly.
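A minimal sketch of the batch approach Brian describes: count which items are downloaded together, then store each item's top associations so the website can read them from a database instead of computing on the fly. The data, names, and top-N cutoff here are invented for illustration, not Education.com's actual code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical download histories: one set of worksheet IDs per user.
download_histories = [
    {"fractions_1", "fractions_2", "decimals_1"},
    {"fractions_1", "fractions_2"},
    {"fractions_1", "decimals_1"},
]

# Count how often each pair of worksheets is downloaded together.
pair_counts = Counter()
for history in download_histories:
    for a, b in combinations(sorted(history), 2):
        pair_counts[(a, b)] += 1

# For each worksheet, keep its top co-downloaded worksheets; in
# production these lists would be written to a database and served
# directly when a user visits a worksheet page.
TOP_N = 2
recommendations = {}
for (a, b), count in pair_counts.items():
    recommendations.setdefault(a, []).append((count, b))
    recommendations.setdefault(b, []).append((count, a))
for item, scored in recommendations.items():
    scored.sort(reverse=True)
    recommendations[item] = [name for _, name in scored[:TOP_N]]
```

A real Apriori implementation would also prune by minimum support and confidence; this sketch only shows the precompute-and-store pattern.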
Brian Dowe: And so I think that's something that can vary from use
case to use case and model to model, depending on
how you're trying to apply it. Maybe you will need to
actually run a prediction or a calculation on the fly
when a user clicks a button, for example. Whether or
not that's a feasible place to implement your model, I
think, depends on the weight of the calculation that it
needs to make. If it needed to make a calculation that
required a lot of processing time but the user needs
the result immediately, then that might not be a
feasible way to go. Versus if it's something that might
not change drastically from day to day or week to
week, then you could just run your algorithm every so
often and store the results within a structure that you
already have set up to store data, for example.
Brian Dowe: So I think it depends on the use case, it depends on
what you're trying to do and how often the results of
the model need to be updated. And I think yeah, there
are definitely ways to think strategically to integrate
systems like this without running massive programs
like every time a user clicks a button for example.
Kirill Eremenko: Okay, gotcha. Very, very cool advice as well that
there are different types of models, and I hope people
are taking notice of this: sometimes you might need
to run them on the fly and get results all the time,
sometimes it's enough to run them occasionally. And
especially big companies with lots of data have time
slots for running things on their servers. Like I was at
a company after leaving Deloitte where they would run
things at night, and the whole night was split into
periods, usually 15-minute chunks or maybe even
shorter, where you had to apply to get server time.
Now more and more things are going to the cloud, but
still a lot of companies do their calculations on
premise.
Kirill Eremenko: And so, in order to run all these models, do all the
calculations and update all the data, especially in
financial services, for example, where you've got to
turn over all the things that happened during the day,
make sure everything is accurate and reconcile a lot
of stuff, all of that happens at night, and you want to
be very mindful of how you're using server time in
your company. But even if it's in the cloud, every time
you rerun the model it's still going to be a cost, it's
still going to take some kind of budget. So being
conscious about whether you need it right away or not
is quite an important thing. Now I'm going to ask you
a question to take us into a bit of a different space:
have you taken the Data Science A to Z course?
Brian Dowe: No, I don't think I have actually.
Kirill Eremenko: Okay. So the reason I ask is because in the Data
Science A to Z course we talk about model
maintenance. And I just wanted to see ... obviously
this has been a big success and breakthrough for both
you and the company in terms of developing and
deploying this Apriori model. What are your thoughts
on how you're going to maintain it? The reason I ask
is because I've seen situations where models
deteriorate. In fact, I've seen a situation where a
model used to be very effective, it would bring like 80
percent accuracy, but over a period of 18 months it
deteriorated to a level where the accuracy was less
than 50 percent, like 42 percent or something.
Meaning it would have been more efficient for the
company to just flip a coin and base the
recommendations on that rather than on a model. So
I'm just curious, what are your thoughts on how to
maintain this model?
Brian Dowe: Yeah, that's a really interesting question. I think to a
certain extent this can depend on the domain that
you're in. What I mean by that is that different
domains have a different pace to how their data shifts
structurally over time. For us in the education space
at Education.com, that means we follow the cycle of a
school year. So for example, for a given parent or
teacher, based on the time of year and when they
signed up, maybe they sign up in August and then
from August through June they're using a bunch of
second grade resources. And then, for a parent, maybe
their child moves on to the next grade and they're
going to be consuming a bunch of third grade
resources.
Brian Dowe: So, if we were to just continually update the model
based on new data coming in and not deal with
eliminating old data, it would undoubtedly begin to
grow stale over time, in the sense that you'd now be
making predictions based on users who had been with
the platform for several years. And so you might not
get the best recommendation for a second grade
worksheet, because you'd be pooling download
histories from a bunch of different grades as well. So
for us, something that's already implemented as part
of the algorithm is to keep it seasonal and to try to
look only at data that's relevant to the time period
that we're in relative to the school year.
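The seasonal windowing Brian describes could be sketched like this, restricting training data to the current school year. The August-through-June boundary, the function names, and the records are all hypothetical simplifications for illustration.

```python
from datetime import date

def school_year_window(today):
    """Return the start and end dates of the school year containing
    `today`, assuming (as a simplification) an August-June year."""
    if today.month >= 8:  # Aug-Dec: the school year began this August
        return date(today.year, 8, 1), date(today.year + 1, 6, 30)
    return date(today.year - 1, 8, 1), date(today.year, 6, 30)

def seasonal_downloads(downloads, today):
    """Keep only downloads from the current school year, so the model
    isn't trained on stale data from earlier grades."""
    start, end = school_year_window(today)
    return [d for d in downloads if start <= d["date"] <= end]

downloads = [
    {"worksheet": "grade2_math", "date": date(2018, 9, 10)},
    {"worksheet": "grade1_math", "date": date(2017, 3, 5)},  # stale
]
current = seasonal_downloads(downloads, date(2018, 11, 1))
```

The same idea generalizes: swap `school_year_window` for whatever seasonal boundary fits the domain.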
Brian Dowe: And so I think thinking about things like that ... for
us right now, grade is the main way that we look at
this, but I think it could be interesting to dive in and
see how the subjects covered might change over time.
For now, grade and school year are the main data we
have to go on, because there are a lot of different
curriculums out there and our site is used very
widely, so it's a challenge to find a way to update your
algorithm in a way that meets the needs of all the
users on your platform. School year and grade just
happen to be one dimension that's pretty standardized
across our users, but trying to push that further and
make it even more relevant to what a given user might
be looking for is definitely a challenge that we're
going to have to rise up to and meet over time. A
really challenging problem.
Kirill Eremenko: Okay. Gotcha. That's a very cool consideration about
the seasonality. If I may, I'd like to give you another
suggestion, is that okay?
Brian Dowe: Sure. Yeah.
Kirill Eremenko: So one thing I was thinking about is that with your
model, you could measure how well it makes those
recommendations. You could say, all right, we made
all these recommendations; in what percentage of
cases was the recommended content actually
consumed by the user to whom it was recommended,
or consumed within a relevant timeframe? Maybe they
didn't consume it right away, maybe they had to think
about it, but within a week or a month they did
indeed consume the content. Because if you have the
right data points in place, you can actually collect
that information: we recommended content Z, and
within the week they either did or didn't use content
Z. With that kind of yes/no approach you can see
that, at this stage, your model has, I don't know,
maybe a 75 percent accuracy rate, meaning 75
percent of the content you recommend is indeed
consumed by the users.
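The acceptance-rate metric Kirill proposes might be sketched like this; the event structure and the one-week window are hypothetical choices for illustration.

```python
from datetime import date, timedelta

def acceptance_rate(events, window_days=7):
    """Fraction of recommendations the user actually consumed within
    `window_days` of being shown the recommendation. Each event is
    a pair (recommended_on, consumed_on_or_None)."""
    if not events:
        return 0.0
    hits = 0
    for recommended_on, consumed_on in events:
        deadline = recommended_on + timedelta(days=window_days)
        if consumed_on is not None and recommended_on <= consumed_on <= deadline:
            hits += 1
    return hits / len(events)

events = [
    (date(2018, 9, 1), date(2018, 9, 3)),   # consumed within a week
    (date(2018, 9, 1), None),               # never consumed
    (date(2018, 9, 1), date(2018, 10, 1)),  # consumed, but too late
    (date(2018, 9, 2), date(2018, 9, 2)),   # consumed the same day
]
rate = acceptance_rate(events)
```

Logged month by month, this single number is what the monitoring discussion below turns on.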
Kirill Eremenko: And then I would set up a system that would track
that, even autonomously, and say, all right, how is
our model going month to month? Maybe you'll see
some seasonality in that, which you could possibly
explain, but maybe with time you will see, oh, that's
interesting, before it was 75 percent, now it's dropped
to 74, now it's 69, now it's down to 65. If you see a
consistent downward trend, that means something's
going on with your model. And possibly that could be
a shift, not just a seasonal shift, but a shift in the
demographics of your population. Maybe a change in
some kind of legislation around what students have to
learn or shouldn't learn. Maybe a change in the
content available on your platform or outside it, or
some influence from competitors or advertising
agencies drawing people to other stuff, things that
your model couldn't possibly have taken into account
at the time you created it, because that event or that
environment was not there at the time.
Kirill Eremenko: And so I feel that dynamic, constant tracking through
time is important, because it allows you to fish out
these changes that you cannot otherwise see, right?
You need a flag, some kind of indicator, that
something is going on in the industry, in the platform,
in the business or in the audience that we need to
address, so that we can either retrain the model,
rebuild it, or so on. Because if you only find out once
it's already off track, it can actually be a bit too late.
What do you think about that?
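One simple way to turn this monthly tracking into an automatic flag is to watch for a sustained decline in the metric. The streak length and the example series below are hypothetical; a real monitoring rule would likely be statistical rather than a fixed threshold.

```python
def drifting(monthly_rates, streak=3):
    """Flag possible model drift when the tracked metric declines for
    `streak` consecutive months, a simple stand-in for a real
    monitoring rule."""
    drops = 0
    for prev, curr in zip(monthly_rates, monthly_rates[1:]):
        drops = drops + 1 if curr < prev else 0
        if drops >= streak:
            return True
    return False

# A seasonal wobble should not raise the flag...
seasonal = drifting([0.75, 0.76, 0.74, 0.75])
# ...but a steady slide like 75% -> 74% -> 69% -> 65% should.
trending_down = drifting([0.75, 0.74, 0.69, 0.65])
```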
Brian Dowe: Thank you so much for all those insights. I think those
are some really good ideas. For me, this was a big
first step into this space, and it's really helpful to
hear about other ways to push this further from
someone such as yourself who has more experience in
the field. So yeah, I think I'm going to walk away from
this with a lot of things to think about and a lot of
interesting ideas to bring up to my colleagues
tomorrow.
Kirill Eremenko: That's awesome. Well, thanks for sharing. I think this
has been a good case study for our listeners in terms
of the whole process: not just creating an algorithm or
a model that solves a problem, which is awesome, but
actually deploying it and then considering how to
maintain it. All three are extremely important in the
whole lifecycle of a data science project, and in the
work of a data scientist.
Kirill Eremenko: All right. Well, now let's shift gears a little bit. You
mentioned learning data science, you mentioned how
you got interested in the field, how the Udemy
recommender system actually brought machine
learning to you, and then you got into it. Tell us a bit
about the learning curve: how difficult is it to learn
data science for somebody who's coming from a
developer background?
Brian Dowe: I think it depends a little bit on what you want to do
with it, and specifically how deep you want to dive
into, for example, the underlying mathematics that
drives a lot of these concepts. To that end, I will say I
think it's easier to get to a point where you can apply
these models quickly, as opposed to, say, trying to
build a certain model yourself from scratch; that
would take a bit more time.
Brian Dowe: But I will say, coming back a little bit to Machine
Learning A to Z, I think that course did a really good
job of empowering you to get started applying
algorithms quickly. After the data pre-processing
section, it basically gets right into, here's how you can
apply a linear regression model, here's how you can
apply a support vector machine. And you see from
doing a bunch of models like that, that the actual code
to apply the model, if you're using a library like
scikit-learn, is not that extensive.
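To illustrate how little code this takes, here is a sketch with scikit-learn on a tiny made-up dataset (the numbers are invented; this is not code from the course or from Education.com):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Tiny made-up dataset where y is exactly 2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Applying a model is just fit/predict...
model = LinearRegression().fit(X, y)
prediction = model.predict(np.array([[5.0]]))[0]

# ...and swapping in a different model reuses the same interface.
svr = SVR(kernel="linear").fit(X, y)
svr_prediction = svr.predict(np.array([[5.0]]))[0]
```

The uniform fit/predict interface is exactly what makes it "really easy to swap out different models", as Brian says later about scikit-learn.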
Brian Dowe: And so I think you can get to the point where you can
at least start applying models and working with them
very quickly. Once you enter that phase, transitioning
into applying these models in the real world can be a
bit steeper of a learning curve, because of the data
pre-processing component. In a lot of courses, you'll
see data arranged in a fairly neat format relative to
how you would actually find it in the real world. And I
would even go so far as to categorize pre-processing
as a separate skill from modeling, one that you need
to bring together with modeling in order to apply data
science.
Brian Dowe: And you could spend a ton of time, as I've been trying
to do more recently with Kaggle datasets, just looking
at a random dataset that you find and thinking, okay,
how can I arrange this and get it ready to be fed into a
model? Do I have all the features I need? Are they
scaled or converted appropriately into what they need
to be, like one-hot encoding categorical data to make
sure it can be fed into a model? You can spend a lot of
time just diving into the data pre-processing aspect.
And I think there can be a learning curve there, but if
you just push yourself to start tackling these
problems and dive in, just look at a random dataset
and start playing around with it, looking up what you
need as you go, like how to manipulate a dataset with
a given tool like Pandas, for example.
Brian Dowe: And if you do that and just practice with different
datasets in different contexts, you can gain a lot of
skills fairly quickly, at least enough not to be
paralyzed by the experience of feeding a dataset into a
model right away and getting a bunch of errors back.
That can be very discouraging to someone who is just
starting out and doesn't really know what to do, when
previously they've just run a command that worked
perfectly because they had a dataset that was
pre-prepared.
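The two pre-processing steps Brian names, one-hot encoding categoricals and scaling numerics, might look like this in Pandas (the columns and values are made up for illustration):

```python
import pandas as pd

# A small made-up dataset mixing categorical and numeric features.
df = pd.DataFrame({
    "grade": ["second", "third", "second"],
    "minutes_spent": [10.0, 30.0, 20.0],
})

# One-hot encode the categorical column so a model can consume it.
encoded = pd.get_dummies(df, columns=["grade"])

# Standardize the numeric column to zero mean and unit variance.
col = encoded["minutes_spent"]
encoded["minutes_spent"] = (col - col.mean()) / col.std()
```

In a real pipeline, scikit-learn's transformers would usually handle the scaling so the same parameters can be reused at prediction time; the manual version above just shows what the transformation does.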
Brian Dowe: So, to synthesize a little bit, I think it can be easier to
start learning about how the models work and to
apply them on clean datasets, and a bit harder to get
to the point where you can really use them effectively
in practice. But it's still very much attainable, very
much feasible, if you push yourself, apply yourself
and dive in, and start tackling problems. Even though
it can be a bit nerve-wracking at first, it's a really
great learning opportunity, especially when you're
struggling with all these challenges and pushing
yourself to figure them out. I think that's when some
of the deepest learning happens.
Kirill Eremenko: Totally agree. Through those challenges. And what
would you say ... you mentioned some of the difficult
things, especially about data preparation, and I totally
agree on that. What are some of the things that
actually keep you going? Obviously your successful
implementation of the Apriori and how it was used in
the business was something really cool that definitely
gave you a massive push. Was there anything else
along your journey, milestones you noticed that give
you the inspiration to keep going? That could be
helpful for listeners who are considering taking this
approach.
Brian Dowe: Yeah, getting the Apriori live was definitely a big one,
to actually see that it was possible to take something
through to production. I think for me, I sort of felt
frustrated by the fact that I knew how to apply a lot of
models without diving into the deeper mathematics.
And I really wanted to understand, okay, what does
gradient descent actually mean? I've seen it covered, I
understand the general idea of what it's doing, but
how does it actually work?
Brian Dowe: And something that I did very recently is I went
through and, by hand, wrote out all of the calculus to
derive gradient descent for a linear regression
problem. And then I did the same thing to derive the
back-propagation algorithm for neural networks.
When I did both of those things and realized the
connection between the two, and saw that the
back-propagation algorithm is an application of
gradient descent to neural nets in the same way that
gradient descent works on a linear regression, or at
least with the same conceptual grounding, that was a
really big breakthrough in understanding for me. It's
like, what does it actually mean to train a model?
Okay, you feed it data and it learns how to predict
based on that data, but what does that learning
actually mean?
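For reference, the derivation Brian describes for simple linear regression can be sketched as follows, using mean squared error; the notation here is ours, not from the episode.

```latex
% Model \hat{y} = w x + b, mean squared error over m examples:
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( w x_i + b - y_i \right)^2

% Partial derivatives of the cost with respect to each parameter:
\frac{\partial J}{\partial w}
  = \frac{1}{m} \sum_{i=1}^{m} \left( w x_i + b - y_i \right) x_i
\qquad
\frac{\partial J}{\partial b}
  = \frac{1}{m} \sum_{i=1}^{m} \left( w x_i + b - y_i \right)

% Gradient descent update with learning rate \alpha:
w \leftarrow w - \alpha \frac{\partial J}{\partial w},
\qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
```

Back-propagation repeats the same chain-rule computation layer by layer through a network, which is the connection Brian mentions.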
Brian Dowe: And gradient descent is, I think, at the core of how
many algorithms learn. For me, that was huge, and so
many things clicked and came together and made
more sense after I did that. I would definitely
recommend it, maybe not as a first starter activity,
but after you've played around with some models and
have a general idea of what a model is trying to
accomplish and what training is trying to do, diving in
and trying to understand what's going on under the
hood can really boost your confidence. It can be sort
of disorienting to look at one line of code and know
that that line of code is the entire training of a
machine learning algorithm. It can be hard to look at
that and understand what's really going on if you
haven't, for example, done some of these derivations
or deep dives into the underlying mechanics of how
everything works.
Brian Dowe: And so, I would definitely recommend doing that at
least once or a couple of times to help yourself
understand those things. It gives you a lot more
understanding when applying the models in practice;
even if you end up using the single line of code for
training, you'll understand in a deeper way what's
going on there.
Kirill Eremenko: Gotcha. Very, very interesting advice, thank you for
sharing that. I think everybody has their own ways to
get inspiration for pushing themselves forward and
learning more, and that's definitely a valid suggestion
for our listeners as well. One more thing I wanted to
ask you: what kind of tools would you recommend
starting with for somebody who's just getting into this
space, or considering getting into it?
Brian Dowe: For me, Python was definitely my language of choice. I
know that R is also very big in the space, but I had
already done some work in Python, so I was a little bit
familiar with it, and I do think it's a good place to
start. Scikit-learn, which I've mentioned a couple of
times, is, I think, a really good library to start with as
well, because there are a lot of different machine
learning models that it accounts for. So it can be
really easy to swap out different models, try different
things and see what predicts better.
Brian Dowe: Pandas is really important in Python for working with
your dataset. A lot of the time I've seen it used in
tutorials just for importing the dataset, but there is a
lot more you can do with it. You can do joins in a
similar way to how you would in SQL. I think SQL is
also really important to know on a practical level;
that's something I'm working to build my skills in,
because a lot of the time your data can be stored in
different databases, maybe something in a SQL
database, maybe something in a NoSQL database like
MongoDB, and you need to know how to run queries
in any given situation where your data might be.
Brian Dowe: So being able to pull data from different sources and
arrange it in a way that's comprehensible is definitely
valuable. SQL is definitely something you'd see a lot
in practice, and Mongo or other NoSQL databases are
picking up a lot of steam as well. So understand basic
queries to pull your data. Once you get your data into,
in my case for example, a Python script, Pandas can
be really useful to manipulate the data in any way
that you couldn't or didn't with the previous methods
before it gets into your script.
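The SQL-style joins Brian mentions look like this in Pandas; the two tables here are invented for illustration, standing in for data pulled from separate databases.

```python
import pandas as pd

# Two made-up tables, like ones pulled from different sources.
users = pd.DataFrame({"user_id": [1, 2],
                      "role": ["parent", "teacher"]})
downloads = pd.DataFrame({"user_id": [1, 1, 2],
                          "worksheet": ["w1", "w2", "w3"]})

# pd.merge plays the role of a SQL inner join:
#   SELECT * FROM downloads JOIN users USING (user_id);
joined = downloads.merge(users, on="user_id", how="inner")
```

The `how` parameter also covers left, right, and outer joins, mirroring the corresponding SQL join types.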
Brian Dowe: Scikit-learn is great for modeling, and NumPy is sort
of a foundational tool in Python that ends up working
really well with a lot of the other tools I've mentioned,
so that is very valuable. If you are looking to get into
the spectrum of neural nets, Keras is really useful.
That's something I've done a decent amount of work
with in my spare time. You can create a neural net in
just a few lines of code and swap out different
techniques, like different loss functions or activation
functions, to change the structure that you're working
with really easily. So it's a great tool for people who
are starting out with neural networks, to try out
different things and see how different structures of
your network can achieve different results. So yeah,
those are probably the tools I've worked with the most
while trying to teach myself, and I'm still very much a
novice with a lot of these things, but it's been a very
useful way for me to get started and dive in.
Kirill Eremenko: Gotcha. Well, thanks so much, that's some great
advice and a great list of tools. And Brian, we've come
to the end of the podcast in terms of time. Thank you
so much for coming on the show and sharing all these
insights. I think it was really cool; we discussed your
journey from software development to machine
learning, the different stages of building, deploying
and maintaining a model, and, as we just talked
about, getting into the space of data science, how
easy it is and what challenges people face. Before we
wrap up, I did want to ask you, what is the best way
for our listeners to get in touch with you, to follow
your career and maybe learn more things from you?
Brian Dowe: The best way to get in touch with me would be
LinkedIn, just linkedin.com/in/Brian-Dowe. If you
message me on there, I see and respond to all my
messages on LinkedIn.
Kirill Eremenko: Awesome. All right, cool. Well, that's where you guys
can find Brian. And before we finish up today, one
last question: what is your favorite book that you can
recommend to our listeners to empower their careers?
Brian Dowe: One thing I've already talked about a lot is Machine
Learning A to Z, and I just want to ... well, first off,
thank you so much, Kirill, and also to Hadelin, for
putting that out there. It was the course that got me
started on this journey and I can't thank you enough
for that. But specifically why I'd recommend it: it
gives you a lot of tools to work from. You learn not
only how to code a lot of algorithms, but in Kirill's
intuition tutorials he talks a lot about good use cases
for each algorithm, and he gives an overview of how
the algorithm works and what it's doing under the
hood.
Brian Dowe: In my case, being able to identify the Apriori
algorithm as the best tool to solve the problem I was
working on was directly connected to Kirill's
explanation of what that algorithm is used for. And so
I think even if you are just starting out, what it gives
you is a roadmap of different tools you can use,
different algorithms you can work with, and the use
cases they apply to. It empowers you to be able to at
least look at a problem and think, oh, I can think of
some algorithms that would fit this. Even if it's as
simple as, oh, this is a regression problem versus a
classification problem, it's like, okay, now I've
narrowed down the list of things I need to look into
that could be a potential solution. And I was very
pleased with how quickly I was able to do that after
taking the course. So, especially if you're just starting
out in the field, I would highly recommend it.
Kirill Eremenko: Thank you so much, that's so nice to hear. And you
actually gave me an idea, because usually for this
question I expect a book, and you recommended a
course, and it got me thinking that we should release
this course as a book. Machine Learning A to Z as a
book, I think that would be pretty cool for people to
get their hands on as well, like a supplement. Thanks
for the idea; I'll talk to Hadelin about it.
Brian Dowe: Sounds good.
Kirill Eremenko: Yeah. Well, thanks so much for coming on the show.
It's been a huge pleasure and I'm sure a lot of listeners
got so much value out of it. Once again, thank you so
much.
Brian Dowe: Yeah, thank you so much for having me.
Kirill Eremenko: So there you have it, ladies and gentlemen, I hope
you enjoyed this podcast. We discussed quite a lot of
different topics, and I'm sure you'll agree that Brian is
creating a very inspiring career path for himself: how
he's leveraging the skills he has to be better at data
science, how he's leveraging data science to help with
the work that he's doing, how he's bringing data
science and machine learning into the company where
he's working and helping them create better products
and services, serve their customer base even better
and derive value out of their data even more
efficiently.
Kirill Eremenko: My personal favorite part of this podcast was the
discussion we had about modeling: the development,
deployment and maintenance stages of any model's
lifecycle. And personally, I really enjoyed the whole
brainstorming part we had about maintenance, how
you need to think about it and how you can actually
go about maintaining a model. I think we came up
with some interesting ideas. And just in general, you
can see how discussing things with your colleagues
and peers can be very helpful, both in terms of
sharing ideas with others and in terms of coming up
with new ideas in the process. I think that's what we
saw in this podcast.
Kirill Eremenko: On that note, you can find all of the links for this
episode at www.superdatascience.com/215. That's
also where you'll find the URL for Brian's LinkedIn, so
connect with him there, follow his career, see what he
gets up to, and maybe send him a message if you have
any questions about modeling, or about integrating
data science and machine learning into your career in
a similar way to what he did.
Kirill Eremenko: And there we go. Hope you enjoyed this podcast. If you
did, make sure to leave us a review on iTunes. That
would be very helpful for us to spread the word and
get more people involved so that they know that there
are valuable insights that they can pick up from this
podcast and from our guests. Apart from that, thank
you so much for being here today and sharing this
hour with us. I can't wait to see you back here next
time. And until then, happy analyzing.