Kirill: This is episode number 47 with Deep Learning Expert
Hadelin de Ponteves.
(background music plays)
Welcome to the SuperDataScience podcast. My name is Kirill
Eremenko, data science coach and lifestyle entrepreneur.
And each week we bring you inspiring people and ideas to
help you build your successful career in data science.
Thanks for being here today and now let’s make the complex
simple.
(background music plays)
Welcome everybody back to the SuperDataScience podcast.
Super excited to have you on board, and today by popular
demand, we have a returning guest, Hadelin de Ponteves.
Hadelin was on the show for the first time about 6 months
ago, that was episode 2, and that was when we just released
the Machine Learning A-Z course. And now he's back, and
this time we've just released the Deep Learning A-Z course.
And so what is deep learning all about? Well deep learning is
an advanced branch of machine learning where we use
algorithms called neural networks to mimic the human
brain in order to be able to solve very complex problems.
And our goal with Hadelin in creating Deep Learning A-Z
was to create a course that has a robust structure and really
covers topics in a simple manner that is accessible to
anybody. So you don't have to be an expert in mathematics
or in programming or in anything else for that matter. You
just need a basic background in high school maths to
understand this course and to follow along.
And that's exactly what we created. It's released now and it
was actually also featured on Kickstarter, where it had
immense support from backers and we're very excited about
that. We're very excited to bring this course to the world and
today we're going to run through all the six different models
that we discuss in the course and give you a quick
breakdown.
And finally, for those of you who don't know Hadelin, I
wanted to mention that Hadelin has experience in deep
learning from Canal+, which is a competitor of Netflix, and
as well as that, Google. So this is definitely a person who
knows his way around deep learning and we're going to
learn quite a lot in today's podcast. So I'm very excited for
you to hear this episode and get your glimpse into the world
of deep learning and the different models that exist there.
And without further ado, I bring to you Hadelin de Ponteves.
(background music plays)
Welcome everybody and welcome, welcome, welcome Hadelin
back to the show. How are you going, my friend?
Hadelin: I'm doing very well, thank you. I'm very happy to be back.
Kirill: Awesome. Well you definitely should be because 12,000
views as of today, your podcast, your first episode has
12,000 views. How do you feel about that?
Hadelin: Wow, that's amazing! I didn't expect that many views, so I
am very happy about that and I hope this has helped some
people and that people could be inspired from it.
Kirill: Yeah, I definitely hope so too. And I'm actually sure of it, and
it's raised some very interesting debates, especially the
podcast was about machine learning. But what raised
debates was your health, since you mentioned you were
sleeping 3 hours a day for the past 3 years.
Hadelin: I know, right? Some people were concerned! I got messages on
LinkedIn telling me that I should be careful! Thank you very
much for all of them.
Kirill: Yeah, that was fantastic. Everybody worried about Hadelin
out there, he's still alive. He's still fine, and as always, very
energetic. So, mate, what have you been up to? Or what
have we been up to over the past couple of months? What
projects have we been working on?
Hadelin: Well we made the deep learning course! The big brother of
the machine learning course. So yeah, this one is I think
more powerful because we dive deep into some more
advanced techniques and we code some more advanced
stuff, like we use classes and objects to implement some
deep learning models from scratch. So I think it is quite new
and people will still improve their skills a lot even after doing
the machine learning course. So I think it's very
complementary and it's basically taking things at the next
level.
Kirill: Exactly. Exactly. And also I wanted to point out here that
our goal was to create the most disruptive course on deep
learning and really collate a lot of information from pretty
much everywhere and put in our knowledge and experience
into this course and our view on things and put together,
most importantly, not just one, not just two, not just three,
but six different models on deep learning. And I think we
finally can say that we have completed this course and it's
there and people are going through it and we're very, very
excited about it.
Hadelin: Yes, and actually I finished implementing the last, the very
last model of this course yesterday. It was the Boltzmann
Machine, which took 14 tutorials. And I was really happy to
finish it because it was quite a challenge. Because this is
one of the most advanced models in this course because it's
a probabilistic graphical model so it handles a lot of
probabilities and we had to dive into the MCMC, Markov
Chain Monte Carlo techniques, with Gibbs sampling, with
the random walks, so a lot of very cool mathematical
concepts, but we did it yesterday and I was very happy to
finish on that note.
Kirill: Fantastic, and congrats on that. It was definitely a big
course to tackle. And for everybody listening today, we're
going to talk about deep learning. This podcast is dedicated
to deep learning and summarising and running through all
of the things that we have covered in the course. And just to
give you, even if you're not taking the course, just to give
you a feel and taste for what deep learning is about, what
type of models exist out there, what type of approaches,
techniques, what are the use cases and applications. And so
we'll be running through six different models of deep
learning, giving you our comments on that and this is going
to be quite an exciting podcast. Really looking forward to
this session. How about you, mate? You excited about this?
Hadelin: Yeah, very excited.
Kirill: Alright. Ok. So let's kick things off with the very first, very
basic model, the artificial neural networks. It goes into the
foundation of all of the deep learning concepts and
principles and outlines everything there. And probably we'll
start off by saying that in artificial neural networks -- oh, by
the way, if you're listening to this podcast and you haven't
taken the course yet, then probably you should know that in
this course, just like in the machine learning course, I was
doing the intuition tutorials and Hadelin was doing the
practical application tutorials. So you might hear us
comment on the deep learning methodologies from both
sides exactly in that manner.
And in terms of artificial neural networks, we're trying to
model the human brain. So we're creating this structure
which is full of neurons which are interconnected. And the
fascinating thing, just by creating this course, I personally
learned a lot. Especially about the field of neuroscience. I
found out that in the brain we have 100 billion different
neurons and each one of them is connected to as many as
a thousand other neighbours. So I just wanted to
get your comment on that, Hadelin. What are modern deep
learning neural networks like? Are they anywhere close to
that size?
Hadelin: Well, at least that’s our goal. We are trying to build some
models that mimic the processes in the human brain. But,
of course, we’re not there today. As you said, there are
billions of neurons in the human brain and there are lots
and lots of connections between the neurons. So far, when
we talk about artificial deep learning, the artificial neural
networks we make when we solve some problems, they
contain several, maybe dozens of layers at most. So we’re
very far from what’s happening in the brain, but we’re trying
to mimic what’s happening in the brain, we’re trying to
reproduce the structure and the connections by adding
some mathematics into it, you know, to make some relevant
models that can solve complex problems with lots of non-
linear relationships. Only models that are close to how the
brain works can solve that kind of problems.
Kirill: Okay. And why is that? That’s actually a good segue into
what’s the whole point of deep learning? Why can’t we just
stick to machine learning and solve all of our problems with
machine learning?
Hadelin: Because basically there are limitations in machine learning.
In machine learning we can solve a lot of problems, but
when the problems become very complex because, you
know, problems are defined by their relationships, whether
they are linear, non-linear, how complex are the non-linear
relationships, and when we are reaching some high level of
complexities, because with deep learning we can extend the
complexity by adding some layers and some neurons, we can
basically solve more and more complex problems thanks to
the fact that we can add these layers and these neurons.
With machine learning, well, you have some fixed models;
you cannot really extend the models of machine learning,
although you could for XGBoost, for example. XGBoost is
actually another great model that is used to solve very
complex problems. That is because you can add some trees
in XGBoost because XGBoost is like a Random Forest but in
a much more advanced construction. So that’s the thing –
you can add some level of complexities by adding some trees
or layers in the models and that’s how deep learning can
solve complex problems as opposed to machine learning.
Kirill: Okay, gotcha. So, basically, at some point you reach a level
of complexity. For example, recognizing objects in an image.
That’s pretty complex, right? And not just recognizing
objects from an algorithm which tells you “Look for this type
of pixel,” but automatically learning how to recognize objects
from many, many thousands of images. That’s a complex
problem.
Hadelin: That’s right. That’s a very complex problem. Not only do you
have to understand some patterns in thousands and
thousands of images, but also you have to understand some
patterns in the pixels. And there are thousands and
thousands of pixels so that makes the problem really, really
complex and, of course, this is not solvable with machine
learning, classic machine learning.
Kirill: Okay, gotcha. So, going back to ANNs, we introduced a
couple of concepts there, and I know it’s going to be
extremely—like, the course is how long, 20 hours or
something so far?
Hadelin: Yeah, little more than 20 hours. We’re going to add some
more tutorials.
Kirill: Yeah, it’s going to be extremely hard to convey some of these
topics in the podcast. But just quickly, we introduced a few
concepts and probably the key one is the activation function.
What happens in the activation function? Can you tell us a
bit about that?
Hadelin: Okay. First of all, there are different kinds of activation
functions. We have the rectifier activation function that is
used to break the linearity because that’s the whole point of
it. We are trying to solve non-linear problems. In order to be
able to solve these non-linear problems, we have to break
the linearity between one hidden layer to another hidden
layer. That’s what the rectifier function is for and actually
you explain it very well in one of your intuition lectures.
And then we have the sigmoid activation function, which is
another kind of activation function, that is used to output
the predictions in terms of probabilities. Instead of returning
an exact outcome, for example a binary outcome 0 or 1, you
will use the activation function to model the probability of an
outcome. So, for example, that will return 0.8, meaning that
the outcome will have an 80% chance of being the right
prediction.
And then you have some other activation functions, but
basically the idea of an activation function is to activate the
neuron. That’s used to activate certain neurons in the
neural networks with a certain weight, and the higher the
weight, the more relevant the neuron will be.
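To make the two activation functions mentioned above concrete, here is a minimal Python sketch. It is only an illustration, not code from the course: the function names and the sample inputs are assumptions chosen for the example. The rectifier outputs zero for negative inputs and passes positive inputs through unchanged, which is what breaks the linearity; the sigmoid squashes any input into a value between 0 and 1 that can be read as a probability.

```python
import math

def rectifier(z: float) -> float:
    """Rectifier (ReLU): 0 for negative inputs, the input itself otherwise."""
    return max(0.0, z)

def sigmoid(z: float) -> float:
    """Sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weighted sums arriving at a neuron (illustrative values only).
for z in (-2.0, 0.0, 1.5):
    print(z, rectifier(z), round(sigmoid(z), 3))
```

A sigmoid output of 0.8, for instance, corresponds to a weighted sum of about 1.39 (the natural logarithm of 4).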
Kirill: Yeah, gotcha. That is very similar to what’s going on in the
human brain, in our brains. As we’re just talking now,
what’s going on is neurons are firing up, they’re sending an
electrical impulse to the following neuron, then that neuron
is getting electrical impulses from many different neurons
around it, so up to a thousand different neurons. It’s
combining all of that, it’s making a decision whether it needs
to fire up or not, and then it’s passing on (or not) an
electrical signal of varying intensity to the next neuron and
so on.
And when you combine all of that, you have this whole huge
army of neurons sending around electrical signals and that
is what thought is, that is what all of these ideas that we
have, all of these concepts and our interaction with the
world, all of our senses, they all translate into that. It’s
fascinating when you think about it. All the thoughts that
we have are basically just electrical signals running around
in our heads.
Hadelin: That’s right, yes. That’s fascinating. And it’s fascinating that
we’re managing to reproduce that with the models that we
make ourselves.
Kirill: Yeah, totally. That’s very important, that we mention that.
So we have neurons in the artificial neural networks and we
have activation functions which kind of connect the neurons
and facilitate the interaction between the neurons. So in the
artificial neural network, we have three types of layers. Tell
us about that. What layers do we have in the ANN?
Hadelin: Okay. Of course, let’s start with the input layer. The input
layer is the layer that receives the observations. For
example, let’s say we’re trying to predict if some customers
are going to leave or stay in a company or leave or stay in a
bank. Well, the input layer would get the information of the
customers that will go through the network to be able to
then make some predictions. That’s the input layer. It just
receives the observations.
And then we have the hidden layers, and that’s where
everything happens. That’s where the learning happens.
That’s where the model is trying to learn how to make some
correlations between the information of the input layer and
the output, which is the final prediction. And this final
prediction is going to be compared to the real outcome and
that’s how the model is going to learn, because it’s going to
compare the prediction to the real outcome and, according
to the mistake it might make, it will correct what happened
before. That’s backpropagation. And backpropagation will
then correct the weights so that it can learn some better
correlations next time. So after the hidden layers we have
the output layer that gets the final prediction.
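The compare-and-correct loop described here can be sketched on the smallest possible network: one input, one hidden neuron and one output neuron. This is a rough illustration under stated assumptions, not the course's implementation: the starting weights, the learning rate, the squared-error measure and the single training example are all arbitrary choices made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, target, w1, w2, lr=0.5):
    """One forward pass, one backward pass, one weight update."""
    # Forward pass: input -> hidden neuron -> output neuron.
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    error = 0.5 * (y - target) ** 2
    # Backward pass (chain rule): push the error from the output back to w1.
    delta_out = (y - target) * y * (1.0 - y)
    grad_w2 = delta_out * h
    delta_hidden = delta_out * w2 * h * (1.0 - h)
    grad_w1 = delta_hidden * x
    # Correct each weight a little, against its gradient.
    return w1 - lr * grad_w1, w2 - lr * grad_w2, error

w1, w2 = 0.1, -0.2  # arbitrary starting weights (assumption)
errors = []
for _ in range(200):
    w1, w2, err = train_step(x=1.0, target=1.0, w1=w1, w2=w2)
    errors.append(err)

print("first error:", errors[0], "last error:", errors[-1])
```

Each pass compares the prediction with the real outcome and nudges both weights, so the error at the end of the loop is smaller than at the start.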
Kirill: And the more hidden layers we add, the more complex
our neural network becomes. It’s harder to train, but it
might be able to solve more complex problems. Is that right?
Hadelin: Yes, that’s correct.
Kirill: Okay. And you mentioned a couple of interesting terms.
First of all, if I’m not familiar with neural networks I might
ask the question, “What is the purpose of building a neural
network if on the output you’re comparing the results to the
real outcome so that means you already have the real
outcome? What’s the whole point of modelling an outcome?”
Can you comment on that?
Hadelin: Sure. That’s because for any machine learning model, or any
deep learning model, there is a training phase and a testing
phase. In order for the model to learn something, it needs
the real outcomes to learn the correlations. Because if it
didn’t have the real outcomes, it couldn’t learn anything. It’s
like when a student is practicing for an exam and he’s
learning a lesson or a topic: he needs the real outcomes when
he’s training so that he can evaluate how well he understood the
course.
Well, that’s the same for a deep learning model. It needs a
training phase with real outcomes so that it can make some
predictions itself, but then it has to compare these
predictions to the real outcomes so that it can correct itself.
And then we have the test set. And on the test set we have
some totally new observations for which we don’t have the
real outcomes and this is what pure predictions are about.
These are really pure predictions, where we don’t have anything to
compare them to.
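As a minimal sketch of the training phase versus testing phase idea, the snippet below splits a dataset into a part the model learns from and a part it never sees while learning. The helper names, the 20% test fraction and the toy data are assumptions made for illustration. In practice the real outcomes of the held-out part are often kept aside and used only afterwards to score the predictions, which is how accuracy figures on a test set are obtained.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predictions, real_outcomes):
    """Share of predictions that match the real outcomes."""
    correct = sum(p == r for p, r in zip(predictions, real_outcomes))
    return correct / len(real_outcomes)

rows = list(range(10))  # stand-in for ten observations
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```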
Kirill: Okay, gotcha. And in this case we actually compare to the
real outcomes. So some models train with the real outcomes
and some models still train, but without the real outcomes.
The two different ones are called supervised models and
unsupervised models. So artificial neural networks are a
type of supervised model and in total we looked at three
different supervised models and we looked at three different
unsupervised models.
And probably the last comment is on backpropagation.
You’ll be hearing the word ‘backpropagation’ quite a lot. It is
relevant to supervised deep learning models and basically in
summary it’s exactly what Hadelin described. You compare
it to the real output, you find the error and then you
backpropagate – hence the name backpropagation – you
backpropagate that error through the network to update the
network in very simplistic terms, and that’s how the models
train. So that was artificial neural networks. Let’s move on
to number two: convolutional neural networks. CNNs – what
are they used for?
Hadelin: CNNs are mainly used for image detection, but they have
various other applications like text recognition. How does
image detection work? Well, the CNNs are going to try to
learn something like some patterns in the pixels and that’s
how they will detect some specific features in images to be
able in the end to recognize what is in the image. For
example, in our course we’re training an algorithm, a CNN
actually, that is learning to predict whether there is a cat or
a dog in images. To do this, it will try to understand some
patterns in the pixels of the images to detect some specific
features of dogs and the specific features of cats. And that’s
how in the end it can manage to predict if there is a cat or a
dog in the image.
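To give a feel for what "patterns in the pixels" means, here is a small pure-Python sketch of the sliding-filter operation at the heart of a CNN. It is a simplified illustration, not the course's model: the 4x4 image and the hand-picked filter values are assumptions, whereas a real CNN learns its filter values during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products.

    This is the cross-correlation that deep learning libraries call convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical grayscale image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only in the middle column, exactly where the dark region meets the bright one, so this filter acts as a detector for that one feature.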
Kirill: Yeah. And the most fascinating thing for me is that it can
learn that – and we got a pretty good accuracy rate in the
course – it can learn whether it’s a cat or a dog just by
looking at lots and lots of images. How many images did we
have in the course?
Hadelin: 10,000. We’re training the CNN on 8,000 images and we’re
testing it on 2,000 images. And indeed, on the 2,000 images
we get pretty good accuracy, that is, a pretty good rate of correct
predictions.
Kirill: About what? What is it like?
Hadelin: I think we reached 82%.
Kirill: 82% accuracy?
Hadelin: Yes.
Kirill: Which is great. So, it means that CNN had a look at 8,000
images of dogs and cats which are labelled so it knows that
this folder has cats and this folder has dogs. It just looked
through them. Without anything else, no tricks or hacks, we
specified nothing. We just said, “Look at these images. These
are cats, these are dogs, and now decide for yourself what is
important for you in an image, what features are you going
to be looking for when you’re looking at new images to
decide whether it’s a dog or a cat.” And then after all the
training we gave it 2,000 images of dogs and cats and out of
those 2,000 it got 82% correct. So it identified 82% of the
images correctly, that these are dogs/cats. Without us
having to tell it how to do it, it learned the thing itself.
Hadelin: Yes, that’s right. And these are our new observations, new
images. And besides, this is without any parameter tuning,
because in the course we insist on parameter tuning and
that’s one of the homeworks we gave to the students. We let
the students work on the model to improve them, tune them.
Actually, one of the challenges is to get 90% accuracy, which
I know is possible so that’s why I gave the challenge. The
students who manage to reach that 90% accuracy, they will
have the gold medal.
Kirill: (Laughs) Has anybody gotten 90% accuracy yet?
Hadelin: Somebody got the gold medal on the other challenge, which
is for ANN. Actually, that was today, so I congratulated that
student. He got 87% accuracy, which is very good on the
other problem.
Kirill: I think I saw that message. His accuracy was even better
than in the training set. It was higher on the test set than on
the training set.
Hadelin: Absolutely. That’s the one, yes.
Kirill: Okay, gotcha. All right, very cool. So, that was CNNs – a very
brief intro into convolutional networks. And it’s very
important to understand the concept behind CNN because
that is not in its raw form what goes into self-driving cars,
but that is the direction in which things like self-driving cars
are aiming. They need to recognize pedestrians on the
streets, they need to recognize stop signs, they need to
recognize the colour of the traffic light to work completely
autonomously. That is your first step in that direction.
Would you say that is a fair summary?
Hadelin: Yeah, that’s absolutely a fair summary. Of course, CNNs are
used in self-driving cars to detect objects on the street,
which is absolutely compulsory.
Kirill: Yeah. And there was actually a challenge to see how well
computers can do to recognize different types of road signs
and right now they’re already doing better than humans. It’s
really interesting. All right, moving on to the next one:
recurrent neural networks.
Hadelin: Oh, that was quite a challenge.
Kirill: (Laughs) Yeah. For us, we spent so much time on recurrent
neural networks simply because of the challenge that we set
ourselves. We wanted to predict stock prices, but in the end
we found out that it’s too chaotic to predict with recurrent
neural networks to the extent that we attempted that
challenge. So probably if you spend more time you might be
able to find ways to do it better, but nevertheless a very
interesting type of neural network. This is a neural network
that has short-term memory. All neural networks have long-
term memory and that’s when you train them up, they
remember the structure and the configuration of the neural
network, will remember the training and then it will apply
that knowledge in the testing and hence it can recognize
dogs or cats or solve complex problems. But recurrent
neural networks actually have short-term memory, so if it’s
going through a dataset in row 51, it will remember what the
outcomes were for row 50 or row 49 or row 48 and so on.
What kind of applications does that open up for the world of
deep learning?
Hadelin: Well, there are many applications. For example, there are
applications for natural language processing. You can use
recurrent neural networks for natural language processing if
you want to predict what’s going to be the next step in a
sequence of text. So, for example, you can predict what’s
going to be the last word in a sentence and more, so you can
increase the complexity of the problem, and it can also be
used for video classification. For example, if you want to
predict what’s going to happen next in a video, then in that
case you will need to combine convolutional neural networks
with RNNs, because RNNs are basically used to predict a time
series, like what’s going to be the next step in a series of
events. So that can be used for text, as I said, or for videos
as well.
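The short-term memory idea can be sketched with a single recurrent neuron whose output at each step is fed back in at the next step. This is a bare-bones illustration under assumptions: the two weights are fixed by hand rather than trained, and there is no LSTM machinery.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8):
    """New hidden state from the current input and the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev)

def run(sequence):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h)
        states.append(h)
    return states

with_signal = run([1.0, 0.0, 0.0])  # a signal at the first step only
all_zeros = run([0.0, 0.0, 0.0])    # no signal at any step

print(with_signal)
print(all_zeros)
```

Both sequences end with the same two inputs, yet the final hidden states differ: the first run still carries a fading trace of the earlier signal. That trace is the short-term memory. In a plain cell like this one it shrinks at every step.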
Kirill: Okay, gotcha. So, RNNs are very often used in combination
with other algorithms like CNNs. And you can train up an
RNN to basically just go through like a huge amount of text
and learn from it how sentences are structured, how words
follow each other. That way you train up a deep learning
model to create sentences, in essence.
In the course we actually mention a small video, a 9-minute
film, called “Sunspring” which was entirely written by an
RNN, by specifically an LSTM, so long short-term memory
type of RNN which was trained up on thousands of sci-fi
films and then it wrote its own sci-fi film and people actually
acted it out, professional actors acted it out, and it
participated in the – I think it was the Sci-Fi London Film
Festival and it was rated – I think it was in the top 10. So
that’s pretty exciting. That’s where the world is going.
Hadelin: Yes. That’s a pretty exciting application.
Kirill: Yeah. And you can think of lots of other applications.
Basically, RNNs are there to facilitate the short-term
memory which we humans have. It’s so powerful. We don’t
just have long-term memory. If you just had long-term
memory then you wouldn’t remember the start of this
podcast and would be very sad. You’d be sitting there
thinking “What are we talking about? What is this whole
topic right now?” Short-term memory is a very important
tool that we have as humans and therefore, why would we
deprive deep learning models of that concept? And that’s
why RNNs exist. And probably the last thing on RNNs I
wanted to mention here is LSTMs. Can you tell us a bit
about LSTMs? This is actually a huge breakthrough for the
world of RNNs. What is it all about?
Hadelin: Yes, this is actually the disruption in RNNs because, as you
said, the classic RNNs have short memory, but the LSTM is
the first RNN to be able to learn long-term relationships.
Basically, it’s the first RNN to have long memory. That’s why
it’s called LSTM – long short-term memory. Basically, that’s
the most powerful RNN model and that’s the one we
implement in our course.
Kirill: Yeah. I think it was created in Germany in the 90s.
Hadelin: Yes, 1997 or something.
Kirill: Yeah, it’s pretty cool. And actually it’s very interesting,
throughout the course we mentioned the creators of these
models and people who came out with them. We’ve got
Geoffrey Hinton, we’ve got Yann LeCun, Yoshua Bengio. All
of these scientists, I think they’re all from Canada, if I’m not
mistaken.
Hadelin: Yann LeCun is French.
Kirill: Yann LeCun is French? Okay. That’s right. But now they live
in Canada/America and so they have their own little circle
which I think Yann LeCun calls the conspiracy of deep
learning. It’s very interesting. And then all of a sudden you
have somebody from Germany creating the LSTM. When I
found out about them, where the LSTMs came from, it was a
pleasant surprise that the whole world is actually
contributing to this movement. It’s great.
Okay, next we are moving into the world of unsupervised
deep learning. First of all, can you tell us a bit about
unsupervised? What does that mean, ‘unsupervised’?
Hadelin: Well, unsupervised means that basically you don’t have an
outcome to compare your prediction to, so basically what
you have to do is identify yourself some structure in the data
that will become your future predictions. So, basically, when
we’re starting with unsupervised learning, we don’t have a
dependent variable. We don’t have something that we want
to explain. But by identifying some segments or some
clusters or some structures in the data, we will eventually
end up with such a dependent variable. And actually, in the
course we highlight this transition from unsupervised to
supervised because once we complete our unsupervised
deep learning model, we end up with a dependent variable
that can lead our model to become a supervised deep
learning model.
Kirill: Yeah, that’s definitely a very powerful thing. Unsupervised
models, I think they’re more complex generally because each
one of the models that we discuss has its own very unique
approach to learning, kind of a trick or a hack or a whole
new concept that it introduces in order to bypass the fact
that it doesn’t have this output to compare to.
We start off by talking about SOMs – self-organizing maps.
These neural networks were first created in I think 1982 by
Teuvo Kohonen, a Finnish professor, and they are very
interesting. By far, they are the simplest out of the whole
course just because they are so elegant and the idea behind
them is so straightforward, even the mathematics. We don’t
talk about mathematics in the course, we don’t go deep into
the mathematics so we don’t get bogged down, but the
mathematics driving the other models in the course, they’re
pretty complex. But for self-organizing maps, they are very
straightforward, they’re very easy to code even from scratch.
We talk about self-organizing maps and we find out how
exactly they work. So what can you tell us about SOMs,
Hadelin?
Hadelin: Okay. First, what is the purpose of SOMs? Well, it’s to detect
some features in very complex data: a high-dimensional
dataset, again full of non-linear relationships. It
will detect some features which we have absolutely no idea
what they are, but it will detect some features inside this
data. And how will it do that? Well, it will do that by
reducing the dimensionality of the dataset. That’s why at the
beginning we start with a lot of the features that are
columns in the dataset, and a lot of observations, and
eventually we end up having this two-dimensional map. And
on this two-dimensional map we can see some neurons that
we call the ‘winning nodes’. Basically each cluster, or each
winning node, is detecting a certain feature. That’s pretty
powerful because even by having a very complex dataset at
the beginning, we end up with this cool two-dimensional
map that is very visual and that we can use directly to detect
some specific features or see some specific clusters,
segments in the dataset. And actually we implement a self-
organizing map to detect fraud and that’s because fraud is a
specific feature detected by the SOM.
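A self-organizing map can indeed be written from scratch in a few lines. The sketch below is a simplified version for illustration, not the course's code: the 3x3 grid, the fixed learning rate, the fixed neighbourhood radius and the toy data are all assumptions (practical implementations usually shrink the learning rate and radius over time). For each observation it finds the winning node, then pulls that node and its neighbours on the grid towards the observation.

```python
import math
import random

def best_matching_unit(weights, sample):
    """Return (row, col) of the node whose weights are closest to the sample."""
    return min(
        ((r, c) for r in range(len(weights)) for c in range(len(weights[0]))),
        key=lambda rc: math.dist(weights[rc[0]][rc[1]], sample),
    )

def train_som(data, rows=3, cols=3, epochs=50, lr=0.5, radius=1.0, seed=0):
    """Train a rows x cols map on data (a list of equal-length feature lists)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for _ in range(epochs):
        for sample in data:
            br, bc = best_matching_unit(weights, sample)
            for r in range(rows):
                for c in range(cols):
                    # Nodes close to the winner on the 2D grid are pulled harder.
                    grid_dist_sq = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
                    node = weights[r][c]
                    for k in range(dim):
                        node[k] += lr * influence * (sample[k] - node[k])
    return weights

# Toy observations with three features each, scaled to the range 0..1.
data = [[0.1, 0.1, 0.1], [0.0, 0.2, 0.1], [0.9, 0.8, 1.0], [1.0, 0.9, 0.9]]
som = train_som(data)
for sample in data:
    print(sample, "-> winning node", best_matching_unit(som, sample))
```

Each three-feature observation ends up represented by a single position on the two-dimensional grid, which is the dimensionality reduction described above.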
Kirill: Yeah. It’s a very visual type of algorithm.
Hadelin: Yeah, it’s very visual.
Kirill: So, you might have a huge dataset with lots and lots of
columns which there is no way you would be able to
visualize in a concise way, but then the SOM allows you to
reduce that, put it on a map, and then you can see all of the
connections.
In the intuition side of things we walk through an
application of SOMs to the – I think the U.S. Senate or
something, part of the United States government, how they
vote, and you can see how they’re all plotted in a self-
organizing map and then you can see how to read that map
and understand what it’s saying about how they’re voting,
Republicans versus Democrats and so on. In our practical
tutorials, we actually have a very interesting application. Tell
us a bit more. What did you prepare for us in the practical
side of things to kind of showcase how SOMs can be used?
Hadelin: Okay. So, the dataset contains some credit card
applications. Some customers – well, some not-yet customers
– some people applying to have an advanced credit card at
their bank, and basically to apply for this credit card they
need to fill out a paper and provide a lot of information like their
credit score or other types of financial information like their
estimated salary.
And basically at the end of the application there is a yes or
no whether they got approved for the credit card. Of course,
like in any application, there are some people that can
cheat, and the goal is to find the potential cheaters, to detect
the potential frauds in the applications. So we have no idea
how to visualize that at the beginning because we have a lot
of information, all the information that was filled in by the
customers, but the SOM will manage to detect the frauds by
detecting some specific features in the self-organizing map.
I’m not going to say right now what are going to be the
features in the SOM because it’s actually something that the
students have to guess at some point, but these features,
these frauds, are pretty specific, are pretty visual in the self-
organizing map so we can really identify them well.
Kirill: Yeah. I found that that was a very cool application of self-
organizing maps. When you came up with that challenge for
the course I was very excited because I’ve worked with fraud
analytics before, back when I was at Deloitte, but I’ve never
actually seen self-organizing maps applied to solve the
problem and I think that this approach is very interesting
because it does allow you to go through lots and lots of data
without having to supervise the model, without having to
come up with these inputs at the start and therefore it can
really find it. It’s like a robot or a computer looking for
people who have committed fraud. Like, what are a human’s
chances against a machine? Zero, right? Pretty much zero.
Hadelin: Zero. Let’s hope this doesn’t go too far.
Kirill: Yeah. Let’s hope it doesn’t turn into World War III or
something like that.
Hadelin: Exactly.
Kirill: Yeah. And in terms of SOMs, what I found when I was
creating the intuition tutorials, I found that they’re really
very different to all the other five models that we discussed
and I would never have thought of classifying them as a
deep learning model. I always thought SOMs are just a type
of dimensionality reduction model. I think the lack of
backpropagation or the lack of more interconnectivity
between neurons makes me kind of think that in some way
they’re just a bit too simple to be considered as a deep
learning model. What are your thoughts on that? Is it maybe
just because they’re so elegant that they give you this
impression?
Hadelin: Well, first of all, I agree with you. I had sort of the same
feeling when I started studying about SOMs. But with no
hesitation I wanted to include them in the course because
they actually involve neurons. The points in the grids are
actually neurons and neurons are attracting other neurons
around them according to how they are similar to these
neurons. Of course these are very different from the other
neural networks that we implement in the course, but this is
still a neural network in two dimensions having several
neurons. It’s considered as a neural network, and that’s why
it’s considered as deep learning. But you’re definitely right
that this might be the simplest deep learning model of all
these neural networks.
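A minimal sketch of the "neurons attracting other neurons" idea described above, in plain Python. It is not the course's implementation: the function names, the Gaussian neighbourhood, and the toy 1x2 grid are all assumptions chosen for illustration. Each neuron is just a weight vector on a grid; the closest neuron to an observation (the best matching unit) moves toward it and drags its grid neighbours along.

```python
import math

def best_matching_unit(grid, sample):
    """Return (row, col) of the neuron whose weights are closest to the sample."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            dist = sum((w - x) ** 2 for w, x in zip(weights, sample))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def som_update(grid, sample, lr, radius):
    """Pull the best matching unit and its grid neighbours toward the sample."""
    br, bc = best_matching_unit(grid, sample)
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            # Gaussian neighbourhood (an assumed choice): neurons that are
            # closer to the winner on the grid move more.
            grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
            influence = math.exp(-grid_dist2 / (2 * radius ** 2))
            for i, x in enumerate(sample):
                weights[i] += lr * influence * (x - weights[i])
    return br, bc

# Hypothetical 1x2 map with two-feature neurons and one made-up observation.
grid = [[[0.0, 0.0], [1.0, 1.0]]]
winner = som_update(grid, [0.2, 0.0], lr=0.5, radius=1.0)
print(winner, grid)
```

There is no backpropagation anywhere in this update, which matches the point made in the conversation: the map organizes itself only through repeated "move toward the observation" steps.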
Kirill: Gotcha. But even saying that it’s simple, the applications are
massive. Maybe the simplicity facilitates more applications.
We briefly mentioned a paper in the course about how SOMs
are used to analyse the probability density function of
photometric redshifts, so basically an application in
astronomy. Then we look at an example of how it’s applied to
World Bank data to look at different countries and their
prosperities, or poverty and then plot that on a map – and
plot that on an actual world map. So applications are
immense, limitless for self-organizing maps, and maybe that
has something to do with the simplicity.
Hadelin: That’s right.
Kirill: But moving on, next one is our favourite or probably your
favourite algorithm – the Boltzmann machine. By far, hands
down, the most complex algorithm that exists on this planet.
It was so much fun preparing tutorials, for me anyway. I
know that you did 14 tutorials about Boltzmann machine
and spent like over a week on that just recording.
Hadelin: Yeah, I was very excited recording that.
Kirill: How did that go? Tell us about Boltzmann machines. What
are they all about?
Hadelin: Well, first of all, they are very broad. The most important
thing is that this is a break from what we had before
because what we had before were sequential models with a
sequence of layers. We started with the input layer and then
we had some sequence of hidden layers and then eventually
the output layer.
And here, that’s totally different. We now have some neurons
and all the neurons are connected to each other and there’s
no longer input neurons and an output layer. Actually, what
happens is that we have some visible nodes and some
hidden nodes and basically the graph – because that
becomes a computational graph – the graph is updating
itself and the input nodes are getting updated so that in the
end they become the output nodes, but there is no output
layer.
It’s like a graph full of probabilities, because basically
Boltzmann machines are probabilistic graphical models, and
this is a graph that is updating itself and in the end it’s
maximizing what we call the likelihood, which allows us to make
the nodes all relevant to each other with some relevant
outputs, which in the end are predictions. Because of all
this complexity, because that involves a large number of
computations, we have what we call the ‘restricted
Boltzmann machine’, where we filter the connections
between the nodes. And in the restricted Boltzmann
machine, all of the nodes are no longer connected to each
other. Only the visible nodes are connected to the hidden
nodes and vice versa.
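To make the restricted structure concrete, here is a small sketch in plain Python. It is an illustration under assumptions, not the course code: the class name, the initialization, and the layer sizes are hypothetical, and no training (such as contrastive divergence) is shown. The point is only that the weights form one visible-to-hidden matrix, with no visible-to-visible or hidden-to-hidden connections, and that the visible nodes are updated in place instead of feeding a separate output layer.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class RestrictedBoltzmannMachine:
    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        # One weight per (hidden, visible) pair: the only connections allowed.
        self.weights = [[self.rng.gauss(0, 0.1) for _ in range(n_visible)]
                        for _ in range(n_hidden)]
        self.hidden_bias = [0.0] * n_hidden
        self.visible_bias = [0.0] * n_visible

    def hidden_probabilities(self, visible):
        """P(h_j = 1 | v) for every hidden node."""
        return [sigmoid(b + sum(w * v for w, v in zip(row, visible)))
                for row, b in zip(self.weights, self.hidden_bias)]

    def visible_probabilities(self, hidden):
        """P(v_i = 1 | h) for every visible node."""
        return [sigmoid(b + sum(self.weights[j][i] * h for j, h in enumerate(hidden)))
                for i, b in enumerate(self.visible_bias)]

    def sample(self, probabilities):
        """Draw a binary state for each node from its probability."""
        return [1 if self.rng.random() < p else 0 for p in probabilities]

    def gibbs_step(self, visible):
        """Visible -> sampled hidden -> updated visible probabilities."""
        hidden = self.sample(self.hidden_probabilities(visible))
        return self.visible_probabilities(hidden)

# Hypothetical binary ratings for six items (1 = liked, 0 = not liked).
rbm = RestrictedBoltzmannMachine(n_visible=6, n_hidden=3)
print(rbm.gibbs_step([1, 0, 1, 1, 0, 0]))
```

Because the weights here are untrained, the printed probabilities are not meaningful predictions; training would adjust the weights so that these reconstructions become likely under the data.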
Kirill: Yeah. And it’s just a fascinating type of model. I’ll give you
an example of how it’s so different to everything else that
we’ve seen before. This is an example that we use in the
intuition tutorials in the course. Imagine a nuclear power
plant which generates electricity. That in itself is a system, is
a huge system which has lots of parameters. You have the
speed of a wind turbine, you have the temperature inside the
main core of the power plant, you have the pressure in
certain water pumps. You have lots of parameters that
govern how this facility is functioning. But there’s also lots
of parameters that are out of your control, parameters that
you can’t measure, which might be, for instance, the
moisture of the soil in a certain location. There’s so many
things, so many moving parts in this whole system, you
can’t measure them all at once.
What a Boltzmann machine does, and that’s why it’s a
probabilistic model, it generates all of these different states
of this nuclear power plant just randomly and then based on
your inputs, you’re able to tweak the Boltzmann machine for
it to be a better representation of your specific nuclear power
plant. Not just any nuclear power plant in the world,
possible or impossible, you kind of restrict it — and this is
not in any way connected with the term ‘restricted
Boltzmann machine’, those are different. Anyway, you
restrict this whole Boltzmann machine to being a
representation of your nuclear power plant and that allows
you to model very interesting things. And why is a nuclear
power plant a good example? Because you cannot model all
of the scenarios in a nuclear power plant through supervised
deep learning because in supervised deep learning you need
a training set. And you just don’t have, and it’s impossible to
have, lots of training data on nuclear power plant
meltdowns.
So if you want to model all the possible situations in which a
nuclear power plant explodes or disasters happen on it, in
order to be able to prevent them, you cannot do that through
supervised learning just because you don’t have the training
data. And that is where unsupervised models, for example,
Boltzmann machines, come in and they can really help out
with this situation because they are generating these
scenarios on their own at random and that allows you to go
venture into the scenarios that haven’t even happened in
real life.
Hadelin: Right. And besides, by giving this example you explain the
Boltzmann machines from the energy-based point of view
because actually a Boltzmann machine can be seen from two
different points of view. The first one is as an energy-based
model, so exactly as you just explained, and the second
point of view is that it’s also a probabilistic graphical model,
and that’s what we focus on more during the practical
applications. It’s good that students get to see both points of
view.
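The two points of view can be tied together in one pair of formulas. The notation below is the common textbook form for a restricted Boltzmann machine and is an assumption here, since the episode does not spell it out: v and h are the visible and hidden states, a and b their biases, w the connection weights, and Z the normalizing constant.

```latex
E(v, h) = -\sum_i a_i v_i \;-\; \sum_j b_j h_j \;-\; \sum_{i,j} v_i \, w_{ij} \, h_j

P(v, h) = \frac{e^{-E(v, h)}}{Z},
\qquad
Z = \sum_{v', h'} e^{-E(v', h')}
```

The first line is the energy-based view; the second turns energies into probabilities, so low-energy states are the likely ones, which is the probabilistic graphical model view.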
Kirill: And probably if you want to challenge yourself to do
something extremely interesting and complex at the same
time, in terms of just grasping and getting your head around
it, Boltzmann machines are the way to go. The number of
things that you’ll probably find challenging to get
your head around in the space of deep learning is very
significantly reduced, because if you understand Boltzmann
machines, then anything else is going to be a piece of cake.
That’s a very challenging topic but it’s also worth attempting
that challenge.
All right, moving on to the final model in the course: the
autoencoders. Tell us a bit more about autoencoders. Where
does the name come from?
Hadelin: Well, autoencoders are my personal favourite because —
well, I don’t know if they’re my personal favourite because I
really like Boltzmann machines too, but I like autoencoders
because basically they’re quite simple, especially after
studying Boltzmann machines, and at the same time they’re
capable of solving extremely complex problems and that’s
because they’re stacked autoencoders.
Basically, we implemented a recommender system with
Boltzmann machines that can predict binary ratings and
with the autoencoders we would take it to the next level by
predicting some ratings that are from 1 to 5, which is a more
complex problem. And yet, the autoencoder is a simpler
model because basically the simple autoencoder is
composed of three layers: the input layer that gets the
observations, the hidden layer that is a layer
with a small number of nodes compared to the input layer,
and the output layer. By putting the observations
into the [indecipherable 47:39] we’re trying to reconstruct
the input observations in the output layer.
That’s how it works and that’s why it’s called autoencoders,
because basically what happens is a two-step process. The
first step is the encoding step, when we try to encode the
observations into this hidden layer composed of fewer
neurons, and then there is a decoding step, when
we decode the hidden layer to reconstruct
the input layer. So, we’re trying to replicate the input
layer by decoding it.
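The encode-then-decode process described above can be sketched in a few dozen lines of plain Python. This is a toy illustration under assumptions, not the course's PyTorch model: the class name, the sigmoid hidden layer, the linear output layer, the learning rate, and the two made-up observations are all hypothetical. The training target is the input itself, which is what makes the model an autoencoder.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyAutoencoder:
    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                    for _ in range(n_hidden)]
        self.dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                    for _ in range(n_inputs)]

    def encode(self, x):
        """Compress the observation into the smaller hidden layer."""
        return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in self.enc]

    def decode(self, h):
        """Reconstruct the observation from the hidden layer."""
        return [sum(w * hj for w, hj in zip(row, h)) for row in self.dec]

    def loss(self, data):
        """Squared reconstruction error, averaged over observations."""
        total = 0.0
        for x in data:
            y = self.decode(self.encode(x))
            total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        return total / len(data)

    def train_step(self, x, lr=0.1):
        """One gradient step on 0.5 * squared error; the target is x itself."""
        h = self.encode(x)
        y = self.decode(h)
        err = [yi - xi for yi, xi in zip(y, x)]
        # Gradient with respect to the hidden activations, computed before
        # the decoder weights are changed.
        grad_h = [sum(err[i] * self.dec[i][j] for i in range(len(x)))
                  for j in range(len(h))]
        for i in range(len(x)):
            for j in range(len(h)):
                self.dec[i][j] -= lr * err[i] * h[j]
        for j in range(len(h)):
            delta = grad_h[j] * h[j] * (1.0 - h[j])
            for i in range(len(x)):
                self.enc[j][i] -= lr * delta * x[i]

# Two made-up observations with six features, squeezed through two hidden nodes.
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
model = TinyAutoencoder(n_inputs=6, n_hidden=2)
before = model.loss(data)
for _ in range(500):
    for x in data:
        model.train_step(x)
after = model.loss(data)
print(before, after)
```

Comparing the two printed numbers shows the reconstruction error before and after training; no separate labels are needed at any point.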
Kirill: That’s a great summary. That’s what I meant when I said
that all of these 3 unsupervised deep learning models, they
have their own ways to get around the fact that they don’t
have the data that they need to look at the real outcome.
And the way that autoencoders get around that is they make
the input be the outcome that they’re aiming toward. In a
way, they’re not purely unsupervised; sometimes they’re
called ‘self-supervised deep learning models’ because they are
in essence supervising themselves through the inputs that
they’re aiming to recreate as outputs. It’s their way of
cheating the system a little bit, but nevertheless they are
extremely powerful. And I remember during the course when
we were creating it – or just before we created it – you were
super excited that there was some breakthrough in stacked
autoencoders and you were like, “We have to include this in
the course. We definitely have to. This is all so brand new.”
Hadelin: Yes. And we actually implemented stacked autoencoders
because basically stacked autoencoders is what I just
explained but with several hidden layers. So that means that
there are several encodings and several decodings and that’s
exactly what we implement. I think we have two or three
hidden layers, but then the challenge, of course, is to change
the architecture of the model and that’s very fun. Students
will learn how to change the architecture of the model to add
more layers and to add more nodes to tune the number of
nodes and other parameters. So, they will be some sort of
artists trying to create some other structures of stacked
autoencoders. And that’s a pretty good challenge.
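As a rough illustration of what "changing the architecture" means for a stacked autoencoder, the sketch below lists the layer shapes for a symmetric design and counts its weights and biases. The helper names and the sizes (1000 inputs, hidden layers of 20, 10 and 20 nodes) are hypothetical and not taken from the course.

```python
def stacked_layer_shapes(n_inputs, encoder_sizes):
    """Return (fan_in, fan_out) for each layer of a symmetric stacked autoencoder."""
    sizes = [n_inputs] + list(encoder_sizes)
    # Mirror the encoder to get the decoder, without repeating the bottleneck.
    sizes += sizes[-2::-1]
    return list(zip(sizes[:-1], sizes[1:]))

def parameter_count(shapes):
    """Weights plus one bias per output node, summed over all layers."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)

# Hypothetical design: 1000 inputs, encodings of 20 then 10 nodes.
shapes = stacked_layer_shapes(1000, [20, 10])
print(shapes)
print(parameter_count(shapes))
```

Adding a layer or changing the number of nodes is then a one-line change to `encoder_sizes`, which is the kind of experimentation the conversation describes.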
Kirill: That really ties in with what you said on the first podcast,
which was six months ago — can you believe it has been six
months since then?
Hadelin: Time has flown.
Kirill: Yeah. And I think what you mentioned there was the artist
versus engineer. That’s two categories of data scientists.
You’re going to have the artists, somebody who definitely
can’t be replaced with machines in the near future, and the
engineer who is building these things and who’s at more risk
of being replaced by machines. And I actually found it very
interesting that in the practical tutorials you were pointing
out the areas where students or any deep learning expert
or practitioner has to apply their creativity to come up with a
new architecture or a structure for a certain model.
Hadelin: That’s right. The deep learning scientist is definitely not only
an engineer, it’s also an artist because there is no rule of
thumb in making the perfect architecture to solve a specific
problem, so the deep learning scientist has to make some
sort of art to find the best model.
Kirill: Yeah. Fantastic! That summarizes our six models. And while
we have a bit of time still left, I wanted to get your comments
on the main two tools that we used. So we definitely covered
off different things in the course, we talk about scikit-learn,
we use all of the standard libraries, NumPy and so on
because it’s a Python-based course, but the two main tools
that we use are PyTorch and TensorFlow. TensorFlow is a
Google-developed library and PyTorch is I think a Facebook
and Yann LeCun-developed library. Can you give us a few
thoughts on that? What are the advantages, pros and cons?
Why did we use both in the course? Where do you think the
world is going and what should students or anybody getting
into the world of deep learning be focusing on at this stage?
Hadelin: What I think is that at the start, when we begin
learning deep learning, the best option is to start
with Keras, which is a wrapper around TensorFlow. The big
advantage of TensorFlow is that it has Keras. That allows
you to build some deep learning models with only a few lines
of code. So basically you don’t have to implement the deep
learning models from scratch. That’s the big advantage of
this. Then we have PyTorch and PyTorch doesn’t yet have a
wrapper like Keras to implement deep learning models in a
few lines of code, but I think it’s actually more powerful than
TensorFlow because it can handle even more complex deep
learning models. These more complex deep learning models
are the dynamic graphs because if we go deeper into the
theory of deep learning, we will bump into the dynamic
graphs which are basically making a deep learning model
that is dynamic and no longer static.
The dynamic graph is a new feature that PyTorch can
handle, unlike TensorFlow. I really encourage the students
to handle both libraries because maybe they will be able to
solve some problems with PyTorch that they couldn’t solve
with TensorFlow. Actually, PyTorch is very recent so there’s
still a lot of debate about that. I think it’s not only able to
solve some very complex problems, like dynamic graphs, but
also I think it’s very practical because students will see that
when we implement the models from scratch with PyTorch,
well it’s actually very practical and intuitive. Okay, it takes
some more lines of code than Keras requires, but we can see
in the end when we take a step back at what we
implemented that it’s actually very intuitive and then very
easy to change the architecture of the model. I highly
recommend PyTorch for two reasons: it is able to solve some
complex problems, and it is very practical. But for
beginners, I recommend TensorFlow and Keras more.
Kirill: Gotcha. Very, very good. I’m very excited that we covered off
both of these in the course. In the autoencoders and
Boltzmann machine side of things, you even go into things
like how to develop your own class and how to use that in
deep learning. That’s very powerful, I think.
Hadelin: That’s a big plus of the course. Students learn the important
tricks of Python and learn the most important techniques,
like building classes, because basically I’m sure that when
they start some projects of deep learning, they will have to
implement their own model. It is actually 100% sure that
they will have to make a class. Classes and objects are very
important to understand and handle in Python and I made
sure to explain all this and what’s the use of all this and
how we can use them to build some deep learning models
from scratch.
Kirill: Okay. Thank you very much for sharing all of that and for
coming on the show again. This brings us to the end of
today’s episode. One thing I would like to ask you to wrap up
this show: what is the book that you can recommend to
people who are interested in learning more about deep
learning?
Hadelin: Okay. I highly recommend the “Deep Learning Book” by Ian
Goodfellow and Yoshua Bengio and Aaron Courville. It’s
actually a book that they can get online at
www.deeplearningbook.org. Basically, this is a really, really
good book even if they want to dive deep into the math and
the theory. It covers pretty much what we discuss in the
course, but more on the theory side. It’s a very, very good
book.
Kirill: It’s free, right?
Hadelin: It’s free, yes. And then I also recommend some other books.
Actually, on Facebook I’m part of a deep learning group
where we discuss the latest technologies in deep learning or
the latest theory that appears in the latest papers. And then
regularly we have some surveys or some questions that are
asked in this group, and one of the questions was “What is
the deep learning Bible? What is the best book people would
recommend?” Of course, there was the “Deep Learning
Book” that I just mentioned. Then there was also the book
“The Elements of Statistical Learning”. That’s an incredible
book because it will give you everything you need, all the
basics you need to really have a deep understanding of deep
learning.
Kirill: Gotcha. All right, thank you very much for sharing that. I
think this was quite an exciting session venturing into the
world of deep learning and hopefully quite a few interesting
things can be picked out from what we shared today. Once
again, thank you so much for coming on the show. I really
appreciate your time.
Hadelin: Thank you very much. I was very happy to come back again.
Kirill: Great. All right, see you soon. Take care.
Hadelin: Yeah, see you soon. Bye.
Kirill: So there you have it. That was deep learning for you, a
quick summary of all the six models. Of course, there’s
many more models in deep learning, but these are the six
main ones which we identified and covered off in Deep
Learning A-Z and hopefully in this podcast you got a quick
glimpse of what the world of deep learning is like and how
these models are structured and what they can be used for.
Personally, my favourite out of all the six is the Boltzmann
machine just because it’s so cool. Just because it has that —
as Hadelin mentioned, you can look at it in two different
ways. I look at it through the lens of an energy-based model,
so it uses a Boltzmann distribution. Actually, when we were
creating the course we modified the Wikipedia article for
Boltzmann distribution and that was so cool. There’s
actually a tutorial in the course where we go into the
Wikipedia article for Boltzmann distribution and add some
text, just one line about the Boltzmann machine.
It’s very close and dear to my heart because of what I used
to do in my bachelor’s degree in physics. It is a very
interesting way to model systems. The Boltzmann
distribution is actually used to model energy-based systems
that include things like entropy. The gas in the room
where you’re sitting is distributed according to a
Boltzmann distribution. It’s taking the state that has a
minimal energy cost. So that was very interesting for me, to
learn about and to kind of include tutorials about it. So
that’s my favourite for sure. And I’d be very curious to find
out what your favourite is. If you haven’t ventured into the
world of deep learning a lot yet, probably this is your first
intro to that world, and it’s going to be hard to decide at this
stage what your favourite is. But maybe if you do venture
into the world of deep learning one day, I’ll be really
interested to find out which one of those six models you
would prefer the most yourself.
So that’s that. Deep learning is definitely a field to be at least
aware of. That’s where the world is going. As we mentioned
in the course intro video, what we’re seeing is a shift from
machine learning to deep learning. We’re seeing that deep
learning methods are becoming so advanced and
sophisticated that just through their sheer complexity, not
in the sense of understanding them, but in the sense of how
complex the problems are that they can tackle, just through
that, they are edging out machine learning methods.
I think Ben Taylor on one of the previous episodes in the
podcast actually mentioned that you can beat the accuracy
of a machine learning algorithm with deep learning methods
very easily. For instance, on the MNIST dataset, which is a
dataset of handwritten digits, you have to be very, very
proficient in machine learning to get an accuracy anywhere
above 95%. On average, you’ll probably get like 90%-92%
accuracy. You could just be starting out into deep learning
and achieve an accuracy of 98% just because it’s that
powerful and you just have to code a few lines of code if you
use Keras, as Hadelin mentioned.
So deep learning is definitely something to be aware of.
That’s where all the technology is going and in my view, it’s
the future of data science. So hopefully this podcast gave
you a good overview of what to expect or the different types
of buzzwords and buzz terms you might be hearing in the
near future.
So there we go. I hope you enjoyed this episode and make
sure to leave a review on iTunes if you’re listening on iTunes.
It would really help us out and help spread the word about
this podcast. And on that note, I look forward to seeing you
next time. Until then, happy analyzing.