SDS PODCAST

EPISODE 47

WITH

HADELIN DE PONTEVES

http://www.superdatascience.com/47

Kirill: This is episode number 47 with Deep Learning Expert

Hadelin de Ponteves.

(background music plays)

Welcome to the SuperDataScience podcast. My name is Kirill

Eremenko, data science coach and lifestyle entrepreneur.

And each week we bring you inspiring people and ideas to

help you build your successful career in data science.

Thanks for being here today and now let’s make the complex

simple.


Welcome everybody back to the SuperDataScience podcast.

Super excited to have you on board, and today by popular

demand, we have a returning guest, Hadelin de Ponteves.

Hadelin was on the show for the first time about 6 months

ago, that was episode 2, and that was when we just released

the Machine Learning A-Z course. And now he's back, and

this time we've just released the Deep Learning A-Z course.

And so what is deep learning all about? Well deep learning is

an advanced branch of machine learning where we use

algorithms called neural networks to mimic the human

brain in order to be able to solve very complex problems.

And our goal with Hadelin in creating Deep Learning A-Z

was to create a course that has a robust structure and really

covers topics in a simple manner that is accessible to

anybody. So you don't have to be an expert in mathematics

or in programming or in anything else for that matter. You

just need a basic background in high school maths to

understand this course and to follow along.


And that's exactly what we created. It's released now and it

was actually also featured on Kickstarter, where it had

immense support from backers and we're very excited about

that. We're very excited to bring this course to the world and

today we're going to run through all the six different models

that we discuss in the course and give you a quick

breakdown.

And finally, for those of you who don't know Hadelin, I

wanted to mention that Hadelin has experience in deep

learning from Canal+, which is a competitor of Netflix, and

as well as that, Google. So this is definitely a person who

knows his way around deep learning and we're going to

learn quite a lot in today's podcast. So I'm very excited for

you to hear this episode and get your glimpse into the world

of deep learning and the different models that exist there.

And without further ado, I bring to you Hadelin de Ponteves.


Welcome everybody and welcome, welcome, welcome Hadelin

back to the show. How are you going, my friend?

Hadelin: I'm doing very well, thank you. I'm very happy to be back.

Kirill: Awesome. Well you definitely should be because 12,000

views as of today, your podcast, your first episode has

12,000 views. How do you feel about that?

Hadelin: Wow, that's amazing! I didn't expect that many views, so I

am very happy about that and I hope this has helped some

people and that people could be inspired from it.

Kirill: Yeah, I definitely hope so too. And I'm actually sure of it, and

it's raised some very interesting debates. The

podcast was about machine learning, but what raised

debates was your health, since you mentioned you were

sleeping 3 hours a day for the past 3 years.

Hadelin: I know, right? Some people were concerned! I got messages on

LinkedIn telling me that I should be careful! Thank you very

much to all of them.

Kirill: Yeah, that was fantastic. Everybody worried about Hadelin

out there, he's still alive. He's still fine, and as always, very

energetic. So, mate, what have you been up to? Or what

have we been up to over the past couple of months? What

projects have we been working on?

Hadelin: Well we made the deep learning course! The big brother of

the machine learning course. So yeah, this one is I think

more powerful because we dive deep into some more

advanced techniques and we code some more advanced

stuff, like we use classes and objects to implement some

deep learning models from scratch. So I think it is quite new

and people will still improve their skills a lot even after doing

the machine learning course. So I think it's very

complementary and it's basically taking things to the next

level.

Kirill: Exactly. Exactly. And also I wanted to point out here that

our goal was to create the most disruptive course on deep

learning and really collate a lot of information from pretty

much everywhere and put in our knowledge and experience

into this course and our view on things and put together,

most importantly, not just one, not just two, not just three,

but six different models on deep learning. And I think we

finally can say that we have completed this course and it's


there and people are going through it and we're very, very

excited about it.

Hadelin: Yes, and actually I finished implementing the last, the very

last model of this course yesterday. It was the Boltzmann

Machine, which took 14 tutorials. And I was really happy to

finish it because it was quite a challenge. Because this is

one of the most advanced models in this course because it's

a probabilistic graphical model so it handles a lot of

probabilities and we had to dive into the MCMC, Markov

Chain Monte Carlo techniques, with Gibbs sampling, with

the random walks, so a lot of very cool mathematical

concepts, but we did it yesterday and I was very happy to

finish on that note.

Kirill: Fantastic, and congrats on that. It was definitely a big

course to tackle. And for everybody listening today, we're

going to talk about deep learning. This podcast is dedicated

to deep learning and summarising and running through all

of the things that we have covered in the course. And just to

give you, even if you're not taking the course, just to give

you a feel and taste for what deep learning is about, what

type of models exist out there, what type of approaches,

techniques, what are the use cases and applications. And so

we'll be running through six different models of deep

learning, giving you our comments on that and this is going

to be quite an exciting podcast. Really looking forward to

this session. How about you, mate? You excited about this?

Hadelin: Yeah, very excited.

Kirill: Alright. Ok. So let's kick things off with the very first, very

basic model, the artificial neural networks. It goes into the


foundation of all of the deep learning concepts and

principles and outlines everything there. And probably we'll

start off by saying that in artificial neural networks -- oh, by

the way, if you're listening to this podcast and you haven't

taken the course yet, then probably you should know that in

this course, just like in the machine learning course, I was

doing the intuition tutorials and Hadelin was doing the

practical application tutorials. So you might hear us

comment on the deep learning methodologies from both

sides exactly in that manner.

And in terms of artificial neural networks, we're trying to

model the human brain. So we're creating this structure

which is full of neurons which are interconnected. And the

fascinating thing, just by creating this course, I personally

learned a lot. Especially about the field of neuroscience. I

found out that in the brain we have 100 billion different

neurons and each one of them is connected to as

many as a thousand other neighbours. So I just wanted to

get your comment on that, Hadelin. What are modern deep

learning neural networks like? Are they anywhere close to

that size?

Hadelin: Well, at least that’s our goal. We are trying to build some

models that mimic the processes in the human brain. But,

of course, we’re not there today. As you said, there are

billions of neurons in the human brain and there are lots

and lots of connections between the neurons. So far, when

we talk about artificial deep learning, the artificial neural

networks we make when we solve some problems, they

contain several, maybe dozens of layers at most. So we’re

very far from what’s happening in the brain, but we’re trying


to mimic what’s happening in the brain, we’re trying to

reproduce the structure and the connections by adding

some mathematics into it, you know, to make some relevant

models that can solve complex problems with lots of non-

linear relationships. Only models that are close to how the

brain works can solve that kind of problems.

Kirill: Okay. And why is that? That’s actually a good segue into

what’s the whole point of deep learning? Why can’t we just

stick to machine learning and solve all of our problems with

machine learning?

Hadelin: Because basically there are limitations in machine learning.

In machine learning we can solve a lot of problems, but

when the problems become very complex because, you

know, problems are defined by their relationships, whether

they are linear, non-linear, how complex are the non-linear

relationships, and when we are reaching some high level of

complexities, because with deep learning we can extend the

complexity by adding some layers and some neurons, we can

basically solve more and more complex problems thanks to

the fact that we can add these layers and these neurons.

With machine learning, well, you have some fixed models,

you cannot really extend the models of machine learning.

Well, you could for XGBoost, for example. XGBoost is

actually another great model that is used to solve very

complex problems. That is because you can add some trees

in XGBoost because XGBoost is like a Random Forest but in

a much more advanced construction. So that’s the thing –

you can add some level of complexities by adding some trees

or layers in the models and that’s how deep learning can

solve complex problems as opposed to machine learning.


Kirill: Okay, gotcha. So, basically, at some point you reach a level

of complexity. For example, recognizing objects in an image.

That’s pretty complex, right? And not just recognizing

objects from an algorithm which tells you “Look for this type

of pixel,” but automatically learning how to recognize objects

from many, many thousands of images. That’s a complex

problem.

Hadelin: That’s right. That’s a very complex problem. Not only do you

have to understand some patterns in thousands and

thousands of images, but also you have to understand some

patterns in the pixels. And there are thousands and

thousands of pixels so that makes the problem really, really

complex and, of course, this is not solvable with machine

learning, classic machine learning.

Kirill: Okay, gotcha. So, going back to ANNs, we introduced a

couple of concepts there, and I know it’s going to be

extremely—like, the course is how long, 20 hours or

something so far?

Hadelin: Yeah, little more than 20 hours. We’re going to add some

more tutorials.

Kirill: Yeah, it’s going to be extremely hard to convey some of these

topics in the podcast. But just quickly, we introduced a few

concepts and probably the key one is the activation function.

What happens in the activation function? Can you tell us a

bit about that?

Hadelin: Okay. First of all, there are different kinds of activation

functions. We have the rectifier activation function that is

used to break the linearity because that’s the whole point of

it. We are trying to solve non-linear problems. In order to be


able to solve these non-linear problems, we have to break

the linearity between one hidden layer to another hidden

layer. That’s what the rectifier function is for and actually

you explain it very well in one of your intuition lectures.

And then we have the sigmoid activation function, which is

another kind of activation function, that is used to output

the predictions in terms of probabilities. Instead of returning

an exact outcome, for example a binary outcome 0 or 1, you

will use the activation function to model the probability of an

outcome. So, for example, that will return 0.8, meaning that

the outcome will have 80% chance of being the right

prediction.

And then you have some other activation functions, but

basically the idea of an activation function is to activate the

neuron. It’s used to activate certain neurons in the

neural network with a certain weight, and the higher the

weight, the more relevant the neuron will be.
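
A minimal sketch of the two activation functions described above, in plain Python. The inputs, weights, and the helper name `neuron_output` are made-up values for illustration and do not come from the course.

```python
import math

def rectifier(z):
    """Rectifier (ReLU): passes positive signals through and blocks negative ones."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical hidden neuron: the weighted sum is -1.5, so the rectifier blocks it.
hidden = neuron_output([1.0, 2.0], [0.5, -1.0], 0.0, rectifier)

# Hypothetical output neuron: the weighted sum is 1.4, which the sigmoid
# turns into a probability of roughly 0.80.
prob = neuron_output([1.0, 2.0], [0.4, 0.5], 0.0, sigmoid)

print(hidden, round(prob, 3))
```

The rectifier is what breaks the linearity between hidden layers, while the sigmoid on the output turns a raw score into something that can be read as a probability.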

Kirill: Yeah, gotcha. That is very similar to what’s going on in the

human brain, in our brains. As we’re just talking now,

what’s going on is neurons are firing up, they’re sending an

electrical impulse to the following neuron, then that neuron

is getting electrical impulses from many different neurons

around it, so up to a thousand different neurons. It’s

combining all of that, it’s making a decision whether it needs

to fire up or not, and then it’s passing on (or not) an

electrical signal of varying intensity to the next neuron and

so on.

And when you combine all of that, you have this whole huge

army of neurons sending around electrical signals and that


is what thought is, that is what all of these ideas that we

have, all of these concepts and our interaction with the

world, all of our senses, they all translate into that. It’s

fascinating when you think about it. All the thoughts that

we have are basically just electrical signals running around

in our heads.

Hadelin: That’s right, yes. That’s fascinating. And it’s fascinating that

we’re managing to reproduce that with the models that we

make ourselves.

Kirill: Yeah, totally. That’s very important, that we mention that.

So we have neurons in the artificial neural networks and we

have activation functions which kind of connect the neurons

and facilitate the interaction between the neurons. So in the

artificial neural network, we have three types of layers. Tell

us about that. What layers do we have in the ANN?

Hadelin: Okay. Of course, let’s start with the input layer. The input

layer is the layer that receives the observations. For

example, let’s say we’re trying to predict if some customers

are going to leave or stay in a company or leave or stay in a

bank. Well, the input layer would get the information of the

customers that will go through the network to be able to

then make some predictions. That’s the input layer. It just

receives the observations.

And then we have the hidden layers, and that’s where

everything happens. That’s where the learning happens.

That’s where the model is trying to learn how to make some

correlations between the information of the input layer and

the output, which is the final prediction. And this final

prediction is going to be compared to the real outcome and


that’s how the model is going to learn, because it’s going to

compare the prediction to the real outcome and, according

to the mistake it might make, it will correct what happened

before. That’s backpropagation. And backpropagation will

then correct the weight so that it can learn some better

correlations next time. So after the hidden layers we

have the output layer that gets the final prediction.
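
The three layers and the correction step can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the course's implementation: the dataset (a logical OR), the layer sizes, the tanh hidden activation, the learning rate, and the function names `forward` and `train` are all hypothetical choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    """Input layer -> hidden layer (tanh) -> output layer (sigmoid)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output

def train(samples, n_hidden=3, epochs=1000, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in samples:
            hidden, p = forward(x, w1, b1, w2, b2)
            # Compare the prediction with the real outcome (cross-entropy loss).
            p_safe = min(max(p, 1e-12), 1 - 1e-12)
            total += -(y * math.log(p_safe) + (1 - y) * math.log(1 - p_safe))
            # Backpropagation: error at the output, then pushed back to the hidden layer.
            d_out = p - y
            d_hidden = [d_out * w2[j] * (1 - hidden[j] ** 2) for j in range(n_hidden)]
            # Correct the weights in proportion to their share of the error.
            for j in range(n_hidden):
                w2[j] -= lr * d_out * hidden[j]
                b1[j] -= lr * d_hidden[j]
                for i in range(n_in):
                    w1[j][i] -= lr * d_hidden[j] * x[i]
            b2 -= lr * d_out
        losses.append(total / len(samples))
    return (w1, b1, w2, b2), losses

# Hypothetical training data: the logical OR of two inputs.
samples = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
params, losses = train(samples)
print("first epoch loss:", losses[0], "last epoch loss:", losses[-1])
```

The average loss per epoch should shrink as the weights are corrected, which is the "learning" that happens in the hidden layers.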

Kirill: And the more hidden layers we add, the more complex

our neural network becomes. It’s harder to train, but it

might be able to solve more complex problems. Is that right?

Hadelin: Yes, that’s correct.

Kirill: Okay. And you mentioned a couple of interesting terms.

First of all, if I’m not familiar with neural networks I might

ask the question, “What is the purpose of building a neural

network if on the output you’re comparing the results to the

real outcome so that means you already have the real

outcome? What’s the whole point of modelling an outcome?”

Can you comment on that?

Hadelin: Sure. That’s because for any machine learning model, or any

deep learning model, there is a training phase and a testing

phase. In order for the model to learn something, it needs

the real outcomes to learn the correlations. Because if it

didn’t have the real outcomes, it couldn’t learn anything. It’s

like when a student is practicing for an exam and he’s

learning a lesson or a topic: he needs the real outcome when

he’s training so that he can evaluate how well he understood the

course.

Well, that’s the same for a deep learning model. It needs a

training phase with real outcomes so that it can make some


predictions itself, but then it has to compare these

predictions to the real outcomes so that it can correct itself.

And then we have the test set. And on the test set we have

some totally new observations for which we don’t have the

real outcomes and this is what pure predictions are about.

These are really pure predictions, where I don’t have anything to

compare them to.
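
A rough sketch of how observations might be divided between the training phase and the testing phase. The function name `train_test_split`, the 20% test fraction, and the seed are assumptions made for illustration; the 80/20 ratio simply mirrors the 8,000/2,000 image split mentioned later in the conversation.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The model only ever learns from the first part; the second part stays unseen.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 10,000 observation ids standing in for real rows.
rows = list(range(10_000))
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Keeping the two parts disjoint is what makes the test predictions "new" from the model's point of view.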

Kirill: Okay, gotcha. And in this case we actually compare to the

real outcomes. So some models train with the real outcomes

and some models still train, but without the real outcomes.

The two different ones are called supervised models and

unsupervised models. So artificial neural networks are a

type of supervised model and in total we looked at three

different supervised models and we looked at three different

unsupervised models.

And probably the last comment is on backpropagation.

You’ll be hearing the word ‘backpropagation’ quite a lot. It is

relevant to supervised deep learning models and basically in

summary it’s exactly what Hadelin described. You compare

it to the real output, you find the error and then you

backpropagate – hence the name backpropagation – you

backpropagate that error through the network to update the

network in very simplistic terms, and that’s how the models

train. So that was artificial neural networks. Let’s move on

to number two: convolutional neural networks. CNNs – what

are they used for?

Hadelin: CNNs are mainly used for image detection, but they have

various other applications like text recognition. How does

image detection work? Well, the CNNs are going to try to


learn something like some patterns in the pixels and that’s

how they will detect some specific features in images to be

able in the end to recognize what is in the image. For

example, in our course we’re training an algorithm, a CNN

actually, that is learning to predict whether there is a cat or

a dog in images. To do this, it will try to understand some

patterns in the pixels of the images to detect some specific

features of dogs and the specific features of cats. And that’s

how in the end it can manage to predict if there is a cat or a

dog in the image.
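
To give a feel for what "detecting a pattern in the pixels" can mean, here is a small sketch of a single filter sliding over an image. The 4x4 image and the hand-written edge filter are hypothetical; a trained CNN would learn its own filter values rather than have them written by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products.

    Like most deep learning libraries, this does not flip the kernel, so
    strictly speaking it is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-written filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is only non-zero in the middle column, exactly where the edge sits. A CNN stacks many such filters and learns which ones are useful for telling cats from dogs.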

Kirill: Yeah. And the most fascinating thing for me is that it can

learn that – and we got a pretty good accuracy rate in the

course – it can learn whether it’s a cat or a dog just by

looking at lots and lots of images. How many images did we

have in the course?

Hadelin: 10,000. We’re training the CNN on 8,000 images and we’re

testing it on 2,000 images. And indeed, on the 2,000 images

we get a pretty good accuracy, that is, a pretty good rate of correct

predictions.

Kirill: About what? What is it like?

Hadelin: I think we reached 82%.

Kirill: 82% accuracy?

Hadelin: Yes.
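
As a quick worked example of what that figure means: accuracy is the share of test images predicted correctly, so 82% on 2,000 images corresponds to 1,640 correct predictions. The labels and predictions below are fabricated placeholders used only to show the arithmetic, not the course's actual results.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the real labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Placeholder test set of 2,000 labels.
labels = ["cat", "dog"] * 1000

# Placeholder predictions: copy the labels, then make the first 360 wrong.
predictions = list(labels)
for i in range(360):
    predictions[i] = "dog" if labels[i] == "cat" else "cat"

print(accuracy(predictions, labels))
```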

Kirill: Which is great. So, it means that CNN had a look at 8,000

images of dogs and cats which are labelled so it knows that

this folder has cats and this folder has dogs. It just looked

through them. Without anything else, no tricks or hacks, we

specified nothing. We just said, “Look at these images. These


are cats, these are dogs, and now decide for yourself what is

important for you in an image, what features are you going

to be looking for when you’re looking at new images to

decide whether it’s a dog or a cat.” And then after all the

training we gave it 2,000 images of dogs and cats and out of

those 2,000 it got 82% correct. So it identified 82% of the

images correctly, that these are dogs/cats. Without us

having to tell it how to do it, it learned the thing itself.

Hadelin: Yes, that’s right. And these are our new observations, new

images. And besides, this is without any parameter tuning,

because in the course we insist on parameter tuning and

that’s one of the homework assignments we gave to the students. We let

the students work on the models to improve them, tune them.

Actually, one of the challenges is to get 90% accuracy, which

I know is possible so that’s why I gave the challenge. The

students who manage to reach that 90% accuracy, they will

have the gold medal.

Kirill: (Laughs) Has anybody gotten 90% accuracy yet?

Hadelin: Somebody got the gold medal on the other challenge, which

is for ANN. Actually, that was today, so I congratulated that

student. He got 87% accuracy, which is very good on the

other problem.

Kirill: I think I saw that message. His accuracy was even better

than in the training set. It was higher on the test set than on

the training set.

Hadelin: Absolutely. That’s the one, yes.

Kirill: Okay, gotcha. All right, very cool. So, that was CNNs – a very

brief intro into convolutional networks. And it’s very


important to understand the concept behind CNN because

that is not in its raw form what goes into self-driving cars,

but that is the direction in which things like self-driving cars

are aiming. They need to recognize pedestrians on the

streets, they need to recognize stop signs, they need to

recognize the colour of the traffic light to work completely

autonomously. That is your first step in that direction.

Would you say that is a fair summary?

Hadelin: Yeah, that’s absolutely a fair summary. Of course, CNNs are

used in self-driving cars to detect objects on the street,

which is absolutely compulsory.

Kirill: Yeah. And there was actually a challenge to see how well

computers can do to recognize different types of road signs

and right now they’re already doing better than humans. It’s

really interesting. All right, moving on to the next one:

recurrent neural networks.

Hadelin: Oh, that was quite a challenge.

Kirill: (Laughs) Yeah. For us, we spent so much time on recurrent

neural networks simply because of the challenge that we set

ourselves. We wanted to predict stock prices, but in the end

we found out that it’s too chaotic to predict with recurrent

neural networks to the extent that we attempted that

challenge. So probably if you spend more time you might be

able to find ways to do it better, but nevertheless a very

interesting type of neural network. This is a neural network

that has short-term memory. All neural networks have

long-term memory: when you train them up, they

remember the structure and the configuration of the neural

network, they remember the training, and then they apply

that knowledge in the testing and hence they can recognize

dogs or cats or solve complex problems. But recurrent

neural networks actually have short-term memory, so if it’s

going through a dataset in row 51, it will remember what the

outcomes were for row 50 or row 49 or row 48 and so on.

What kind of applications does that open up for the world of

deep learning?

Hadelin: Well, there are many applications. For example, there are

applications for natural language processing. You can use

recurrent neural networks for natural language processing if

you want to predict what’s going to be the next step in a

sequence of text. So, for example, you can predict what’s

going to be the last word in a sentence and more, so you can

increase the complexity of the problem, and it can also be

used for video classification. For example, if you want to

predict what’s going to happen next in a video, then in that

case you will need to combine convolutional neural networks

with RNNs because RNNs are basically used to predict a time

series, like what’s going to be the next step in a series of

events. So that can be used for text, as I said, or for videos

as well.
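
The "short-term memory" idea can be sketched with a single recurrent cell that carries a hidden state from one step of the sequence to the next. The weights `w_x` and `w_h`, the tanh activation, and the toy input sequence are assumptions chosen for illustration; this is a plain recurrent step, not the LSTM used in the course.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: mix the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_sequence(inputs, w_h=0.8):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_h=w_h)
        states.append(h)
    return states

# Hypothetical sequence: a signal at the first step, then silence.
with_memory = run_sequence([1.0, 0.0, 0.0])
# Setting w_h to zero removes the link to the previous step.
without_memory = run_sequence([1.0, 0.0, 0.0], w_h=0.0)

print(with_memory)
print(without_memory)
```

With the recurrent weight in place, the second and third states are still non-zero even though the inputs are zero: the cell "remembers" the first step. Without it, the state drops back to zero immediately.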

Kirill: Okay, gotcha. So, RNNs are very often used in combination

with other algorithms like CNNs. And you can train up an

RNN to basically just go through like a huge amount of text

and learn from it how sentences are structured, how words

follow each other. That way you train up a deep learning

model to create sentences, in essence.

In the course we actually mention a small video, a 9-minute

film, called “Sunspring” which was entirely written by an


RNN, by specifically an LSTM, so long short-term memory

type of RNN which was trained up on thousands of sci-fi

films and then it wrote its own sci-fi film and people actually

acted it out, professional actors acted it out, and it

participated in the – I think it was the Sci-Fi London Film

Festival and it was rated – I think it was in the top 10. So

that’s pretty exciting. That’s where the world is going.

Hadelin: Yes. That’s a pretty exciting application.

Kirill: Yeah. And you can think of lots of other applications.

Basically, RNNs are there to facilitate the short-term

memory which we humans have. It’s so powerful. We don’t

just have long-term memory. If you just had long-term

memory then you wouldn’t remember the start of this

podcast and would be very sad. You’d be sitting there

thinking “What are we talking about? What is this whole

topic right now?” Short-term memory is a very important

tool that we have as humans and therefore, why would we

deprive deep learning models of that concept? And that’s

why RNNs exist. And probably the last thing on RNNs I

wanted to mention here is LSTMs. Can you tell us a bit

about LSTMs? This is actually a huge breakthrough for the

world of RNNs. What is it all about?

Hadelin: Yes, this is actually the disruption in RNNs because, as you

said, the classic RNNs have short memory, but the LSTM is

the first RNN to be able to learn long-term relationships.

Basically, it’s the first RNN to have long memory. That’s why

it’s called LSTM – long short-term memory. Basically, that’s

the most powerful RNN model and that’s the one we

implement in our course.


Kirill: Yeah. I think it was created in Germany in the 90s.

Hadelin: Yes, 1997 or something.

Kirill: Yeah, it’s pretty cool. And actually it’s very interesting,

throughout the course we mentioned the creators of these

models and people who came out with them. We’ve got

Geoffrey Hinton, we’ve got Yann LeCun, Yoshua Bengio. All

of these scientists, I think they’re all from Canada, if I’m not

mistaken.

Hadelin: Yann LeCun is French.

Kirill: Yann LeCun is French? Okay. That’s right. But now they live

in Canada/America and so they have their own little circle

which I think Yann LeCun calls the conspiracy of deep

learning. It’s very interesting. And then all of a sudden you

have somebody from Germany creating the LSTM. When I

found out about them, where the LSTMs came from, it was a

pleasant surprise that the whole world is actually

contributing to this movement. It’s great.

Okay, next we are moving into the world of unsupervised

deep learning. First of all, can you tell us a bit about

unsupervised? What does that mean, ‘unsupervised’?

Hadelin: Well, unsupervised means that basically you don’t have an

outcome to compare your prediction to, so basically what

you have to do is identify yourself some structure in the data

that will become your future predictions. So, basically, when

we’re starting with unsupervised learning, we don’t have a

dependent variable. We don’t have something that we want

to explain. But by identifying some segments or some

clusters or some structures in the data, we will eventually


end up with such a dependent variable. And actually, in the

course we highlight this transition from unsupervised to

supervised because once we complete our unsupervised

deep learning model, we end up with a dependent variable

that can lead our model to become a supervised deep

learning model.

Kirill: Yeah, that’s definitely a very powerful thing. Unsupervised

models, I think they’re more complex generally because each

one of the models that we discuss has its own very unique

approach to learning, kind of a trick or a hack or a whole

new concept that it introduces in order to bypass the fact

that it doesn’t have this output to compare to.

We start off by talking about SOMs – self-organizing maps.

These neural networks were first created in I think 1982 by

Teuvo Kohonen, a Finnish professor, and they are very

interesting. By far, they are the simplest out of the whole

course just because they are so elegant and the idea behind

them is so straightforward, even the mathematics. We don’t

talk about mathematics in the course, we don’t go deep into

the mathematics so we don’t get bogged down, but the

mathematics driving the other models in the course, they’re

pretty complex. But for self-organizing maps, they are very

straightforward, they’re very easy to code even from scratch.

We talk about self-organizing maps and we find out how

exactly they work. So what can you tell us about SOMs,

Hadelin?

Hadelin: Okay. First, what is the purpose of SOMs? Well, it’s to detect

some features in very complex data. There is a

high-dimensional dataset, again full of non-linear relationships. It


will detect some features which we have absolutely no idea

what they are, but it will detect some features inside this

data. And how will it do that? Well, it will do that by

reducing the dimensionality of the dataset. That’s why at the

beginning we start with a lot of features, which are the

columns in the dataset, and a lot of observations, and

eventually we end up having this two-dimensional map. And

on this two-dimensional map we can see some neurons that

we call the ‘winning nodes’. Basically each cluster, or each

winning node, is detecting a certain feature. That’s pretty

powerful because even by having a very complex dataset at

the beginning, we end up with this cool two-dimensional

map that is very visual and that we can use directly to detect

some specific features or see some specific clusters,

segments in the dataset. And actually we implement a self-

organizing map to detect fraud and that’s because fraud is a

specific feature detected by the SOM.
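
A minimal sketch of one self-organizing map update, to make the "winning node" idea concrete. The 3x3 grid, the learning rate, the Gaussian neighbourhood, and the function names are assumptions for illustration; the course's own implementation may differ.

```python
import math
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def best_matching_unit(grid, x):
    """Return the (row, col) of the node whose weights are closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            d = squared_distance(weights, x)
            if d < best_dist:
                best, best_dist = (r, c), d
    return best

def som_update(grid, x, lr=0.5, radius=1.0):
    """Pull the winning node, and to a lesser extent its neighbours, towards x."""
    br, bc = best_matching_unit(grid, x)
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
            influence = math.exp(-grid_dist2 / (2 * radius ** 2))
            for i in range(len(weights)):
                weights[i] += lr * influence * (x[i] - weights[i])
    return br, bc

# Hypothetical 3x3 map over 4-dimensional observations.
rng = random.Random(0)
grid = [[[rng.random() for _ in range(4)] for _ in range(3)] for _ in range(3)]
x = [0.9, 0.1, 0.8, 0.2]

winner = best_matching_unit(grid, x)
before = squared_distance(grid[winner[0]][winner[1]], x)
som_update(grid, x)
after = squared_distance(grid[winner[0]][winner[1]], x)
print(winner, before, after)
```

Repeating this update over many observations is what arranges a high-dimensional dataset onto the two-dimensional map, with similar observations landing on nearby nodes.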

Kirill: Yeah. It’s a very visual type of algorithm.

Hadelin: Yeah, it’s very visual.

Kirill: So, you might have a huge dataset with lots and lots of

columns which there is no way you would be able to

visualize in a concise way, but then the SOM allows you to

reduce that, put it on a map, and then you can see all of the

connections.

In the intuition side of things we walk through an

application of SOMs to the – I think the U.S. Senate or

something, part of the United States government, how they

vote, and you can see how they’re all plotted in a self-

organizing map and then you can see how to read that map


and understand what it’s saying about how they’re voting,

Republicans versus Democrats and so on. In our practical

tutorials, we actually have a very interesting application. Tell

us a bit more. What did you prepare for us in the practical

side of things to kind of showcase how SOMs can be used?

Hadelin: Okay. So, the dataset contains some credit card

applications. Some customers – well some not-yet customers

– some people applying to have an advanced credit card in

their bank and basically to apply for this credit card they

need to fill out a paper form and provide a lot of information like their

credit score or other type of financial information like their

estimated salary.

And basically at the end of the application there is a yes or

no whether they got approved for the credit card. Of course,

like in any application, there are some people that can

cheat, and the goal is to find the potential cheaters, to detect

the potential frauds in the applications. So we have no idea

how to visualize that at the beginning because we have a lot

of information, all the information that was filled in by the

customers, but the SOM will manage to detect the frauds by

detecting some specific features in the self-organizing map.

I’m not going to say right now what are going to be the

features in the SOM because it’s actually something that the

students have to guess at some point, but these features,

these frauds, are pretty specific, are pretty visual in the self-

organizing map so we can really identify them well.

Kirill: Yeah. I found that that was a very cool application of self-

organizing maps. When you came up with that challenge for

the course I was very excited because I’ve worked with fraud


analytics before, back when I was at Deloitte, but I’ve never

actually seen self-organizing maps applied to solve the

problem and I think that this approach is very interesting

because it does allow you to go through lots and lots of data

without having to supervise the model, without having to

come up with these inputs at the start and therefore it can

really find it. It’s like a robot or a computer looking for

people who have committed fraud. Like, what are a human’s

chances against a machine? Zero, right? Pretty much zero.

Hadelin: Zero. Let’s hope this doesn’t go too far.

Kirill: Yeah. Let’s hope it doesn’t turn into World War III or

something like that.

Hadelin: Exactly.

Kirill: Yeah. And in terms of SOMs, what I found when I was

creating the intuition tutorials, I found that they’re really

very different to all the other five models that we discussed

and I would never have thought of classifying them as a

deep learning model. I always thought SOMs are just a type

of dimensionality reduction model. I think the lack of

backpropagation or the lack of more interconnectivity

between neurons makes me kind of think that in some way

they’re just a bit too simple to be considered as a deep

learning model. What are your thoughts on that? Is it maybe

just because they’re so elegant that they give you this

impression?

Hadelin: Well, first of all, I agree with you. I had sort of the same

feeling when I started studying about SOMs. But with no

hesitation I wanted to include them in the course because

they actually involve neurons. The points in the grids are


actually neurons and neurons are attracting other neurons

around them according to how they are similar to these

neurons. Of course these are very different from the other

neural networks that we implement in the course, but this is

still a neural network in two dimensions having several

neurons. It’s considered as a neural network, and that’s why

it’s considered as deep learning. But you’re definitely right

that this might be the most simple deep learning model in all

these neural networks.
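
The mechanism described here (a two-dimensional grid of neurons, where the neuron most similar to an observation pulls its grid neighbours toward that observation) can be sketched in plain Python. This is only an illustrative sketch, not the course's implementation: the function names `train_som` and `best_matching_unit`, the grid size, the learning rate and the neighbourhood radius are all assumptions chosen for the example.

```python
import math
import random


def best_matching_unit(grid, x):
    """Return grid coordinates of the neuron whose weights are closest to x."""
    best, best_d = (0, 0), float("inf")
    for i, row in enumerate(grid):
        for j, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(x, w))
            if d < best_d:
                best, best_d = (i, j), d
    return best


def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM and return its grid of weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Each grid point is a "neuron" holding a weight vector in input space.
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius shrink as training proceeds.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 0.3)
        for x in rng.sample(data, len(data)):
            bi, bj = best_matching_unit(grid, x)
            for i in range(rows):
                for j in range(cols):
                    # Neurons near the winner on the grid move more.
                    d2 = (i - bi) ** 2 + (j - bj) ** 2
                    h = math.exp(-d2 / (2 * sigma ** 2))
                    w = grid[i][j]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
    return grid


# Toy data with three columns scaled to [0, 1]; each row lands on one grid cell.
toy = [[0.1, 0.2, 0.1], [0.9, 0.8, 0.9], [0.15, 0.1, 0.2], [0.85, 0.9, 0.8]]
som = train_som(toy, rows=3, cols=3, epochs=20)
cells = [best_matching_unit(som, x) for x in toy]
```

Each observation, however many columns it has, ends up as a single coordinate on the grid, which is what makes the result easy to plot and read.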

Kirill: Gotcha. But even saying that it’s simple, the applications are

massive. Maybe the simplicity facilitates more applications.

We briefly mentioned a paper in the course about how SOMs

are used to analyse the probability density function of

photometric redshifts, so basically an application in

astronomy. Then we look at an example of how it’s applied to

World Bank data to look at different countries and their

prosperities, or poverty and then plot that on a map – and

plot that on an actual world map. So applications are

immense, limitless for self-organizing maps, and maybe that

has something to do with the simplicity.

Hadelin: That’s right.

Kirill: But moving on, next one is our favourite or probably your

favourite algorithm – the Boltzmann machine. By far, hands

down, the most complex algorithm that exists on this planet.

It was so much fun preparing tutorials, for me anyway. I

know that you did 14 tutorials about Boltzmann machine

and spent like over a week on that just recording.

Hadelin: Yeah, I was very excited recording that.


Kirill: How did that go? Tell us about Boltzmann machines. What

are they all about?

Hadelin: Well, first of all, they are very broad. The most important

thing is that this is a break from what we had before

because what we had before were sequential models with a

sequence of layers. We started with the input layer and then

we had some sequence of hidden layers and then eventually

the output layer.

And here, that’s totally different. We now have some neurons

and all the neurons are connected to each other and there’s

no longer input neurons and an output layer. Actually, what

happens is that we have some visible nodes and some

hidden nodes and basically the graph – because that

becomes a computational graph – the graph is updating

itself and the input nodes are getting updated so that in the

end they become the output nodes, but there is no output

layer.

It’s like a graph full of probabilities, because basically

Boltzmann machines are probabilistic graphical models, and

this is a graph that is updating itself and in the end it’s

maximizing what we call the likelihood, which makes

the nodes all relevant to each other with some relevant

outputs, which in the end are predictions. Because of all

these complexities, because that involves a large number of

computations, we have what we call the ‘restricted

Boltzmann machine’, in which we filter the connections

between the nodes. And in the restricted Boltzmann

machine, all of the nodes are no longer connected to each


other. Only the visible nodes are connected to the hidden

nodes and vice versa.
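
A rough sketch of that restricted structure in plain Python: weights exist only between visible and hidden nodes, and one "update" of the graph goes visible to hidden and back. The class name `TinyRBM`, the layer sizes, and the choice of contrastive divergence (CD-1) as the training step are assumptions for illustration; the conversation does not say which training procedure the course uses.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Restricted Boltzmann machine with binary visible and hidden nodes."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        # One weight per (visible, hidden) pair. There are no visible-visible
        # or hidden-hidden weights, which is the "restriction".
        self.W = [[self.rng.gauss(0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.a = [0.0] * n_visible  # visible biases
        self.b = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.b[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b))]

    def visible_probs(self, h):
        return [sigmoid(self.a[i] + sum(self.W[i][j] * h[j] for j in range(len(h))))
                for i in range(len(self.a))]

    def sample(self, probs):
        return [1 if self.rng.random() < p else 0 for p in probs]

    def gibbs_step(self, v):
        """Visible -> hidden -> visible: the graph updating itself once."""
        h = self.sample(self.hidden_probs(v))
        v_new = self.sample(self.visible_probs(h))
        return v_new, h

    def cd1_update(self, v0, lr=0.1):
        """One contrastive-divergence (CD-1) step on a single example."""
        ph0 = self.hidden_probs(v0)
        h0 = self.sample(ph0)
        v1 = self.sample(self.visible_probs(h0))
        ph1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(ph0)):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.a[i] += lr * (v0[i] - v1[i])
        for j in range(len(ph0)):
            self.b[j] += lr * (ph0[j] - ph1[j])


rbm = TinyRBM(n_visible=6, n_hidden=3)
liked = [1, 0, 1, 1, 0, 0]          # hypothetical binary ratings
reconstruction, hidden = rbm.gibbs_step(liked)
```

The visible nodes play both roles mentioned above: they receive the input, and after the update they hold the reconstruction that serves as the prediction.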

Kirill: Yeah. And it’s just a fascinating type of model. I’ll give you

an example of how it’s so different to everything else that

we’ve seen before. This is an example that we use in the

intuition tutorials in the course. Imagine a nuclear power

plant which generates electricity. That in itself is a system, is

a huge system which has lots of parameters. You have the

speed of a wind turbine, you have the temperature inside the

main core of the power plant, you have the pressure in

certain water pumps. You have lots of parameters that

govern how this facility is functioning. But there’s also lots

of parameters that are out of your control, parameters that

you can’t measure, which might be, for instance, the

moisture of the soil in a certain location. There’s so many

things, so many moving parts in this whole system, you

can’t measure them all at once.

What a Boltzmann machine does, and that’s why it’s a

probabilistic model, it generates all of these different states

of this nuclear power plant just randomly and then based on

your inputs, you’re able to tweak the Boltzmann machine for

it to be a better representation of your specific nuclear power

plant. Not just any nuclear power plant in the world,

possible or impossible, you kind of restrict it — and this is

not in any way connected with the term ‘restricted

Boltzmann machine’, those are different. Anyway, you

restrict this whole Boltzmann machine to being a

representation of your nuclear power plant and that allows

you to model very interesting things. And why is a nuclear

power plant a good example? Because you cannot model all


of the scenarios in a nuclear power plant through supervised

deep learning because in supervised deep learning you need

a training set. And you just don’t have, and it’s impossible to

have, lots of training data on nuclear power plant

meltdowns.

So if you want to model all the possible situations in which a

nuclear power plant explodes or disasters happen on it, in

order to be able to prevent them, you cannot do that through

supervised learning just because you don’t have the training

data. And that is where unsupervised models, for example,

Boltzmann machines, come in and they can really help out

with this situation because they are generating these

scenarios on their own at random and that allows you to go

venture into the scenarios that haven’t even happened in

real life.

Hadelin: Right. And besides, by giving this example you explain the

Boltzmann machines from the energy-based point of view

because actually a Boltzmann machine can be seen from two

different points of view. The first one is an energy-based

model, so exactly as you just explained, and the second

point of view is that it’s also a probabilistic graphical model,

and that’s what we focus on more during the practical

applications. It’s good that students get to see both points of

view.
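
The two points of view meet in one formula: every joint state of the nodes has an energy, and its probability is proportional to exp(-energy). A small sketch for a restricted Boltzmann machine tiny enough to enumerate every state is given here; the function names and the example weights are made up for illustration, and the temperature is assumed to be 1.

```python
import math
from itertools import product


def energy(v, h, W, a, b):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j."""
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(bj * hj for bj, hj in zip(b, h))
    pair_term = sum(v[i] * W[i][j] * h[j]
                    for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - pair_term


def joint_distribution(W, a, b):
    """Enumerate every binary state and return {(v, h): probability}."""
    n_v, n_h = len(a), len(b)
    states = [(v, h) for v in product((0, 1), repeat=n_v)
              for h in product((0, 1), repeat=n_h)]
    weights = {s: math.exp(-energy(s[0], s[1], W, a, b)) for s in states}
    z = sum(weights.values())  # partition function (normalising constant)
    return {s: w / z for s, w in weights.items()}


# Two visible nodes, one hidden node, illustrative parameters.
W = [[2.0], [3.0]]
a = [0.5, 0.5]
b = [1.0]
dist = joint_distribution(W, a, b)
most_likely = max(dist, key=dist.get)
```

Lower-energy states receive higher probability, so training a Boltzmann machine amounts to shaping the energies until the states seen in the data become the likely ones. Enumerating all states only works at toy size, which is one reason the number of computations becomes a concern for larger models.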

Kirill: And probably if you want to challenge yourself to do

something extremely interesting and complex at the same

time, in terms of just grasping and getting your head around

it, Boltzmann machines are the way to go. The number of

things that you’ll probably find challenging to get

your head around in the space of deep learning is very

significantly reduced because if you understand Boltzmann

machines, then anything else is going to be a piece of cake.

That’s a very challenging topic but it’s also worth attempting

that challenge.

All right, moving on to the final model in the course: the

autoencoders. Tell us a bit more about autoencoders. Where

does the name come from?

Hadelin: Well, autoencoders are my personal favourite because —

well, I don’t know if they’re my personal favourite because I

really like Boltzmann machines too, but I like autoencoders

because basically they’re quite simple, especially after

studying Boltzmann machines, and at the same time they’re

capable of solving extremely complex problems and that’s

because they’re stacked autoencoders.

Basically, we implemented a recommender system with

Boltzmann machines that can predict binary ratings and

with the autoencoders we would take it to the next level by

predicting some ratings that are from 1 to 5, which is a more

complex problem. And yet, the autoencoder is a simpler

model because basically the simple autoencoder is

composed of three layers: the input layer that gets the

observations, the hidden layer that is a layer

with a small number of nodes compared to the input layer,

and we have the output layer. By putting the observations

into the [indecipherable 47:39] we’re trying to reconstruct

the input observations in the output layer.

That’s how it works and that’s why it’s called autoencoders,

because basically what happens is a two-step process. The


first step is the encoding step, when we try to encode the

observations into this hidden layer composed of a smaller

number of neurons, and then there is a decoding step, when

we decode the hidden layer to reconstruct the input layer

from it. So, we’re trying to replicate the input

layer by decoding the hidden layer.
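
The two-step process can be sketched as a three-layer network trained to output its own input. This is a minimal plain-Python sketch under assumptions that are not from the course: the class name `TinyAutoencoder`, a sigmoid hidden layer, a linear output layer, squared-error loss, and toy binary data.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyAutoencoder:
    """Input layer -> smaller hidden layer -> output layer of the input size."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b2 = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
                for row, b in zip(self.W1, self.b1)]

    def decode(self, h):
        return [b + sum(w * hj for w, hj in zip(row, h))
                for row, b in zip(self.W2, self.b2)]

    def loss(self, x):
        y = self.decode(self.encode(x))
        return 0.5 * sum((yk - xk) ** 2 for yk, xk in zip(y, x))

    def train_step(self, x, lr=0.1):
        """One gradient-descent step; the target is the input itself."""
        h = self.encode(x)
        y = self.decode(h)
        err = [yk - xk for yk, xk in zip(y, x)]  # gradient of the loss w.r.t. y
        # Backpropagate through the decoder into the hidden layer.
        grad_h = [sum(err[k] * self.W2[k][j] for k in range(len(x)))
                  for j in range(len(h))]
        grad_pre = [g * hj * (1 - hj) for g, hj in zip(grad_h, h)]
        for k in range(len(x)):
            for j in range(len(h)):
                self.W2[k][j] -= lr * err[k] * h[j]
            self.b2[k] -= lr * err[k]
        for j in range(len(h)):
            for i in range(len(x)):
                self.W1[j][i] -= lr * grad_pre[j] * x[i]
            self.b1[j] -= lr * grad_pre[j]


data = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0]]
ae = TinyAutoencoder(n_in=4, n_hidden=2)
before = sum(ae.loss(x) for x in data)
for _ in range(500):
    for x in data:
        ae.train_step(x)
after = sum(ae.loss(x) for x in data)
```

Because the hidden layer has fewer nodes than the input, the network cannot simply copy values across; it has to find a compressed code from which the input can be rebuilt.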

Kirill: That’s a great summary. That’s what I meant when I said

that all of these 3 unsupervised deep learning models, they

have their own ways to get around the fact that they don’t

have the data that they need to look at the real outcome.

And the way that autoencoders get around that is they make

the input be the outcome that they’re aiming toward. In a

way, they’re not purely unsupervised; sometimes they’re

called ‘self-supervised deep learning models’ because they are

in essence supervising themselves through the inputs that

they’re aiming to recreate as outputs. It’s their way of

cheating the system a little bit, but nevertheless they are

extremely powerful. And I remember during the course when

we were creating it – or just before we created it – you were

super excited that there was some breakthrough in stacked

autoencoders and you were like, “We have to include this in

the course. We definitely have to. This is all so brand new.”

Hadelin: Yes. And we actually implemented stacked autoencoders

because basically stacked autoencoders is what I just

explained but with several hidden layers. So that means that

there are several encodings and several decodings and that’s

exactly what we implement. I think we have two or three

hidden layers, but then the challenge, of course, is to change

the architecture of the model and that’s very fun. Students

will learn how to change the architecture of the model to add


more layers and to add more nodes to tune the number of

nodes and other parameters. So, they will be some sort of

artists trying to create some other structures of stacked

autoencoders. And that’s a pretty good challenge.
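
One way to picture "changing the architecture" is a class whose layers are built from a list of sizes, so that adding a layer or changing the number of nodes is a one-line change. The sketch below covers only the untrained forward pass; the class name `StackedAutoencoder`, the sigmoid hidden layers and the linear output are illustrative assumptions rather than the course's code.

```python
import math
import random


class StackedAutoencoder:
    """Autoencoder whose architecture is set by a list of hidden-layer sizes."""

    def __init__(self, n_in, hidden_sizes, seed=0):
        rng = random.Random(seed)
        # n_in=8 with hidden_sizes=[4, 2, 4] gives layers 8 -> 4 -> 2 -> 4 -> 8.
        sizes = [n_in] + list(hidden_sizes) + [n_in]
        self.layers = []
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
            weights = [[rng.gauss(0, 0.1) for _ in range(fan_in)]
                       for _ in range(fan_out)]
            self.layers.append((weights, [0.0] * fan_out))

    def forward(self, x):
        """Return the activations of every layer, input first, output last."""
        activations = [list(x)]
        for index, (weights, biases) in enumerate(self.layers):
            pre = [b + sum(w * a for w, a in zip(row, activations[-1]))
                   for row, b in zip(weights, biases)]
            is_output = index == len(self.layers) - 1
            # Sigmoid on hidden layers; the output layer is left linear.
            activations.append(
                pre if is_output else [1 / (1 + math.exp(-p)) for p in pre])
        return activations


# Two candidate architectures for the same 8-value input.
shallow = StackedAutoencoder(8, [3])
deeper = StackedAutoencoder(8, [4, 2, 4])
shapes = [len(layer) for layer in deeper.forward([1.0] * 8)]
```

Trying a different structure means editing only the `hidden_sizes` list, which leaves the experimenting to the person designing the model.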

Kirill: That really ties in with what you said on the first podcast,

which was six months ago — can you believe it has been six

months since then?

Hadelin: Time has flown.

Kirill: Yeah. And I think what you mentioned there was the artist

versus engineer. That’s two categories of data scientists.

You’re going to have the artists, somebody who definitely

can’t be replaced with machines in the near future, and the

engineer who is building these things and who’s at more risk

of being replaced by machines. And I actually found it very

interesting that you were pointing it out in the practical

tutorial as areas where students or any deep learning expert

or practitioner has to apply their creativity to come up with a

new architecture or a structure for a certain model.

Hadelin: That’s right. The deep learning scientist is definitely not only

an engineer, it’s also an artist because there is no rule of

thumb in making the perfect architecture to solve a specific

problem, so the deep learning scientist has to make some

sort of art to find the best model.

Kirill: Yeah. Fantastic! That summarizes our six models. And while

we have a bit of time still left, I wanted to get your comments

on the main two tools that we used. So we definitely covered

off different things in the course, we talk about scikit-learn,

we use all of the standard libraries, NumPy and so on

because it’s a Python-based course, but the two main tools


that we use are PyTorch and TensorFlow. TensorFlow is a

Google-developed library and PyTorch is I think a Facebook

and Yann LeCun-developed library. Can you give us a few

thoughts on that? What are the advantages, pros and cons?

Why did we use both in the course? Where do you think the

world is going and what should students or anybody getting

into the world of deep learning be focusing on at this stage?

Hadelin: What I think is that at the start, when we begin

learning deep learning, the best option is to start

with Keras, which is a wrapper of TensorFlow. The big

advantage of TensorFlow is that it has Keras. That allows

you to build some deep learning models with only a few lines

of code. So basically you don’t have to implement the deep

learning models from scratch. That’s the big advantage of

this. Then we have PyTorch and PyTorch doesn’t yet have a

wrapper like Keras to implement deep learning models in a

few lines of code, but I think it’s actually more powerful than

TensorFlow because it can handle even more complex deep

learning models. These more complex deep learning models

are the dynamic graphs because if we go deeper into the

theory of deep learning, we will bump into the dynamic

graphs which are basically making a deep learning model

that is dynamic and no longer static.

The dynamic graph is a new feature that PyTorch can

handle, unlike TensorFlow. I really encourage the students

to handle both libraries because maybe they will be able to

solve some problems with PyTorch that they couldn’t solve

with TensorFlow. Actually, PyTorch is very recent so there’s

still a lot of debate about that. I think it’s not only able to

solve some very complex problems, like dynamic graphs, but


also I think it’s very practical because students will see that

when we implement the models from scratch with PyTorch,

well it’s actually very practical and intuitive. Okay, it takes

some more lines of code than Keras requires, but we can see

in the end when we take a step back at what we

implemented that it’s actually very intuitive and then very

easy to change the architecture of the model. I highly

recommend PyTorch for two reasons: it is able to solve some

complex problems, and the very practical side. But for

beginners, I would rather recommend TensorFlow and Keras.

Kirill: Gotcha. Very, very good. I’m very excited that we covered off

both of these in the course. In the autoencoders and

Boltzmann machine side of things, you even go into things

like how to develop your own class and how to use that in

deep learning. That’s very powerful, I think.

Hadelin: That’s a big plus of the course. Students learn the important

tricks of Python and learn the most important techniques,

like building classes, because basically I’m sure that when

they start some projects of deep learning, they will have to

implement their own model. It is actually 100% sure that

they will have to make a class. Classes and objects are very

important to understand and handle in Python and I made

sure to explain all this and what’s the use of all this and

how we can use them to build some deep learning models

from scratch.

Kirill: Okay. Thank you very much for sharing all of that and for

coming on the show again. This brings us to the end of

today’s episode. One thing I would like to ask you to close out

this show is: what book can you recommend to


people who are interested in learning more about deep

learning?

Hadelin: Okay. I highly recommend the “Deep Learning Book” by Ian

Goodfellow and Yoshua Bengio and Aaron Courville. It’s

actually a book that they can get online at

www.deeplearningbook.org. Basically, this is a really, really

good book even if they want to dive deep into the math and

the theory. It covers pretty much what we discuss in the

course, but more on the theory side. It’s a very, very good

book.

Kirill: It’s free, right?

Hadelin: It’s free, yes. And then I also recommend some other books.

Actually, on Facebook I’m part of a deep learning group

where we discuss the latest technologies in deep learning or

the latest theory that appears in the latest papers. And then

regularly we have some surveys or some questions that are

asked in this group, and one of the questions was “What is

the deep learning Bible? What is the best book people would

recommend?” Of course, there was the “Deep Learning

Book” that I just mentioned. Then there was also the book

“The Elements of Statistical Learning”. That’s an incredible

book because it will give you everything you need, all the

basics you need to really have a deep understanding of deep

learning.

Kirill: Gotcha. All right, thank you very much for sharing that. I

think this was quite an exciting session venturing into the

world of deep learning and hopefully quite a few interesting

things can be picked out from what we shared today. Once


again, thank you so much for coming on the show. I really

appreciate your time.

Hadelin: Thank you very much. I was very happy to come back again.

Kirill: Great. All right, see you soon. Take care.

Hadelin: Yeah, see you soon. Bye.

Kirill: So there you have it. That was deep learning for you, a

quick summary of all the six models. Of course, there’s

many more models in deep learning, but these are the six

main ones which we identified and covered off in Deep

Learning A-Z and hopefully in this podcast you got a quick

glimpse of what the world of deep learning is like and how

these models are structured and what they can be used for.

Personally, my favourite out of all the six is the Boltzmann

machine just because it’s so cool. Just because it has that —

as Hadelin mentioned, you can look at it in two different

ways. I look at it through the lens of an energy-based model,

so it uses a Boltzmann distribution. Actually, when we were

creating the course we modified the Wikipedia article for

Boltzmann distribution and that was so cool. There’s

actually a tutorial in the course where we go into the

Wikipedia article for Boltzmann distribution and add some

text, just one line about the Boltzmann machine.

It’s very close and dear to my heart because of what I used

to do in my bachelor’s degree in physics. It is a very

interesting way to model systems. The Boltzmann

distribution is actually used to model energy-based systems

that include things like entropy, and the gas in the room

you’re sitting in is distributed according to a


Boltzmann distribution. It’s taking the state that has a

minimal energy cost. So that was very interesting for me, to

learn about and to kind of include tutorials about it. So

that’s my favourite for sure. And I’d be very curious to find

out what your favourite is. If you haven’t ventured into the

world of deep learning a lot yet, probably this is your first

intro to that world, and it’s going to be hard to decide at this

stage what your favourite is. But maybe if you do venture

into the world of deep learning one day, I’ll be really

interested to find out which one of those six models you

would prefer the most yourself.

So that’s that. Deep learning is definitely a field to be at least

aware of. That’s where the world is going. As we mentioned

in the course intro video, what we’re seeing is a shift from

machine learning to deep learning. We’re seeing that deep

learning methods are becoming so advanced and

sophisticated that just through their sheer complexity, not

in the sense of understanding them, but in the sense of how

complex the problems are that they can tackle, just through

that, they are edging out machine learning methods.

I think Ben Taylor on one of the previous episodes in the

podcast actually mentioned that you can beat the accuracy

of a machine learning algorithm with deep learning methods

very easily. For instance, on the MNIST dataset, which is a

dataset of handwritten digits, you have to be very, very

proficient in machine learning to get an accuracy anywhere

above 95%. On average, you’ll probably get like 90%-92%

accuracy. You could just be starting out into deep learning

and achieve an accuracy of 98% just because it’s that


powerful and you just have to write a few lines of code if you

use Keras, as Hadelin mentioned.

So deep learning is definitely something to be aware of.

That’s where all the technology is going and in my view, it’s

the future of data science. So hopefully this podcast gave

you a good overview of what to expect or the different types

of buzzwords and buzz terms you might be hearing in the

near future.

So there we go. I hope you enjoyed this episode and make

sure to leave a review on iTunes if you’re listening on iTunes.

It would really help us out and help spread the word about

this podcast. And on that note, I look forward to seeing you

next time. Until then, happy analyzing.