
Basics of deep learning

By

Pierre-Marc Jodoin and Christian Desrosiers

Let's start with a simple example

[Image: From Wikimedia Commons, the free media repository]

Each patient n is described by a feature vector x = (temp, freq) and a label t = diagnostic. The N patients together form the training dataset D:

            x = (temp, freq)   t = diagnostic
Patient 1   (37.5, 72)         Healthy
Patient 2   (39.1, 103)        Sick
Patient 3   (38.3, 100)        Sick
(…)         …                  …
Patient N   (36.7, 88)         Healthy

[Figure: the samples of D plotted in the (temp, freq) plane; the Sick and Healthy patients form two groups.]

Let's start with a simple example

A new patient comes to the hospital. How can we determine if he is sick or not?

[Figure: the (temp, freq) plane with the Sick and Healthy training samples and the new patient.]

[Image: From Wikimedia Commons, the free media repository]

Let's start with a simple example

Solution: divide the feature space into 2 regions: sick and healthy.

[Figure: the (temp, freq) plane split into a Sick region and a Healthy region by a decision boundary.]

[Image: From Wikimedia Commons, the free media repository]

More formally

y(x) = Healthy   if x is in the blue region
       Sick      otherwise

with targets t = Sick or t = Healthy.

[Figure: the (temp, freq) plane with the Healthy (blue) region and the Sick region.]

[Image: From Wikimedia Commons, the free media repository]

How to split the feature space?

Definition… a line!

y = m x + b,   where m is the slope and b is the bias.

Moving everything to one side of the equality:

m x - y + b = 0

Rename variables

m → w_1,   -1 → w_2,   b → w_0,   and   x → x_1,   y → x_2.

The line becomes

w_1 x + w_2 y + w_0 = 0
w_1 x_1 + w_2 x_2 + w_0 = 0

and we define the classification function

y_w(x) = w_1 x_1 + w_2 x_2 + w_0

with y_w(x) > 0 on one side of the line, y_w(x) = 0 on the line, and y_w(x) < 0 on the other side.

Classification function

y_w(x) = w_1 x_1 + w_2 x_2 + w_0,   with N = (w_1, w_2) the normal vector of the line w_1 x_1 + w_2 x_2 + w_0 = 0.

Example: for given weights (the slide uses the values 0.4, 0.2, 0.1) and test points such as (2, 3), (1, 6) and (4, 1), evaluating y_w(x) gives a positive value (≈ 0.7) for points on the positive side of the line and a negative value (≈ -0.6) for points on the other side.

Classification function

y_w(x) = w_1 x_1 + w_2 x_2 + w_0,   with y_w(x) > 0, y_w(x) = 0 and y_w(x) < 0 identifying the two sides of the line and the line itself, and N = (w_1, w_2) normal to the line.

To simplify notation, gather the parameters and the input into vectors with the bias included:

w = (w_0, w_1, w_2)^T,   x' = (1, x_1, x_2)^T

y_w(x) = w_1 x_1 + w_2 x_2 + w_0 = w^T x'   (a DOT product)

Linear classification = dot product with bias included:   y_w(x) = w^T x
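A minimal NumPy sketch of this dot-product-with-bias classifier (function names and the example weight values are illustrative, not from the slides):

```python
import numpy as np

# Linear classification = dot product with the bias folded in:
# y_w(x) = w^T x' with x' = (1, x1, x2).
def y_w(w, x):
    x_prime = np.concatenate(([1.0], x))   # prepend 1 so w[0] acts as the bias w0
    return np.dot(w, x_prime)

w = np.array([0.4, 0.2, 0.1])              # (w0, w1, w2), illustrative values
print(y_w(w, np.array([2.0, 3.0])))        # the sign tells on which side of the line the point lies
```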

Learning

With the training dataset D, the GOAL is to find the parameters w_0, w_1, w_2 that best separate the two classes:

y_w(x) > 0 for the Sick samples and y_w(x) < 0 for the Healthy samples.

[Figure: the (temp, freq) plane with a candidate separating line.]

How do we know if a model is good?

Loss function L(y_w(x), D): it measures how badly the current line separates the training data D.

L(y_w(x), D) ≈ 0   → Good!
L(y_w(x), D) > 0   → Medium
L(y_w(x), D) >> 0  → BAD!

[Figure: three candidate lines in the (temp, freq) plane illustrating a good, a medium and a bad fit.]

So far…

1. Training dataset: D
2. Classification function (a line in 2D): y_w(x) = w_1 x_1 + w_2 x_2 + w_0
3. Loss function: L(y_w(x), D)
4. Training: find w_0, w_1, w_2 that minimize L(y_w(x), D)

Today: Perceptron, Logistic regression, Multi-layer perceptron, Conv Nets

Perceptron

Rosenblatt, Frank (1958). "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." Psychological Review, 65(6), 386-408.

Perceptron (2D and 2 classes)

[Diagram: inputs 1, x_1, x_2 with weights w_0, w_1, w_2 feeding a single neuron.]

The neuron first computes the dot product

w^T x = w_0 + w_1 x_1 + w_2 x_2

and then applies an activation function. For the perceptron, the activation is the sign function:

y_w(x) = sign(w^T x) ∈ {-1, +1}

so y_w(x) > 0 on one side of the line w^T x = 0 and y_w(x) < 0 on the other side.

Neuron = dot product + activation function.
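A minimal sketch of this neuron in NumPy (illustrative names and weight values, not from the slides):

```python
import numpy as np

# Perceptron neuron: dot product + sign activation.
def perceptron_predict(w, x):
    x_prime = np.concatenate(([1.0], x))    # fold the bias w0 into the dot product
    return np.sign(np.dot(w, x_prime))      # sign activation: output in {-1, +1}

w = np.array([0.1, 0.5, -0.3])              # (w0, w1, w2), arbitrary example values
print(perceptron_predict(w, np.array([38.5, 95.0])))
```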

So far…

1. Training dataset: D
2. Classification function (a line in 2D): y_w(x) = w_1 x_1 + w_2 x_2 + w_0
3. Loss function: L(y_w(x), D)
4. Training: find w_0, w_1, w_2 that minimize L(y_w(x), D)

[Diagram: the perceptron neuron with inputs 1, x_1, x_2, weights w_0, w_1, w_2, a dot product w^T x and a sign activation producing y_w(x) = sign(w^T x) ∈ {-1, +1}.]

Linear classifiers have limits

Non-linearly separable training data

Linear classifier = large error rate

Three classical solutions

1. Acquire more observations

2. Use a non-linear classifier

3. Transform the data

Non-linearly separable training data — first solution: acquire more observations.

Acquire more data

Each patient gets a third feature (headache), so x goes from (temp, freq) to (temp, freq, headache):

            x = (temp, freq, headache)   t = diagnostic
Patient 1   (37.5, 72, 2)                healthy
Patient 2   (39.1, 103, 8)               sick
Patient 3   (38.3, 100, 6)               sick
(…)         …                            …
Patient N   (36.7, 88, 0)                healthy

In 2D the decision boundary is a line:
y_w(x) = w_1 x_1 + w_2 x_2 + w_0                  (line)

With 3 features it becomes a plane:
y_w(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0        (plane)

[Figure: the patients plotted in the 3D (temp, freq, headache) space; the extra dimension can make the classes separable.]

Non-linearly separable training data

Perceptron (2D and 2 classes)
y_w(x) = sign(w^T x) = sign(w_1 x_1 + w_2 x_2 + w_0)                     (decision boundary: a line)

Perceptron (3D and 2 classes)
y_w(x) = sign(w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0)                         (decision boundary: a plane)
y_w(x) = +1 on one side of the plane, y_w(x) = -1 on the other.

Perceptron (K-D and 2 classes)
y_w(x) = sign(w_1 x_1 + w_2 x_2 + w_3 x_3 + … + w_K x_K + w_0)           (decision boundary: a hyperplane)

[Diagrams: the same neuron as before with inputs 1, x_1, …, x_K, weights w_0, w_1, …, w_K, dot product and sign activation.]

Learning a machine

The goal: with a set of training data D = {(x_1, t_1), (x_2, t_2), …, (x_N, t_N)}, estimate w so that

y_w(x_n) ≈ t_n   for n = 1, …, N.

In other words, minimize the training loss (an optimization problem):

L(y_w(x), D) = Σ_{n=1}^{N} l(y_w(x_n), t_n)

Loss function (recap)

L(y_w(x), D) ≈ 0 → Good!      L(y_w(x), D) > 0 → Medium      L(y_w(x), D) >> 0 → BAD!

[Figure: the loss L(y_w(x), D) plotted as a surface over the parameters (w_1, w_2).]

Perceptron

Question: how to find the best solution, i.e. the w for which L(y_w(x), D) is smallest?

Random initialization:

[w1,w2]=np.random.randn(2)

[Figure: a randomly initialized point on the loss surface over (w_1, w_2).]

Gradient descent

Question: how to find the best solution?

w^{[k+1]} = w^{[k]} - η ∇_w L(y_w(x), D)

where ∇_w L is the gradient of the loss function and η is the learning rate.
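A minimal gradient-descent sketch in NumPy (the function names and the toy quadratic loss are illustrative assumptions, not from the slides):

```python
import numpy as np

# w[k+1] = w[k] - eta * grad_L(w[k]), repeated for a fixed number of steps.
def gradient_descent(grad_L, w0, eta=0.1, max_iter=100):
    w = w0.copy()
    for _ in range(max_iter):
        w = w - eta * grad_L(w)        # one gradient-descent step
    return w

# Toy example: L(w) = ||w||^2, whose gradient is 2w; the minimum is at (0, 0).
w_star = gradient_descent(lambda w: 2 * w, np.random.randn(2))
print(w_star)
```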

Perceptron Criterion (loss)

Observation: a sample (x_n, t_n) is wrongly classified when

w^T x_n > 0 and t_n = -1,     or     w^T x_n < 0 and t_n = +1.

Consequently, -w^T x_n t_n is ALWAYS positive for wrongly classified samples.

This gives the perceptron criterion:

L(y_w(x), D) = - Σ_{x_n ∈ V} w^T x_n t_n,   where V is the set of wrongly classified samples.
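A minimal NumPy sketch of the perceptron criterion (names and the toy data are illustrative):

```python
import numpy as np

# Perceptron criterion: sum of -t_n * w^T x_n over the wrongly classified samples V.
def perceptron_loss(w, X, t):
    X_prime = np.hstack([np.ones((len(X), 1)), X])   # prepend 1 to every sample (bias)
    scores = X_prime @ w                              # w^T x_n for all n
    wrong = (scores * t) <= 0                         # V: samples on the wrong side
    return -np.sum(scores[wrong] * t[wrong])          # always >= 0

X = np.array([[2.0, 3.0], [1.0, 6.0], [4.0, 1.0]])
t = np.array([+1, -1, +1])
print(perceptron_loss(np.array([0.1, 0.2, -0.3]), X, t))
```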

So far…

1. Training dataset: D
2. Linear classification function: y_w(x) = sign(w^T x) = sign(w_1 x_1 + w_2 x_2 + … + w_M x_M + w_0)
3. Loss function (perceptron criterion): L(y_w(x), D) = - Σ_{x_n ∈ V} w^T x_n t_n, where V is the set of wrongly classified samples
4. Training: find the w that minimizes L(y_w(x), D):

w* = argmin_w L(y_w(x), D)

[Diagram: the perceptron neuron with inputs 1, x_1, …, x_K, weights w_0, …, w_K, dot product and sign activation.]

Optimization

w^{[k+1]} = w^{[k]} - η ∇L,   where ∇L is the gradient of the loss function and η the learning rate.

Stochastic gradient descent (SGD):

  Init w, k = 0
  DO k = k + 1
    FOR n = 1 to N
      w ← w - η ∇L(x_n)
  UNTIL every sample is well classified or k == MAX_ITER

Perceptron gradient descent

For the perceptron criterion L(y_w(x), D) = - Σ_{x_n ∈ V} w^T x_n t_n, the gradient is

∇_w L(y_w(x), D) = - Σ_{x_n ∈ V} t_n x_n

NOTE on the learning rate η:
• Too low => slow convergence
• Too large => might not converge (it can even diverge)
• It can decrease at each iteration (e.g. η^{[k]} = cst / k)

Stochastic gradient descent (SGD) for the perceptron:

  Init w, k = 0
  DO k = k + 1
    FOR n = 1 to N
      IF t_n w^T x_n ≤ 0 THEN        /* wrongly classified */
        w ← w + η t_n x_n
  UNTIL every sample is well classified OR k == K_MAX
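A minimal NumPy training loop following this pseudocode (function names and the small toy dataset are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, k_max=1000):
    X_prime = np.hstack([np.ones((len(X), 1)), X])    # bias folded in
    w = np.random.randn(X_prime.shape[1])             # random initialization
    for _ in range(k_max):
        errors = 0
        for x_n, t_n in zip(X_prime, t):
            if t_n * np.dot(w, x_n) <= 0:              # wrongly classified
                w += eta * t_n * x_n                   # perceptron update
                errors += 1
        if errors == 0:                                # every sample well classified
            break
    return w

# Toy 2-D samples (e.g. rescaled (temp, freq) features); -1 = healthy, +1 = sick.
X = np.array([[0.75, 0.72], [0.91, 1.03], [0.83, 1.00], [0.67, 0.88]])
t = np.array([-1, +1, +1, -1])
print(train_perceptron(X, t))
```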

Similar loss functions

Perceptron criterion:
L(y_w(x), D) = - Σ_{x_n ∈ V} t_n w^T x_n,   where V is the set of wrongly classified samples

which can be rewritten over all samples as
L(y_w(x), D) = Σ_{n=1}^{N} max(0, -t_n w^T x_n)

and a closely related variant with a margin of 1:
L(y_w(x), D) = Σ_{n=1}^{N} max(0, 1 - t_n w^T x_n)        ("Hinge loss" or "SVM" loss)

Multiclass Perceptron (2D and 3 classes)

One linear function per class i, each with its own weight vector w_i = (w_{i,0}, w_{i,1}, w_{i,2})^T:

y_{w,0}(x) = w_0^T x = w_{0,0} + w_{0,1} x_1 + w_{0,2} x_2
y_{w,1}(x) = w_1^T x = w_{1,0} + w_{1,1} x_1 + w_{1,2} x_2
y_{w,2}(x) = w_2^T x = w_{2,0} + w_{2,1} x_1 + w_{2,2} x_2

The predicted class is the one with the largest score:

i* = argmax_i y_{W,i}(x)

In matrix form: y_W(x) = W^T x, where the three weight vectors are gathered in the matrix W and x = (1, x_1, x_2)^T; the result is the vector of the three class scores.

[Figure: the (x_1, x_2) plane divided into three regions, one region per class where its y_{w,i}(x) is the maximum.]

Multiclass Perceptron (2D and 3 classes) — Example

For the input x = (1.1, -2.0), the three class scores y_{W,i}(x) are computed with the weight matrix shown on the slide; the class with the largest of the three scores is the predicted class.

Multiclass Perceptron — Loss function

L(y_W(x), D) = Σ_{x_n ∈ V} ( w_j^T x_n - w_{t_n}^T x_n )

i.e. a sum over all wrongly classified samples of the score of the wrong (predicted) class j minus the score of the true class t_n. Its gradient is a sum of ± x_n over the wrongly classified samples:

∇ L(y_W(x), D) = Σ_{x_n ∈ V} x_n      (+x_n for the wrongly predicted class, -x_n for the true class)

Multiclass Perceptron — Stochastic gradient descent (SGD)

  Init W, k = 0
  DO k = k + 1
    FOR n = 1 to N
      j = argmax_i w_i^T x_n
      IF j ≠ t_n THEN                  /* wrongly classified sample */
        w_{t_n} ← w_{t_n} + η x_n
        w_j     ← w_j     - η x_n
  UNTIL every sample is well classified or k > K_MAX
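A minimal NumPy sketch of this multiclass update rule (illustrative names and toy data, not from the slides):

```python
import numpy as np

# W has one row of weights per class; the true class is pushed up, the wrong one down.
def train_multiclass_perceptron(X, t, n_classes, eta=1.0, k_max=1000):
    X_prime = np.hstack([np.ones((len(X), 1)), X])      # bias folded in
    W = np.zeros((n_classes, X_prime.shape[1]))
    for _ in range(k_max):
        errors = 0
        for x_n, t_n in zip(X_prime, t):
            j = np.argmax(W @ x_n)                       # predicted class
            if j != t_n:                                 # wrongly classified
                W[t_n] += eta * x_n                      # raise the true class score
                W[j]   -= eta * x_n                      # lower the wrong class score
                errors += 1
        if errors == 0:
            break
    return W

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
t = np.array([0, 1, 2])
W = train_multiclass_perceptron(X, t, n_classes=3)
x_new = np.concatenate(([1.0], [0.9, 0.1]))              # (bias, x1, x2)
print(np.argmax(W @ x_new))                              # predicted class
```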

Perceptron

Advantages:
• Very simple
• Does NOT assume the data follows a Gaussian distribution
• If the data is linearly separable, convergence is guaranteed

Limitations:
• Zero gradient for many solutions => several "perfect" solutions
• The data must be linearly separable

[Figures: non-separable data where the perceptron will never converge; a suboptimal solution; many "optimal" solutions.]

Two famous ways of improving the Perceptron

1. New activation function + new loss => Logistic regression
2. New network architecture => Multilayer Perceptron / CNN

Logistic regression (2D, 2 classes)

New activation function: the sigmoid

σ(t) = 1 / (1 + e^{-t})

The neuron still computes the dot product w^T x, but now applies the sigmoid instead of the sign:

y_w(x) = σ(w^T x) = 1 / (1 + e^{-w^T x}) ∈ [0, 1]

[Diagram: inputs 1, x_1, x_2 with weights w_0, w_1, w_2, dot product w^T x followed by a sigmoid.]
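A minimal NumPy sketch of the logistic-regression neuron (illustrative names and weight values):

```python
import numpy as np

# Dot product + sigmoid activation: output in [0, 1].
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_predict(w, x):
    x_prime = np.concatenate(([1.0], x))     # bias folded in
    return sigmoid(np.dot(w, x_prime))

w = np.array([0.5, -1.2, 0.8])               # (w0, w1, w2), arbitrary example values
print(logistic_predict(w, np.array([0.4, 1.0])))
```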

Logistic regression (2D, 2 classes)

y_w(x) = σ(w^T x) ∈ [0, 1]

y_w(x) = 0.5 on the decision line, y_w(x) → 1 on one side of it and y_w(x) → 0 on the other.

[Figure: the sigmoid output over the (x_1, x_2) plane.]

Logistic regression (2D, 2 classes) — Example

For x_n = (0.4, 1.0) and the weights w = (w_0, w_1, w_2) shown on the slide (values 2.0, 3.6, 0.5):

w^T x_n = -1.94
y_w(x_n) = σ(-1.94) = 1 / (1 + e^{1.94}) ≈ 0.125

Since 0.125 is lower than 0.5, x_n lies behind (on the negative side of) the plane.

Logistic regression (K-D, 2 classes)

Like the Perceptron, logistic regression accommodates K-dimensional input vectors:

y_w(x) = σ(w^T x) = σ(w_0 + w_1 x_1 + … + w_K x_K)

[Diagram: inputs 1, x_1, …, x_K with weights w_0, w_1, …, w_K feeding a sigmoid neuron.]

What is the loss function?

With a sigmoid, we can simulate the conditional probability of class c_1 GIVEN x:

y_w(x) = σ(w^T x) = P(c_1 | x, w)
P(c_0 | x, w) = 1 - y_w(x)

The cost function is the -ln of the likelihood, which gives the 2-class cross entropy:

L(y_w(x), D) = - Σ_{n=1}^{N} [ t_n ln y_w(x_n) + (1 - t_n) ln(1 - y_w(x_n)) ]

We can also show that its gradient is

∇_w L(y_w(x), D) = Σ_{n=1}^{N} ( y_w(x_n) - t_n ) x_n

As opposed to the Perceptron, the gradient does not depend on the set of wrongly classified samples: every sample contributes.
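A minimal NumPy sketch of this 2-class cross entropy and its gradient (illustrative names; here t is 0/1, matching the formula above):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy_and_grad(w, X, t):
    X_prime = np.hstack([np.ones((len(X), 1)), X])       # bias folded in
    y = sigmoid(X_prime @ w)                              # y_w(x_n) for all n
    loss = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
    grad = X_prime.T @ (y - t)                            # sum_n (y_n - t_n) x_n
    return loss, grad

X = np.array([[0.4, 1.0], [2.0, -1.0]])
t = np.array([1.0, 0.0])
print(cross_entropy_and_grad(np.zeros(3), X, t))
```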

Logistic network

Advantages:
• More stable than the Perceptron
• More effective when the data is not separable

[Figure: decision boundaries of a Perceptron vs. a logistic network on the same data.]

And for K > 2 classes? New activation function: Softmax

Each class i has its own weight vector w_i. The scores w_i^T x are exponentiated and normalized:

y_{w,i}(x) = e^{w_i^T x} / Σ_c e^{w_c^T x}

so that y_{w,0}(x) = P(c_0 | x, W), y_{w,1}(x) = P(c_1 | x, W), y_{w,2}(x) = P(c_2 | x, W), …

[Diagram: inputs 1, x_1, x_2, three linear units w_0^T x, w_1^T x, w_2^T x, each followed by an exp and a shared normalization.]
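A minimal NumPy sketch of the softmax classifier (illustrative names and weight values):

```python
import numpy as np

# Softmax: turns K class scores into a probability distribution.
def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / np.sum(e)

def softmax_classifier(W, x):
    x_prime = np.concatenate(([1.0], x))  # bias folded in
    return softmax(W @ x_prime)           # one probability per class

W = np.array([[0.1, 0.5, -0.3],
              [0.0, -0.2, 0.7],
              [0.3, 0.1, 0.1]])           # 3 classes x (1 bias + 2 inputs)
print(softmax_classifier(W, np.array([1.1, -2.0])))   # sums to 1
```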

And for K > 2 classes? Class labels become one-hot vectors (example: CIFAR-10)

'airplane'    t = (1,0,0,0,0,0,0,0,0,0)
'automobile'  t = (0,1,0,0,0,0,0,0,0,0)
'bird'        t = (0,0,1,0,0,0,0,0,0,0)
'cat'         t = (0,0,0,1,0,0,0,0,0,0)
'deer'        t = (0,0,0,0,1,0,0,0,0,0)
'dog'         t = (0,0,0,0,0,1,0,0,0,0)
'frog'        t = (0,0,0,0,0,0,1,0,0,0)
'horse'       t = (0,0,0,0,0,0,0,1,0,0)
'ship'        t = (0,0,0,0,0,0,0,0,1,0)
'truck'       t = (0,0,0,0,0,0,0,0,0,1)

Cross-entropy loss for K > 2 classes (K-class cross entropy):

L(y_W(x), D) = - Σ_{n=1}^{N} Σ_{k=1}^{K} t_{n,k} ln y_{W,k}(x_n)

and its gradient has the same simple form as in the 2-class case:

∇ L(y_W(x), D) = Σ_{n=1}^{N} x_n ( y_W(x_n) - t_n )

Regularization

Different weights may give the same score. Example: with x = (1.0, 1.0, 1.0),
w_1 = (1, 0, 0)^T and w_2 = (1/3, 1/3, 1/3)^T give the same score, w_1^T x = w_2^T x = 1.

Solution: maximum a posteriori. Add a regularization term to the loss:

W* = argmin_W  L(y_W(x), D) + λ R(W)

where L is the loss function, R(W) is the regularization term (in general L1 or L2: R(W) = ||W||_1 or ||W||_2), and λ is a constant.

Wow! Loooots of information!

Let's recap…

Neural networks, 2 classes

Perceptron:            y_w(x) = sign(w^T x) ∈ {-1, +1}       (sign activation)
Logistic regression:   y_w(x) = σ(w^T x) = P(c_1 | x)         (sigmoid activation)

[Diagrams: the same neuron (inputs 1, x_1, …, x_K, weights w_0, …, w_K) with a sign or a sigmoid activation.]

K-class neural networks

Multiclass Perceptron:   class = argmax_i y_{W,i}(x) = argmax_i w_i^T x
Logistic regression:     softmax activation, y_{w,i}(x) = e^{w_i^T x} / Σ_c e^{w_c^T x} = P(c_i | x)

[Diagrams: K linear units w_i^T x; for the Perceptron an argmax picks the class, for logistic regression an exp + normalization (softmax) outputs the class probabilities.]

Loss functions (recap)

2 classes:
• Perceptron criterion:  L(y_w(x), D) = - Σ_{x_n ∈ V} t_n w^T x_n,  where V is the set of wrongly classified samples
• Equivalent form:       L(y_w(x), D) = Σ_{n=1}^{N} max(0, -t_n w^T x_n)
• "Hinge loss" or "SVM" loss:  L(y_w(x), D) = Σ_{n=1}^{N} max(0, 1 - t_n w^T x_n)
• Cross entropy loss:    L(y_w(x), D) = - Σ_{n=1}^{N} [ t_n ln y_w(x_n) + (1 - t_n) ln(1 - y_w(x_n)) ]

K classes:
• Multiclass Perceptron criterion:  L(y_W(x), D) = Σ_{x_n ∈ V} ( w_j^T x_n - w_{t_n}^T x_n ),  where V is the set of wrongly classified samples
• Equivalent form:                  L(y_W(x), D) = Σ_{n=1}^{N} Σ_{j ≠ t_n} max(0, w_j^T x_n - w_{t_n}^T x_n)
• "Hinge loss" or "SVM" loss:       L(y_W(x), D) = Σ_{n=1}^{N} Σ_{j ≠ t_n} max(0, 1 + w_j^T x_n - w_{t_n}^T x_n)
• Cross entropy loss with a Softmax:  L(y_W(x), D) = - Σ_{n=1}^{N} Σ_{k=1}^{K} t_{n,k} ln y_{W,k}(x_n)

Maximum a posteriori (recap)

W* = argmin_W  Σ_{n=1}^{N} l( y_W(x_n), t_n )  +  λ R(W)

with the loss function on the left, the regularization term R(W) = ||W||_1 or ||W||_2 on the right, and λ a constant.

Optimization (recap)

w^{[k+1]} = w^{[k]} - η ∇L,   with ∇L the gradient of the loss function and η the learning rate.

Stochastic gradient descent (SGD):

  Init w, k = 0
  DO k = k + 1
    FOR n = 1 to N
      w ← w - η ∇L(x_n)
  UNTIL every sample is well classified or k == MAX_ITER

Now, let's go DEEPER

Non-linearly separable training data — three classical solutions:

1. Acquire more data
2. Use a non-linear classifier
3. Transform the data

This time, the second solution: use a non-linear classifier.

2D, 2 classes, linear logistic regression

Input layer (3 "neurons": 1, x_1, x_2) → output layer (1 neuron with sigmoid):

y_w(x) = σ(w^T x) ∈ R

Let's add 3 neurons

Input layer (3 "neurons") → first layer (3 neurons), each with its own weight vector:

σ(w_0^T x) ∈ R,    σ(w_1^T x) ∈ R,    σ(w_2^T x) ∈ R

NOTE: The output of the first layer is a vector of 3 real values.

Stacking the 3 weight vectors as the rows of a matrix W^{[0]} (3×3, bias included), the whole first layer is one matrix product followed by an element-wise sigmoid:

σ( W^{[0]} x ) ∈ R^3,   with x = (1, x_1, x_2)^T.

If we want a 2-class classification via a logistic regression (a cross-entropy loss), we must add an output neuron.

2-D, 2-class, 1 hidden layer

Input layer (3 "neurons") → hidden layer (3 neurons, weights W^{[0]}) → output layer (1 neuron, weights w^{[1]}):

y_W(x) = σ( w^{[1]T} σ( W^{[0]} x ) ) ∈ R
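A minimal NumPy sketch of this 1-hidden-layer forward computation (illustrative names; the random weights are placeholders, and the hidden layer's bias neuron is added explicitly):

```python
import numpy as np

# y_W(x) = sigmoid( w1^T sigmoid( W0 x ) )
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(W0, w1, x):
    x_prime = np.concatenate(([1.0], x))          # input layer: (1, x1, x2)
    h = sigmoid(W0 @ x_prime)                     # hidden layer: 3 values
    h_prime = np.concatenate(([1.0], h))          # add the hidden layer's bias neuron
    return sigmoid(np.dot(w1, h_prime))           # output neuron in [0, 1]

W0 = np.random.randn(3, 3)                        # 3 hidden neurons x 3 inputs  (9 parameters)
w1 = np.random.randn(4)                           # 1 output neuron x 4 values   (4 parameters)
print(mlp_forward(W0, w1, np.array([0.4, 1.0])))
```

Counting the weights in this sketch gives the same 13 parameters as on the next slide (3×3 + 1×4).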

Visual simplification

The same network drawn compactly: input layer, hidden layer (3 neurons plus a bias neuron), output layer:

y_W(x) = σ( w^{[1]T} σ( W^{[0]} x ) )

This network contains a total of 13 parameters:
• Input layer → hidden layer:   W^{[0]} is 3×3
• Hidden layer → output layer:  w^{[1]} is 1×4 (3 hidden outputs + 1 bias neuron)

2-D, 2-class, 1 hidden layer: increasing the number of neurons = increasing the capacity of the model.

With 5 hidden neurons, this network has 5×3 + 1×6 = 21 parameters.

y_W(x) = σ( w^{[1]T} σ( W^{[0]} x ) )

Number of neurons VS capacity
(demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

• No hidden neuron: linear classification, underfitting (low capacity)
• 12 hidden neurons: non-linear classification, good result (good capacity)
• 60 hidden neurons: non-linear classification, overfitting (too large capacity)

k-D, 2 classes, 1 hidden layer

Increasing the dimensionality of the data = more columns in W^{[0]}.

With 5 hidden neurons and a k-dimensional input, this network has 5×(k+1) + 1×6 parameters.

y_W(x) = σ( w^{[1]T} σ( W^{[0]} x ) )

k-D, 2 classes, 2 hidden layers

Adding a hidden layer = adding a matrix multiplication.

This network has 5×(k+1) + 6×3 + 1×4 parameters:

y_W(x) = σ( w^{[2]T} σ( W^{[1]} σ( W^{[0]} x ) ) )

with W^{[0]} ∈ R^{5×(k+1)},  W^{[1]} ∈ R^{3×6},  w^{[2]} ∈ R^4.

k-D, 2 classes, 4-hidden-layer network

y_W(x) = σ( w^{[4]T} σ( W^{[3]} σ( W^{[2]} σ( W^{[1]} σ( W^{[0]} x ) ) ) ) )

with W^{[0]} ∈ R^{5×(k+1)},  W^{[1]} ∈ R^{3×6},  W^{[2]} ∈ R^{4×4},  W^{[3]} ∈ R^{7×5},  w^{[4]} ∈ R^8.

This network has 5×(k+1) + 6×3 + 4×4 + 7×5 + 1×8 parameters.

NOTE: More hidden layers = deeper network = more capacity.

Multilayer Perceptron

[Diagram: the input x = (1, x_1, …, x_K) flows through several hidden layers to produce the output y_W(x) = f(x).]

Example

[Diagram: the input data x enters on the left; the output of the last layer is the output of the network, y_W(x).]

A K-class neural network has K output neurons.

k-D, 4 classes, 4-hidden-layer network

y_W(x) = W^{[4]} σ( W^{[3]} σ( W^{[2]} σ( W^{[1]} σ( W^{[0]} x ) ) ) )

with W^{[0]} ∈ R^{5×(k+1)},  W^{[1]} ∈ R^{3×6},  W^{[2]} ∈ R^{4×4},  W^{[3]} ∈ R^{7×5},  W^{[4]} ∈ R^{4×8}.

The four outputs y_{W,0}(x), y_{W,1}(x), y_{W,2}(x), y_{W,3}(x) are given by the four rows of W^{[4]} applied to the last hidden layer.

k-D, 4 classes, 4-hidden-layer network (raw scores)

y_W(x) = W^{[4]} σ( W^{[3]} σ( W^{[2]} σ( W^{[1]} σ( W^{[0]} x ) ) ) )

The four output scores y_{W,0}(x), …, y_{W,3}(x) can be trained directly with a Hinge loss.

k-D, 4 classes, 4-hidden-layer network with a Softmax output

y_W(x) = softmax( W^{[4]} σ( W^{[3]} σ( W^{[2]} σ( W^{[1]} σ( W^{[0]} x ) ) ) ) )

Each output score is exponentiated and normalized, and the network is trained with the cross-entropy loss.

How to make a prediction? Forward pass

The input is pushed through the network one operation at a time:

x
W^{[0]} x
σ( W^{[0]} x )
W^{[1]} σ( W^{[0]} x )
σ( W^{[1]} σ( W^{[0]} x ) )
w^T σ( W^{[1]} σ( W^{[0]} x ) )
y_W(x) = σ( w^T σ( W^{[1]} σ( W^{[0]} x ) ) )

and, at training time, the loss is evaluated on the result:

l( y_W(x), t ) = - t ln y_W(x) - (1 - t) ln( 1 - y_W(x) )
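A minimal NumPy sketch of this forward pass followed by the loss (illustrative names; bias neurons between layers are omitted for brevity):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward_pass(W0, W1, w, x, t):
    a = W0 @ x                    # W[0] x
    b = sigmoid(a)                # sigma(W[0] x)
    c = W1 @ b                    # W[1] sigma(W[0] x)
    d = sigmoid(c)                # sigma(...)
    y = sigmoid(np.dot(w, d))     # y_W(x)
    loss = -t * np.log(y) - (1 - t) * np.log(1 - y)
    return y, loss

x = np.array([1.0, 0.4, 1.0])     # (1, x1, x2), bias included in the input
W0, W1, w = np.random.randn(3, 3), np.random.randn(3, 3), np.random.randn(3)
print(forward_pass(W0, W1, w, x, t=1.0))
```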

How to optimize the network?

0- Choose a regularization function:

W* = argmin_W  Σ_{n=1}^{N} l( y_W(x_n), t_n )  +  λ R(W),   with R(W) = ||W||_1 or ||W||_2

1- Choose a loss, for example the Hinge loss or the Cross entropy.
Do not forget to adjust the output layer to the loss you have chosen (cross entropy => Softmax).

2- Compute the gradient of the loss with respect to each parameter and launch a gradient descent algorithm to update the parameters:

W_{a,b}^{[c]} ← W_{a,b}^{[c]} - η ∂/∂W_{a,b}^{[c]} [ Σ_{n=1}^{N} l( y_W(x_n), t_n ) + λ R(W) ]

Backpropagation

Write the forward pass of y_W(x) = σ( w^T σ( W^{[1]} σ( W^{[0]} x ) ) ) as a chain of intermediate variables:

A = W^{[0]} x
B = σ(A)
C = W^{[1]} B
D = σ(C)
E = w^T D
y_W(x) = σ(E)
l( y_W(x), t ) = - t ln y_W(x) - (1 - t) ln( 1 - y_W(x) )

To update the parameters we need the gradient of Σ_{n=1}^{N} l( y_W(x_n), t_n ) with respect to every weight matrix, which is obtained with the chain rule.

Chain rule recap

f(u) = u²,   u(v) = 2v,   v(x) = 1/x

∂f/∂x = ∂f/∂u · ∂u/∂v · ∂v/∂x = ?

(here: ∂f/∂x = (2u) · (2) · (-1/x²) = -8/x³)

Applying the chain rule to the network, the gradient with respect to the first layer is

∂l( y_W(x), t ) / ∂W^{[0]} = ∂l/∂y_W(x) · ∂y_W(x)/∂E · ∂E/∂D · ∂D/∂C · ∂C/∂B · ∂B/∂A · ∂A/∂W^{[0]}

Back propagation: these partial derivatives are computed from the output (E) back toward the input (A), reusing at each step the factors already computed for the layers above.
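A minimal scalar sketch of this backward computation (illustrative names; a single-input, single-hidden-unit version of the network above, so every chain-rule factor is a plain number):

```python
import numpy as np

# y = sigmoid(w1 * sigmoid(w0 * x)), trained with the 2-class cross entropy.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward_backward(w0, w1, x, t):
    a = w0 * x;  b = sigmoid(a)                 # first layer
    e = w1 * b;  y = sigmoid(e)                 # output layer
    loss = -t * np.log(y) - (1 - t) * np.log(1 - y)
    dl_dy = -t / y + (1 - t) / (1 - y)          # dl/dy
    dl_de = dl_dy * y * (1 - y)                 # chain: dl/dy * dy/de
    dl_dw1 = dl_de * b                          # de/dw1 = b
    dl_db = dl_de * w1                          # de/db  = w1
    dl_da = dl_db * b * (1 - b)                 # db/da  = sigma'(a)
    dl_dw0 = dl_da * x                          # da/dw0 = x
    return loss, dl_dw0, dl_dw1

print(forward_backward(w0=0.5, w1=-1.0, x=2.0, t=1.0))
```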

Activation functions

Sigmoid:  σ(x) = 1 / (1 + e^{-x})
3 problems:
• Gradient saturates when the input is large
• Not zero-centered
• exp() is an expensive operation

Tanh(x)  [LeCun et al., 1991]
• Output is zero-centered
• Small gradient when the input is large

ReLU(x) = max(0, x)  (Rectified Linear Unit)  [Krizhevsky et al., 2012]
• Large gradient for x > 0
• Super fast
• Output not centered at zero
• No gradient when x < 0

Leaky ReLU(x) = max(0.01·x, x)  [Maas et al., 2013][He et al., 2015]
• No gradient saturation
• Super fast
• 0.01 is a hyperparameter

Parametric ReLU(x) = max(α·x, x)  [Maas et al., 2013][He et al., 2015]
• No gradient saturation
• Super fast
• α is learned with back prop

In practice:
• By default, people use ReLU.
• Try Leaky ReLU / PReLU / ELU.
• Try tanh, but it might be sub-optimal.
• Do not use the sigmoid except at the output of a 2-class net.
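Minimal NumPy sketches of these activations (illustrative names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):          # alpha = 0.01 is the Leaky ReLU hyperparameter
    return np.maximum(alpha * x, x)     # with a learned alpha this becomes PReLU

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x), np.tanh(x), relu(x), leaky_relu(x))
```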

How to classify an image?
(figures: https://ml4a.github.io/ml4a/neural_networks/)

Feed every pixel (plus a bias) into a fully connected layer:

• Small image: many parameters (7,850 in layer 1)
• 256×256 image (65,536 pixels): too many parameters (655,370 in layer 1)
• 256×256×256 input (16M values): waaay too many parameters (160M in layer 1)

Full connections are too many:
a 150-D input vector with 150 neurons in layer 1 => 22,500 parameters!!

No full connection (each neuron connected only to 3 neighbouring inputs):
a 150-D input vector with 150 neurons in layer 1 => 450 parameters!!

Share weights (every neuron of the layer reuses the same 3 weights, e.g. w_{1,0}, w_{1,1}, w_{1,2}):
a 150-D input vector with 150 neurons in layer 1 => 3 parameters!!

[Diagrams: locally connected layers; with weight sharing, the same 3 weights slide along the input, and a second shared set (w_{2,0}, w_{2,1}, w_{2,2}) gives a second set of outputs / feature map.]

1- Learning convolution filters!   2- Small number of parameters = we can make it deep!

[Figure: a 3-tap filter (e.g. weights 0.1, 0.2, 0.3) slides along the input signal; at each position the dot product between the filter and the 3 values underneath produces one output value (0.25, 0.45, 0.50, … on the slide). The outputs form a feature map.]

Convolution!
Each neuron of layer 1 is connected to 3×3 pixels, so layer 1 has 9 parameters!!

Feature map / convolution operation: F = x * W^{[0]}, the input convolved with the learned filter; one feature map per filter (x * W_0^{[0]}, x * W_1^{[0]}, …, x * W_4^{[0]}).

A layer with 5 filters is a 5-feature-map convolution layer; with K filters, a K-feature-map convolution layer.

Conv layer 1 → POOLING LAYER
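A minimal 1-D NumPy sketch of this sliding-filter idea (the signal and filter values are illustrative): the same 3 shared weights are applied at every position, producing one feature-map value per position.

```python
import numpy as np

def conv1d(x, w):
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])   # feature map

signal = np.array([2.3, 1.7, 4.0, 2.8, 5.7, 4.4])
filt = np.array([0.1, 0.2, 0.3])          # only 3 parameters, reused everywhere
print(conv1d(signal, filt))               # one output value per filter position
```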

Pooling layer

Goals:
• Reduce the spatial resolution of the feature maps
• Lower memory and computation requirements
• Provide partial invariance to position, scale and rotation

[Figures: pooling examples (stride = 1). Images: https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/]
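A minimal 2×2 max-pooling sketch in NumPy (illustrative, not the slides' exact configuration): each 2×2 block of the feature map is replaced by its maximum, halving the spatial resolution.

```python
import numpy as np

def max_pool_2x2(fmap):
    h, w = fmap.shape
    fmap = fmap[:h - h % 2, :w - w % 2]                        # crop to an even size
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3)) # one max per 2x2 block

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))    # 2x2 output
```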

2-class CNN:  Conv layer 1 → Pool layer 1 → Conv layer 2 → Pool layer 2 → (…) → Fully connected layers → y_W(x),
trained with  l( y_W(x), t ) = - t ln y_W(x) - (1 - t) ln( 1 - y_W(x) ).

K-class CNN:  Conv layer 1 → Pool layer 1 → Conv layer 2 → Pool layer 2 → (…) → Fully connected layers → SOFTMAX → ( y_{w,0}(x), y_{w,1}(x), y_{w,2}(x), … ),
trained with the K-class cross entropy.

Nice example from the literature:
S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, "A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks," arXiv:1806.07589, 2018.

Learn image-based characteristics
http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf

Batch processing

One sample at a time: for x_a = (0.4, 1.0) and the weights w of the earlier logistic-regression example, w^T x_a = -1.94 and y_w(x_a) = σ(-1.94) ≈ 0.125.

Several samples can be stacked into a single matrix and processed with one matrix product. With a second sample x_b = (2.1, 3.0), the same weights give both outputs at once:

y_w = ( 0.125, 0.99 )

Mini-batch processing

The same idea with a mini-batch of 4 samples gives 4 outputs in one pass (0.89, 0.2, 0.125, 0.99 on the slide), and a mini-batch of 4 images gives 4 predictions (e.g. Horse, Dog, Truck, …).
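A minimal mini-batch sketch in NumPy (illustrative names; the second sample and the weights are example values): stacking samples as the rows of a matrix X lets one matrix product produce all predictions at once.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict_batch(w, X):
    X_prime = np.hstack([np.ones((len(X), 1)), X])   # one bias column for the whole batch
    return sigmoid(X_prime @ w)                       # one output per sample

X = np.array([[0.4, 1.0],
              [2.1, 3.0]])                            # mini-batch of 2 samples
w = np.array([0.5, -1.2, 0.8])                        # arbitrary example weights
print(predict_batch(w, X))                            # 2 predictions in one pass
```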

Classical applications of ConvNets

Classification
S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, "A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks," arXiv:1806.07589, 2018.

Image segmentation
Tran, P. V., "A fully convolutional neural network for cardiac segmentation in short-axis MRI," arXiv:1604.00494, 2016.
Fang Liu, Zhaoye Zhou, et al., "Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging," Magnetic Resonance in Medicine, 2018. DOI:10.1002/mrm.26841
Havaei M., Davy A., Warde-Farley D., Biard A., Courville A., Bengio Y., Pal C., Jodoin P-M., Larochelle H., "Brain Tumor Segmentation with Deep Neural Networks," Medical Image Analysis, Vol. 35, 18-31, 2017.

Localization
S. Banerjee, S. Mitra, A. Sharma, and B. U. Shankar, "A CADe System for Gliomas in Brain MRI using Convolutional Neural Networks," arXiv:1806.07589, 2018.

Conclusion

• Linear classification (1 neuron network)

• Logistic regression

• Multilayer perceptron

• Conv Nets

• Many buzz words

– Softmax

– Loss

– Batch

– Gradient descent

– Etc.
