Tips for Deep Learning - 國立臺灣大學


Transcript of Tips for Deep Learning - 國立臺灣大學

Page 1: Tips for Deep Learning - 國立臺灣大學

Tips for Deep Learning

Page 2: Tips for Deep Learning - 國立臺灣大學

Recipe of Deep Learning

Step 1: define a set of functions (the neural network)
Step 2: goodness of function
Step 3: pick the best function

Good results on training data?  NO → go back and modify the three steps.  YES → continue.
Good results on testing data?   NO → overfitting!  YES → done.

Page 3: Tips for Deep Learning - 國立臺灣大學

Do not always blame Overfitting

Deep Residual Learning for Image Recognition

http://arxiv.org/abs/1512.03385

(Figure from the paper: error curves on testing data and on training data. The deeper network is worse even on the training data, so the problem is not overfitting; the network is simply not well trained.)

Page 4: Tips for Deep Learning - 國立臺灣大學

Recipe of Deep Learning

(Flowchart: check for good results on training data first, then for good results on testing data.)

Different approaches are needed for different problems.
e.g. dropout is for getting good results on testing data.

Page 5: Tips for Deep Learning - 國立臺灣大學

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 6: Tips for Deep Learning - 國立臺灣大學

Hard to get the power of Deep …

Deeper usually does not imply better.
(Figure: results on training data for networks of increasing depth; the deeper networks are not better even here.)

Page 7: Tips for Deep Learning - 國立臺灣大學

Vanishing Gradient Problem

(Figure: a deep network with inputs x1, x2, …, xN and outputs y1, y2, …, yM.)

Layers near the input: smaller gradients → learn very slowly → still almost random.
Layers near the output: larger gradients → learn very fast → already converged, based on nearly random lower layers!?

Page 8: Tips for Deep Learning - 國立臺灣大學

Vanishing Gradient Problem

(Figure: the same deep network with inputs x1, …, xN, outputs y1, …, yM, targets ŷ1, …, ŷM, and loss l.)

Intuitive way to compute the derivatives:

∂l/∂w ≈ Δl/Δw

Add a small +Δw to a weight near the input and observe the change +Δl in the loss. Each sigmoid layer squashes the change (large change at the input, small change at the output), so after many layers Δl is tiny, i.e. smaller gradients for the weights near the input.

Page 9: Tips for Deep Learning - 國立臺灣大學

ReLU

• Rectified Linear Unit (ReLU)

Reason:

1. Fast to compute

2. Biological reason

3. Infinite sigmoid with different biases

4. Handles the vanishing gradient problem

(Figure: the ReLU activation, a = z for z > 0 and a = 0 for z ≤ 0, compared with the sigmoid σ(z).)

[Xavier Glorot, AISTATS’11][Andrew L. Maas, ICML’13][Kaiming He, arXiv’15]
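The activation itself takes only a couple of lines. A minimal NumPy sketch (not part of the slides; the function names are ours):

```python
import numpy as np

def relu(z):
    # ReLU: a = z when z > 0, a = 0 otherwise
    return np.maximum(0.0, z)

def sigmoid(z):
    # sigmoid, shown for comparison: squashes any z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))     # negative inputs become 0, positive inputs pass through
print(sigmoid(z))  # all values squashed into (0, 1)
```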

Page 10: Tips for Deep Learning - 國立臺灣大學

ReLU

(Figure: a network with inputs x1, x2 and outputs y1, y2. Each ReLU neuron operates either in the linear region, a = z, or outputs a = 0; several neurons in the figure output 0.)

Page 11: Tips for Deep Learning - 國立臺灣大學

ReLU

(Figure: removing the neurons whose output is 0 leaves a thinner linear network.)

A thinner linear network: the remaining neurons all operate in the region a = z, so they do not make the gradients smaller.

Page 12: Tips for Deep Learning - 國立臺灣大學

ReLU - variants

Leaky ReLU:      a = z for z > 0,  a = 0.01z otherwise.
Parametric ReLU: a = z for z > 0,  a = αz otherwise, where α is also learned by gradient descent.
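A small NumPy sketch of the two variants (illustrative only; in Parametric ReLU the slope α would be a trainable parameter, here it is simply passed in):

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    # Leaky ReLU: a = z for z > 0, a = 0.01*z otherwise
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # Parametric ReLU: same shape, but alpha is learned by gradient descent
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(leaky_relu(z))             # negative side scaled by 0.01
print(parametric_relu(z, 0.25))  # negative side scaled by alpha
```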

Page 13: Tips for Deep Learning - 國立臺灣大學

Maxout

• Learnable activation function [Ian J. Goodfellow, ICML’13]

(Figure: inputs x1, x2. The pre-activations are grouped and each group is replaced by its max, e.g. max(5, 7) = 7 and max(−1, 1) = 1 in the first layer, then max(1, 2) = 2 and max(4, 3) = 4 in the second. Each max unit plays the role of a neuron.)

ReLU is a special case of Maxout.
You can have more than 2 elements in a group.

Page 14: Tips for Deep Learning - 國立臺灣大學

Maxout

ReLU:   z = wx + b,  a = max(z, 0).
Maxout: z1 = wx + b,  z2 = 0 (the second element uses weight 0 and bias 0),  a = max(z1, z2).

With this choice the maxout unit computes exactly the same function as the ReLU neuron, so ReLU is a special case of Maxout.

Page 15: Tips for Deep Learning - 國立臺灣大學

Maxout

ReLU:   z = wx + b,  a = max(z, 0).
Maxout: z1 = wx + b,  z2 = w′x + b′,  a = max(z1, z2).

With learnable w′ and b′, the activation function itself is learnable: more than ReLU.

Page 16: Tips for Deep Learning - 國立臺灣大學

Maxout

• Learnable activation function [Ian J. Goodfellow, ICML’13]

• The activation function in a maxout network can be any piecewise linear convex function

• The number of pieces depends on how many elements are in a group

(Figure: 2 elements in a group give an activation with 2 linear pieces; 3 elements give 3 pieces.)
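A rough sketch of a maxout layer in NumPy (the shapes and names are our own assumptions, not from the slides): each group computes several linear pre-activations and outputs only the largest one.

```python
import numpy as np

def maxout(x, W, b):
    # W: (groups, elements_per_group, input_dim), b: (groups, elements_per_group)
    # compute every z in every group, then keep only the max of each group
    z = np.einsum('gei,i->ge', W, x) + b
    return z.max(axis=1)

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0])
W = rng.standard_normal((3, 2, 2))   # 3 groups, 2 elements per group
b = rng.standard_normal((3, 2))
print(maxout(x, W, b))               # 3 outputs, one per group
```

With more elements per group, the learned activation can have more linear pieces.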

Page 17: Tips for Deep Learning - 國立臺灣大學

Maxout - Training

• Given a training example x, we know which z would be the max

(Figure: inputs x1, x2; the first layer computes z_1^1, z_2^1, z_3^1, z_4^1 and takes a_1^1 = max(z_1^1, z_2^1), a_2^1 = max(z_3^1, z_4^1); the second layer computes z_1^2, z_2^2, z_3^2, z_4^2 and takes a_1^2 = max(z_1^2, z_2^2), a_2^2 = max(z_3^2, z_4^2).)

Page 18: Tips for Deep Learning - 國立臺灣大學

Maxout - Training

• Given a training example x, we know which z would be the max

• Train this thin and linear network

(Figure: keeping only the max element of each group leaves a thin, linear network for this example.)

Different examples give different thin and linear networks, so over the whole training set every weight still gets trained.

Page 19: Tips for Deep Learning - 國立臺灣大學

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 20: Tips for Deep Learning - 國立臺灣大學

Review

(Figure: an error surface over parameters w1 and w2; one direction needs a larger learning rate, the other a smaller one.)

Adagrad:

w^{t+1} ← w^t − η / √( Σ_{i=0}^{t} (g^i)² ) · g^t

Use the first derivative to estimate the second derivative.
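A one-vector sketch of the Adagrad update above (the small eps is our own addition to avoid division by zero):

```python
import numpy as np

def adagrad_step(w, g, sum_sq, eta=0.1, eps=1e-8):
    # divide the step by the root of the sum of all squared gradients seen so far
    sum_sq = sum_sq + g ** 2
    w = w - eta / (np.sqrt(sum_sq) + eps) * g
    return w, sum_sq

w, sum_sq = np.array([1.0, 1.0]), np.zeros(2)
for g in [np.array([0.5, -0.2]), np.array([0.3, -0.1])]:
    w, sum_sq = adagrad_step(w, g, sum_sq)
print(w)
```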

Page 21: Tips for Deep Learning - 國立臺灣大學

RMSProp

(Figure: an error surface over w1 and w2.)

The error surface can be very complex when training a neural network: even along the same direction, the learning rate may need to be larger in one region and smaller in another.

Page 22: Tips for Deep Learning - 國立臺灣大學

RMSProp

w¹ ← w⁰ − (η / σ⁰) g⁰     σ⁰ = g⁰
w² ← w¹ − (η / σ¹) g¹     σ¹ = √( α(σ⁰)² + (1 − α)(g¹)² )
w³ ← w² − (η / σ²) g²     σ² = √( α(σ¹)² + (1 − α)(g²)² )
……
w^{t+1} ← w^t − (η / σ^t) g^t     σ^t = √( α(σ^{t−1})² + (1 − α)(g^t)² )

Root mean square of the gradients, with previous gradients being decayed.
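The same recursion written as a small NumPy sketch (variable names are ours; σ⁰ is initialized with the magnitude of the first gradient, as on the slide):

```python
import numpy as np

def rmsprop_step(w, g, sigma, eta=0.001, alpha=0.9):
    # sigma_t = sqrt(alpha * sigma_{t-1}^2 + (1 - alpha) * g_t^2)
    sigma = np.sqrt(alpha * sigma ** 2 + (1 - alpha) * g ** 2)
    return w - eta / sigma * g, sigma

w = np.array([1.0, -1.0])
g0 = np.array([0.4, 0.2])
sigma = np.abs(g0)                 # sigma^0 = g^0 (first step)
w = w - 0.001 / sigma * g0         # first update
w, sigma = rmsprop_step(w, np.array([0.1, 0.3]), sigma)
print(w, sigma)
```

Because every step decays the previous σ by α, a recent gradient counts more than older ones.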

Page 23: Tips for Deep Learning - 國立臺灣大學

Hard to find the optimal network parameters

(Figure: total loss versus the value of a network parameter w.)

• Very slow at a plateau: ∂L/∂w ≈ 0
• Stuck at a saddle point: ∂L/∂w = 0
• Stuck at a local minimum: ∂L/∂w = 0

Page 24: Tips for Deep Learning - 國立臺灣大學

In the physical world ……

• Momentum

How about putting this phenomenon into gradient descent?

Page 25: Tips for Deep Learning - 國立臺灣大學

Review: Vanilla Gradient Descent

Start at position θ⁰
Compute gradient at θ⁰; move to θ¹ = θ⁰ − η∇L(θ⁰)
Compute gradient at θ¹; move to θ² = θ¹ − η∇L(θ¹)
……
Stop when ∇L(θᵗ) ≈ 0

(Figure: the parameter moves θ⁰ → θ¹ → θ² → θ³, each movement in the direction opposite to the gradient ∇L(θⁱ).)

Page 26: Tips for Deep Learning - 國立臺灣大學

Momentum

Start at point θ⁰; movement v⁰ = 0
Compute gradient at θ⁰; movement v¹ = λv⁰ − η∇L(θ⁰); move to θ¹ = θ⁰ + v¹
Compute gradient at θ¹; movement v² = λv¹ − η∇L(θ¹); move to θ² = θ¹ + v²
……

Movement: movement of the last step minus the gradient at present.
The movement is not based on the gradient alone, but also on the previous movement.

(Figure: at each θⁱ the real movement combines the negative gradient ∇L(θⁱ) with the movement of the last step.)

Page 27: Tips for Deep Learning - 國立臺灣大學

Momentum

vⁱ is actually the weighted sum of all the previous gradients ∇L(θ⁰), ∇L(θ¹), …, ∇L(θ^{i−1}):

v⁰ = 0
v¹ = −η∇L(θ⁰)
v² = λv¹ − η∇L(θ¹) = −λη∇L(θ⁰) − η∇L(θ¹)
……

(The update steps are the same as on the previous slide: v^{i+1} = λvⁱ − η∇L(θⁱ), θ^{i+1} = θⁱ + v^{i+1}. The movement is not based on the gradient alone, but also on the previous movement.)
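A compact sketch of gradient descent with momentum on a toy loss L(θ) = θ² (the learning rate and λ are arbitrary illustrative values):

```python
import numpy as np

def momentum_step(theta, grad, v, eta=0.1, lam=0.9):
    # movement = lambda * (last movement) - eta * (current gradient)
    v = lam * v - eta * grad
    return theta + v, v

theta, v = np.array([2.0]), np.zeros(1)
for _ in range(10):
    grad = 2 * theta               # gradient of L(theta) = theta^2
    theta, v = momentum_step(theta, grad, v)
print(theta)                       # oscillates but heads toward 0
```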

Page 28: Tips for Deep Learning - 國立臺灣大學

Momentum

Movement = negative of ∂L/∂w + momentum

(Figure: on the cost curve, at a local minimum or plateau where ∂L/∂w = 0 the negative gradient alone would stop, but the momentum term can keep the real movement going.)

Still no guarantee of reaching the global minimum, but it gives some hope ……

Page 29: Tips for Deep Learning - 國立臺灣大學

Adam = RMSProp + Momentum

(The Adam algorithm combines a momentum-like first moment with an RMSProp-like second moment.)
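A minimal sketch of the Adam update combining both ideas (the hyperparameter values are the usual defaults from the Adam paper, not from the slide):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment: for momentum
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: for RMSProp
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)   # toy loss L = theta^2
print(theta)
```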

Page 30: Tips for Deep Learning - 國立臺灣大學

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 31: Tips for Deep Learning - 國立臺灣大學

Early Stopping

(Figure: total loss versus epochs. The loss on the training set keeps decreasing, but the loss on the testing set eventually starts to rise; ideally we should stop at the point where the testing loss is lowest. Since the real testing set is not available during training, a validation set is used to decide when to stop.)

Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
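Following the FAQ linked above, early stopping in Keras is typically done with the EarlyStopping callback. A small runnable sketch with toy data (the model and hyperparameters are purely illustrative):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# toy data just to make the example runnable
x_train = np.random.rand(200, 20)
y_train = np.random.randint(0, 2, size=(200, 1))

model = Sequential([Dense(10, activation='relu', input_dim=20),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# stop once the validation loss has not improved for 2 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=2)
model.fit(x_train, y_train,
          validation_split=0.2,        # part of the data plays the role of the validation set
          epochs=100,
          callbacks=[early_stopping])
```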

Page 32: Tips for Deep Learning - 國立臺灣大學

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 33: Tips for Deep Learning - 國立臺灣大學

Regularization

• New loss function to be minimized

• Find a set of weights that not only minimizes the original loss but is also close to zero

L2 regularization:   ‖θ‖₂ = (w₁)² + (w₂)² + ⋯

L′(θ) = L(θ) + λ · ½‖θ‖₂

L(θ) is the original loss (e.g. square error, cross entropy, …); θ = {w₁, w₂, …} is the set of weights (usually not considering the biases); λ · ½‖θ‖₂ is the regularization term.

Page 34: Tips for Deep Learning - 國立臺灣大學

Regularization

• New loss function to be minimized:

L2 regularization:   L′(θ) = L(θ) + λ · ½‖θ‖₂,   ‖θ‖₂ = (w₁)² + (w₂)² + ⋯

Gradient:   ∂L′/∂w = ∂L/∂w + λw

Update:
w^{t+1} ← w^t − η ∂L′/∂w = w^t − η( ∂L/∂w + λw^t ) = (1 − ηλ) w^t − η ∂L/∂w

Since 1 − ηλ is slightly less than 1, every update shrinks the weight closer to zero; this is called Weight Decay.

Page 35: Tips for Deep Learning - 國立臺灣大學

Regularization

• New loss function to be minimized:

L1 regularization:   L′(θ) = L(θ) + λ‖θ‖₁,   ‖θ‖₁ = |w₁| + |w₂| + ⋯

Gradient:   ∂L′/∂w = ∂L/∂w + λ sgn(w)

Update:
w^{t+1} ← w^t − η ∂L′/∂w = w^t − η( ∂L/∂w + λ sgn(w^t) ) = w^t − η ∂L/∂w − ηλ sgn(w^t)

L1 always deletes a fixed amount ηλ from the magnitude of w^t, whereas L2 multiplies w^t by (1 − ηλ).
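A tiny sketch contrasting the two update rules on a single weight (λ and η are arbitrary illustrative values; grad stands for ∂L/∂w):

```python
import numpy as np

eta, lam = 0.1, 0.01

def l2_update(w, grad):
    # L2 / weight decay: multiply w by (1 - eta*lambda), then take the usual step
    return (1 - eta * lam) * w - eta * grad

def l1_update(w, grad):
    # L1: always subtract a fixed amount eta*lambda from the magnitude of w
    return w - eta * grad - eta * lam * np.sign(w)

w = 2.0
print(l2_update(w, grad=0.5), l1_update(w, grad=0.5))
```

Under L2 the shrinkage is proportional to the weight, while under L1 the same fixed amount is removed regardless of the weight's size.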

Page 36: Tips for Deep Learning - 國立臺灣大學

Regularization - Weight Decay

• Our brain prunes out the useless links between neurons.

Doing the same thing to the machine’s “brain” improves its performance.

Page 37: Tips for Deep Learning - 國立臺灣大學

Recipe of Deep Learning

Good results on training data?
• New activation function
• Adaptive learning rate

Good results on testing data?
• Early stopping
• Regularization
• Dropout

Page 38: Tips for Deep Learning - 國立臺灣大學

Dropout - Training:

➢ Each time before updating the parameters, each neuron has a p% chance to drop out.

Page 39: Tips for Deep Learning - 國立臺灣大學

Dropout - Training:

➢ Each time before updating the parameters, each neuron has a p% chance to drop out.

➢ Use the resulting new network for training. The structure of the network is changed: it becomes thinner!

For each mini-batch, we resample the dropout neurons.

Page 40: Tips for Deep Learning - 國立臺灣大學

Dropout - Testing:

➢ No dropout.

➢ If the dropout rate at training is p%, multiply all the weights by (1 − p)%.

➢ Assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
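A minimal sketch of the two phases (this follows the plain scheme described on the slide, with p = 0.5; the names are ours):

```python
import numpy as np

def dropout_train(a, p=0.5):
    # training: each neuron is dropped (output set to 0) with probability p
    mask = np.random.rand(*a.shape) >= p
    return a * mask

def scale_for_test(w, p=0.5):
    # testing: no dropout, but every weight is multiplied by (1 - p)
    return (1 - p) * w

a = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_train(a))                 # roughly half the activations zeroed out
print(scale_for_test(np.array([1.0])))  # a weight of 1 from training becomes 0.5 at test time
```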

Page 41: Tips for Deep Learning - 國立臺灣大學

Dropout - Intuitive Reason

Training: with dropout (like practicing with heavy weights tied to your legs).
Testing: no dropout (once the weights are removed, you become much stronger).

Page 42: Tips for Deep Learning - 國立臺灣大學

Dropout - Intuitive Reason

➢ When working in a team, if everyone expects their partners to do the work, nothing gets done in the end.

➢ However, if you know your partner will drop out, you will do better. (“My partner is going to slack off, so I have to work hard.”)

➢ When testing, no one actually drops out, so good results are obtained in the end.

Page 43: Tips for Deep Learning - 國立臺灣大學

Dropout - Intuitive Reason

• Why should the weights be multiplied by (1 − p)% (p% = dropout rate) when testing?

Assume the dropout rate is 50%. A neuron computes z from its inputs through weights w1, w2, w3, w4. During training, on average half of the inputs are dropped, so the expected value of the pre-activation is about z. At testing there is no dropout, so with the weights from training the pre-activation would be z′ ≈ 2z. Multiplying every weight by 0.5 (i.e. by 1 − p%) gives z′ ≈ z, matching the training condition.

Page 44: Tips for Deep Learning - 國立臺灣大學

Dropout is a kind of ensemble.

Ensemble: split the training set into Set 1, Set 2, Set 3, Set 4 and train a bunch of networks with different structures (Network 1, 2, 3, 4), one on each set.

Page 45: Tips for Deep Learning - 國立臺灣大學

Dropout is a kind of ensemble.

Ensemble at testing time: feed the testing data x into Network 1, 2, 3, 4 to get y1, y2, y3, y4, then average the outputs.

Page 46: Tips for Deep Learning - 國立臺灣大學

Dropout is a kind of ensemble.

Training of Dropout:

➢ Each mini-batch (minibatch 1, minibatch 2, minibatch 3, minibatch 4, ……) trains one of the dropped networks.

➢ Some parameters in these networks are shared.

With M neurons, there are 2^M possible networks.

Page 47: Tips for Deep Learning - 國立臺灣大學

Dropout is a kind of ensemble.

Testing of Dropout: ideally, feed the testing data x into every possible dropped network, obtain y1, y2, y3, ……, and average them. In practice there are far too many networks, so we instead use the full network with all the weights multiplied by (1 − p)%; its output turns out to be approximately equal to the ensemble average y. Why?

Page 48: Tips for Deep Learning - 國立臺灣大學

Testing of Dropout

Consider a single neuron z = w1x1 + w2x2 with a dropout rate of 50%. There are four possible dropout patterns over the two inputs:

z = w1x1 + w2x2
z = w2x2
z = w1x1
z = 0

Averaging the four outputs gives ½ w1x1 + ½ w2x2.

Using the full neuron with the weights multiplied by ½ gives exactly the same value:

z = ½ w1x1 + ½ w2x2
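The same calculation can be checked numerically for one linear neuron: averaging over all 2² dropout patterns gives exactly the output of the half-scaled weights (the numbers below are arbitrary):

```python
import numpy as np
from itertools import product

w = np.array([0.3, -0.7])
x = np.array([1.0, 2.0])

# average z over all 2^2 keep/drop patterns of the two inputs (dropout rate 50%)
outputs = [np.dot(w * np.array(mask), x) for mask in product([0, 1], repeat=2)]
ensemble_average = np.mean(outputs)

# single pass through the full neuron with the weights multiplied by 1/2
scaled = np.dot(0.5 * w, x)

print(ensemble_average, scaled)   # identical for a linear neuron
```

For networks with nonlinear activations the two are only approximately equal, which is why the previous slide writes ≈.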

Page 49: Tips for Deep Learning - 國立臺灣大學

Recipe of Deep Learning

Step 1: define a set of functions (the neural network)
Step 2: goodness of function
Step 3: pick the best function

Good results on training data?  NO → go back and modify the three steps.  YES → continue.
Good results on testing data?   NO → overfitting!  YES → done.

Page 50: Tips for Deep Learning - 國立臺灣大學

Try another task

http://top-breaking-news.com/

(Demo: the machine classifies news documents from the site above into categories such as politics (政治), sports (體育), and economics (經濟) / finance (財經), using features like whether “president” or “stock” appears in the document.)

Page 51: Tips for Deep Learning - 國立臺灣大學

Try another task

Page 52: Tips for Deep Learning - 國立臺灣大學

Live Demo