
Tips for Training Deep Network


Outline

• Training Strategy: Batch Normalization

• Activation Function: SELU

• Network Structure: Highway Network


Batch Normalization


Feature Scaling

Consider the $R$ training examples $x^1, x^2, \ldots, x^R$. For each dimension $i$, compute the mean $m_i$ and the standard deviation $\sigma_i$ over the whole training set, then rescale every example:

$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$

After scaling, the means of all dimensions are 0, and the variances are all 1.

In general, gradient descent converges much faster with feature scaling than without it.
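As a quick illustration (not from the slides), here is a minimal NumPy sketch of this standardization; the toy data and variable names are my own assumptions:

```python
import numpy as np

# X holds R training examples as rows; each column is one feature dimension i.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

m = X.mean(axis=0)            # per-dimension mean m_i
sigma = X.std(axis=0)         # per-dimension standard deviation sigma_i
X_scaled = (X - m) / sigma    # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))  # ~0 for every dimension
print(X_scaled.var(axis=0))   # ~1 for every dimension
```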


How about Hidden Layer?

[Figure: $x \rightarrow$ Layer 1 $\rightarrow a^1 \rightarrow$ Layer 2 $\rightarrow a^2 \rightarrow \cdots$. Feature scaling is applied to the input $x$; should it also be applied to the hidden-layer outputs $a^1, a^2, \ldots$?]

Difficulty: the statistics of the hidden-layer outputs change during training, because the parameters of the earlier layers keep being updated. This is the internal covariate shift problem. A smaller learning rate can be helpful, but training would then be slower. Batch normalization addresses this difficulty.


Batch

A batch of examples can be processed in parallel as one matrix operation. Stacking the inputs $x^1, x^2, x^3$ as columns,

$$\begin{bmatrix} z^1 & z^2 & z^3 \end{bmatrix} = W^1 \begin{bmatrix} x^1 & x^2 & x^3 \end{bmatrix}$$

Each $z^i$ then passes through the sigmoid to give $a^i$, which is fed to the next layer $W^2$, and so on.
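A minimal NumPy sketch of this batched forward step (the layer sizes and variable names are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(4, 3))          # first-layer weights: 3 inputs -> 4 units
X = rng.normal(size=(3, 3))           # batch of 3 examples, stacked as columns

Z = W1 @ X                            # [z^1 z^2 z^3] = W^1 [x^1 x^2 x^3]
A = 1.0 / (1.0 + np.exp(-Z))          # sigmoid, applied element-wise

print(Z.shape, A.shape)               # (4, 3): one column per example
```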


Batch normalization

For a batch of three examples, compute $z^i = W^1 x^i$ and then the batch statistics

$$\mu = \frac{1}{3}\sum_{i=1}^{3} z^i \qquad \sigma = \sqrt{\frac{1}{3}\sum_{i=1}^{3} \left(z^i - \mu\right)^2}$$

(computed element-wise over the dimensions of $z$). Note that $\mu$ and $\sigma$ depend on the $z^i$.

Note: batch normalization does not work well with a small batch, because the batch mean and standard deviation are then poor estimates of the statistics over the whole dataset.


Batch normalization

Each $z^i = W^1 x^i$ is normalized with the batch statistics,

$$\tilde{z}^i = \frac{z^i - \mu}{\sigma}$$

and then passed through the sigmoid to give $a^i$. Again, $\mu$ and $\sigma$ depend on the $z^i$.

How do we do backpropagation? Since $\mu$ and $\sigma$ are themselves computed from the $z^i$, the gradients must also flow through $\mu$ and $\sigma$; they cannot be treated as constants.


Batch normalization

After normalization, a learned, element-wise scale and shift is applied:

$$\tilde{z}^i = \frac{z^i - \mu}{\sigma} \qquad \hat{z}^i = \gamma \odot \tilde{z}^i + \beta$$

$\gamma$ and $\beta$ are network parameters learned together with the weights, while $\mu$ and $\sigma$ depend on the $z^i$.
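Below is a minimal NumPy sketch of this training-time forward pass (the batch-as-rows layout, the epsilon for numerical stability, and the function name are my own assumptions):

```python
import numpy as np

def batch_norm_train(Z, gamma, beta, eps=1e-5):
    """Z: (batch, features) pre-activations z^i for one layer."""
    mu = Z.mean(axis=0)                     # per-feature batch mean
    sigma = Z.std(axis=0)                   # per-feature batch standard deviation
    Z_tilde = (Z - mu) / (sigma + eps)      # z_tilde^i = (z^i - mu) / sigma
    Z_hat = gamma * Z_tilde + beta          # z_hat^i = gamma * z_tilde^i + beta
    return Z_hat, mu, sigma

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(3, 4))   # batch of 3 examples, 4 units
gamma, beta = np.ones(4), np.zeros(4)
Z_hat, mu, sigma = batch_norm_train(Z, gamma, beta)
print(Z_hat.mean(axis=0), Z_hat.std(axis=0))      # ~0 and ~1 with these gamma, beta
```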


Batch normalization

• At the testing stage:

$$z = W^1 x \qquad \tilde{z} = \frac{z - \mu}{\sigma} \qquad \hat{z} = \gamma \odot \tilde{z} + \beta$$

$\mu$ and $\sigma$ come from a batch; $\gamma$ and $\beta$ are network parameters. But we do not have a batch at the testing stage.

Ideal solution: compute $\mu$ and $\sigma$ using the whole training dataset.

Practical solution: keep a moving average of the $\mu$ and $\sigma$ of the batches seen during training (e.g. the $\mu$ recorded at update 1, update 100, update 300, …).
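As a rough sketch of this practical solution (the exponential-moving-average form, the momentum value, and the function names are assumptions on my part, not from the slides):

```python
import numpy as np

def update_running_stats(running_mu, running_sigma, mu, sigma, momentum=0.99):
    """Exponential moving average of the batch statistics, updated once per batch."""
    running_mu = momentum * running_mu + (1.0 - momentum) * mu
    running_sigma = momentum * running_sigma + (1.0 - momentum) * sigma
    return running_mu, running_sigma

def batch_norm_test(z, running_mu, running_sigma, gamma, beta, eps=1e-5):
    """At test time the stored running statistics replace the batch mu and sigma."""
    z_tilde = (z - running_mu) / (running_sigma + eps)
    return gamma * z_tilde + beta
```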


Batch normalization - Benefit

• BN reduces training time and makes very deep networks trainable.

• Because there is less internal covariate shift, we can use larger learning rates.

• Less exploding/vanishing gradients

• Especially effective for sigmoid, tanh, etc.

• Learning is less affected by initialization.

• BN reduces the demand for regularization.

Why are exploding/vanishing gradients reduced? If the weights $W^1$ are multiplied by a factor $k$, then $z^i$, $\mu$, and $\sigma$ are all multiplied by $k$ as well, so

$$\tilde{z}^i = \frac{k z^i - k\mu}{k\sigma} = \frac{z^i - \mu}{\sigma}$$

is kept unchanged: the output of a batch-normalized layer does not depend on the scale of its weights.
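A quick numerical check of this keep-the-scale argument (a sketch; the shapes and the choice $k = 10$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))                 # assumed layer: 3 inputs -> 4 units
X = rng.normal(size=(5, 3))                  # batch of 5 examples as rows
k = 10.0                                     # scale the weights by k

def normalize(Z):
    # (z - mu) / sigma per feature; eps omitted to keep the comparison exact
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

Z_tilde = normalize(X @ W1.T)                # z_tilde with the original weights W^1
Z_tilde_scaled = normalize(X @ (k * W1).T)   # z_tilde with the weights k * W^1

print(np.allclose(Z_tilde, Z_tilde_scaled))  # True: z_tilde does not depend on k
```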


To learn more ……

• Batch Renormalization

• Layer Normalization

• Instance Normalization

• Weight Normalization

• Spectral Normalization


Activation Function: SELU


ReLU

• Rectified Linear Unit (ReLU)

Reasons for using it:

1. Fast to compute

2. Biological reason

3. Infinite sigmoid with different biases (ReLU can be seen as the sum of infinitely many sigmoids with different biases)

4. Handles the vanishing gradient problem

[Figure: the ReLU activation, $a = z$ for $z > 0$ and $a = 0$ for $z \le 0$, alongside the sigmoid $\sigma(z)$.]


ReLU - variant

Leaky ReLU: $a = z$ for $z > 0$, $a = 0.01z$ for $z \le 0$.

Parametric ReLU: $a = z$ for $z > 0$, $a = \alpha z$ for $z \le 0$, where $\alpha$ is also learned by gradient descent.
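A minimal NumPy sketch of these variants (forward pass only; in a real network the $\alpha$ of Parametric ReLU would be a trainable parameter updated by gradient descent):

```python
import numpy as np

def relu(z):
    return np.where(z > 0, z, 0.0)

def leaky_relu(z, slope=0.01):
    # a = z for z > 0, a = 0.01 z otherwise
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # same shape as Leaky ReLU, but alpha is a learned parameter
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), parametric_relu(z, alpha=0.25))
```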


ReLU - variant

Exponential Linear Unit (ELU): $a = z$ for $z > 0$, $a = \alpha(e^z - 1)$ for $z \le 0$.

Scaled ELU (SELU): the ELU multiplied by $\lambda$,

$$a = \lambda z \;\;(z > 0) \qquad a = \lambda\alpha(e^z - 1) \;\;(z \le 0)$$

with the fixed constants

$\alpha = 1.6732632423543772848170429916717$

$\lambda = 1.0507009873554804934193349852946$

https://github.com/bioinf-jku/SNNs
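A minimal NumPy implementation of SELU using these constants (a sketch; the official implementations in the SNNs repository linked above may differ in details):

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(z):
    # a = lambda * z                  for z > 0
    # a = lambda * alpha * (e^z - 1)  for z <= 0
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(selu(z))
```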


SELU

$a = \lambda z$ for $z > 0$ and $a = \lambda\alpha(e^z - 1)$ for $z \le 0$, with $\alpha = 1.673263242\ldots$ and $\lambda = 1.050700987\ldots$

Three properties:

• Positive and negative values: the whole ReLU family has this property except the original ReLU.

• Saturation region: ELU also has this property.

• Slope larger than 1 (because $\lambda > 1$): only SELU has this property.


SELU

Consider one neuron: $z = a_1 w_1 + \cdots + a_k w_k + \cdots + a_K w_K$, with output $a = f(z)$.

Assume the inputs $a_k$ are i.i.d. random variables with mean $\mu = 0$ and variance $\sigma^2 = 1$ (they do not have to be Gaussian). Then

$$\mu_z = E[z] = \sum_{k=1}^{K} E[a_k]\, w_k = \mu \sum_{k=1}^{K} w_k = \mu \cdot K\mu_w = 0$$

since $\mu = 0$.


SELU

For the same neuron $z = a_1 w_1 + \cdots + a_K w_K$ with i.i.d. inputs of mean $\mu = 0$ and variance $\sigma^2 = 1$, and $\mu_z = 0$ from the previous slide:

$$\sigma_z^2 = E[(z - \mu_z)^2] = E[z^2] = E[(a_1 w_1 + a_2 w_2 + \cdots)^2]$$

The cross terms vanish because the inputs are independent with zero mean, $E[a_i a_j w_i w_j] = w_i w_j E[a_i] E[a_j] = 0$, and the square terms give $E[a_k^2 w_k^2] = w_k^2 E[a_k^2] = w_k^2 \sigma^2$. Hence

$$\sigma_z^2 = \sum_{k=1}^{K} w_k^2 \sigma^2 = \sigma^2 \cdot K\sigma_w^2 = 1$$

which is exactly the target, provided the weights satisfy $\mu_w = 0$ and $K\sigma_w^2 = 1$ (e.g. Gaussian-initialized with mean 0 and variance $1/K$).
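A small simulation of this derivation (a sketch under the stated assumptions; the values of K and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 1000, 100000                              # K inputs per neuron, N samples of z

w = rng.normal(0.0, np.sqrt(1.0 / K), size=K)    # mu_w = 0, K * sigma_w^2 ~= 1
a = rng.normal(0.0, 1.0, size=(N, K))            # i.i.d. inputs with mean 0, variance 1

z = a @ w                                        # z = sum_k a_k w_k, one value per sample
print(z.mean(), z.var())                         # close to 0 and 1, as derived
```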


Demo


Source of joke: https://zhuanlan.zhihu.com/p/27336839

A 93-page proof.

SELU is actually more general.


Demo


Highway Network & Grid LSTM


Feedforward v.s. Recurrent

Feedforward: $x \rightarrow f_1 \rightarrow a^1 \rightarrow f_2 \rightarrow a^2 \rightarrow f_3 \rightarrow a^3 \rightarrow f_4 \rightarrow y$, where $t$ is the layer index:

$$a^t = f_t(a^{t-1}) = \sigma(W^t a^{t-1} + b^t)$$

Recurrent: the same $f$ is applied at every step, where $t$ is the time step:

$$h^t = f(h^{t-1}, x^t) = \sigma(W^h h^{t-1} + W^i x^t + b^i)$$

Differences:

1. A feedforward network does not have an input at each step.

2. A feedforward network has different parameters for each layer.

Idea: apply the gated structure of recurrent networks in a feedforward network.


GRU → Highway Network

[Figure: a GRU cell. From $h^{t-1}$ and $x^t$, the reset gate $r$ and the update gate $z$ are computed; the candidate $h'$ is mixed with $h^{t-1}$ through $z$ and $1 - z$ to give $h^t$, and the output is $y^t$.]

To obtain a highway layer from a GRU:

• No input $x^t$ at each step

• No output $y^t$ at each step

• No reset gate

• $a^{t-1}$ is the output of the $(t-1)$-th layer, and $a^t$ is the output of the $t$-th layer


Highway Network

• Residual Network: Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385

• Highway Network: Training Very Deep Networks, https://arxiv.org/pdf/1507.06228v2.pdf

A highway layer combines a transform of its input with a gated copy of the input itself:

$$h' = \sigma(W a^{t-1}) \qquad z = \sigma(W' a^{t-1}) \qquad a^t = z \odot a^{t-1} + (1 - z) \odot h'$$

The gate controller $z$ decides, element-wise, how much of $a^{t-1}$ is copied straight through and how much of the transformed $h'$ is used. A residual network instead simply adds the copied input to the transform: $a^t = a^{t-1} + h'$.
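A minimal NumPy sketch of one highway layer as written above (bias-free and with a sigmoid on both branches, following the slide's simplified formulas; shapes and names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(a_prev, W, W_gate):
    """One highway layer: a^t = z * a^(t-1) + (1 - z) * h'."""
    h = sigmoid(W @ a_prev)            # h' = sigma(W a^(t-1)), the transform branch
    z = sigmoid(W_gate @ a_prev)       # z  = sigma(W' a^(t-1)), the gate controller
    return z * a_prev + (1.0 - z) * h  # gated mix of the copy and the transform

rng = np.random.default_rng(0)
d = 4                                  # layer width (input and output dims match)
a0 = rng.normal(size=d)
W, W_gate = rng.normal(size=(d, d)), rng.normal(size=(d, d))
a1 = highway_layer(a0, W, W_gate)
print(a1)
```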


[Figure: three networks of different depths, each from an input layer to an output layer.]

Highway Network automatically determines the layers needed!


Highway Network


Grid LSTM

[Figure: a standard LSTM block maps the time-direction memory and hidden state $(c, h)$ plus the input $x$ to $(c', h^t)$ and the output $y$. A Grid LSTM block additionally carries a depth-direction pair $(a, b)$, mapping $(c, h, a, b)$ to $(c', h', a', b')$.]

Memory for both time and depth.


[Figure: Grid LSTM blocks unrolled over time steps $t, t+1$ and layers $l, l+1$. The block at layer $l$ takes $(h_{t-1}, c_{t-1})$ along time and $(a^{l-1}, b^{l-1})$ along depth, and outputs $(h_t, c_t)$ and $(a^l, b^l)$; the block at layer $l+1$ (GridLSTM') takes $(a^l, b^l)$ from below together with its own $(h_{t-1}, c_{t-1})$, and produces $(a^{l+1}, b^{l+1})$ and $(h_t, c_t)$ for that layer.]


Grid LSTM

[Figure: inside the Grid LSTM block. The hidden states $h$ and $a$ are used to compute the candidate $z$ and the gates $z^i$, $z^f$, $z^o$; the memory $c$ is updated with element-wise products and a $\tanh$ to give $c'$ and the new hidden state $h'$, and the depth-direction states are updated to $a'$ and $b'$ in the same way.]


e’ f’

3D Grid LSTM

h

c

h’

c’

ba

e f

b’a’


3D Grid LSTM

• Images are composed of pixels

[Figure: an example with 3 × 3 images.]