Tips for Training Deep Network
• Training Strategy: Batch Normalization
• Activation Function: SELU
• Network Structure: Highway Network
Batch Normalization
Feature Scaling
Given training examples $x^1, x^2, \ldots, x^r, \ldots, x^R$, for each dimension $i$ compute the mean $m_i$ and the standard deviation $\sigma_i$ over the dataset, then normalize:

$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$

After scaling, the means of all dimensions are 0 and the variances are all 1.
In general, gradient descent converges much faster with feature scaling than without it.
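A minimal NumPy sketch of this scaling step (the function name and the epsilon guard against constant dimensions are my own additions, not from the slides):

```python
import numpy as np

def feature_scaling(X, eps=1e-8):
    """Normalize each dimension (column) of X to zero mean and unit variance.

    X has shape (R, d): R training examples x^1 ... x^R with d dimensions.
    """
    m = X.mean(axis=0)        # mean m_i of each dimension
    sigma = X.std(axis=0)     # standard deviation sigma_i of each dimension
    return (X - m) / (sigma + eps)
```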
What about the Hidden Layers?
(Diagram: $x \to$ Layer 1 $\to a^1 \to$ Layer 2 $\to a^2 \to \cdots$. Feature scaling is applied to the input $x$; can it also be applied to the hidden-layer outputs $a^1, a^2, \ldots$?)
A smaller learning rate can help, but training becomes slower.
Difficulty: the statistics of the hidden-layer outputs change during training.
Batch normalization addresses this problem, known as internal covariate shift.
With a batch of examples $x^1, x^2, x^3$, each example goes through the same layers in parallel: $z^i = W^1 x^i$, $a^i = \mathrm{sigmoid}(z^i)$, then $W^2$ is applied to $a^i$, and so on. In practice the whole batch is computed as a single matrix multiplication:

$$\begin{bmatrix} z^1 & z^2 & z^3 \end{bmatrix} = W^1 \begin{bmatrix} x^1 & x^2 & x^3 \end{bmatrix}$$
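A small NumPy illustration of this batch-as-matrix view (the shapes are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer weights: 4 neurons, 3 inputs
X = rng.standard_normal((3, 3))    # batch: columns are x^1, x^2, x^3
Z = W1 @ X                         # one matmul; columns of Z are z^1, z^2, z^3
A = 1.0 / (1.0 + np.exp(-Z))       # element-wise sigmoid gives a^1, a^2, a^3
```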
Batch normalization
Compute $z^i = W^1 x^i$ for each example in the batch, then the batch mean and standard deviation (element-wise, over the three examples here):

$$\mu = \frac{1}{3}\sum_{i=1}^{3} z^i \qquad \sigma = \sqrt{\frac{1}{3}\sum_{i=1}^{3} \left(z^i - \mu\right)^2}$$

Note that $\mu$ and $\sigma$ depend on the $z^i$.
Note: batch normalization cannot be applied with small batch sizes, because $\mu$ and $\sigma$ estimated from only a few examples are too noisy.
Batch normalization
Each $z^i$ is then normalized with the batch statistics and passed through the activation function:

$$\tilde{z}^i = \frac{z^i - \mu}{\sigma} \qquad a^i = \mathrm{sigmoid}\left(\tilde{z}^i\right)$$
How do we do backpropagation through this? Since $\mu$ and $\sigma$ depend on the $z^i$, the gradients must flow through them as well.
Batch normalization
Batch normalization adds a learned element-wise scale and shift after the normalization:

$$\tilde{z}^i = \frac{z^i - \mu}{\sigma} \qquad \hat{z}^i = \gamma \odot \tilde{z}^i + \beta$$

Again, $\mu$ and $\sigma$ depend on the $z^i$, while $\gamma$ and $\beta$ are parameters learned by gradient descent.
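Putting the steps together, here is a minimal sketch of the batch-normalization forward pass at training time (the names and the epsilon term are assumptions; real frameworks fuse this into optimized kernels):

```python
import numpy as np

def batch_norm_train(Z, gamma, beta, eps=1e-5):
    """Batch-normalize Z of shape (batch, features) at training time.

    Returns z_hat plus the batch statistics (mu, sigma), which the caller
    can fold into a moving average for use at test time.
    """
    mu = Z.mean(axis=0)                   # batch mean of each feature
    sigma = Z.std(axis=0)                 # batch std of each feature
    z_tilde = (Z - mu) / (sigma + eps)    # normalize to zero mean, unit variance
    z_hat = gamma * z_tilde + beta        # learned element-wise scale and shift
    return z_hat, mu, sigma
```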
Batch normalization
• At testing stage:
$$\tilde{z} = \frac{z - \mu}{\sigma} \qquad \hat{z} = \gamma \odot \tilde{z} + \beta$$

$\gamma$ and $\beta$ are network parameters, but $\mu$ and $\sigma$ come from a batch, and we do not have batches at the testing stage.
Ideal solution: compute $\mu$ and $\sigma$ over the whole training dataset.
Practical solution: keep a moving average of the $\mu$ and $\sigma$ of the batches seen during training.
(Diagram: $\mu$ recorded across training updates, e.g. $\mu^1$, $\mu^{100}$, $\mu^{300}$, accumulated into the moving average.)
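A sketch of the practical solution, assuming an exponential moving average (the momentum value 0.99 is an illustrative choice, not from the slides):

```python
def update_running_stats(running_mu, running_sigma, mu, sigma, momentum=0.99):
    """Fold the current batch statistics into the running averages."""
    running_mu = momentum * running_mu + (1.0 - momentum) * mu
    running_sigma = momentum * running_sigma + (1.0 - momentum) * sigma
    return running_mu, running_sigma

def batch_norm_test(Z, gamma, beta, running_mu, running_sigma, eps=1e-5):
    """Test-time batch norm: normalize with the stored running statistics
    instead of the per-batch mu and sigma."""
    z_tilde = (Z - running_mu) / (running_sigma + eps)
    return gamma * z_tilde + beta
```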
Batch normalization - Benefits
• BN reduces training time and makes very deep networks trainable.
• Because internal covariate shift is reduced, we can use larger learning rates.
• Gradients are less prone to exploding or vanishing.
• Especially effective for sigmoid, tanh, etc.
• Learning is less affected by initialization.
• BN reduces the demand for regularization.
Why initialization matters less: if $W^1$ is multiplied by a factor $k$, then $z^i$, $\mu$, and $\sigma$ are all multiplied by $k$ as well, so

$$\tilde{z}^i = \frac{z^i - \mu}{\sigma}$$

is kept unchanged, and so is $\hat{z}^i = \gamma \odot \tilde{z}^i + \beta$.
To learn more ……
• Batch Renormalization
• Layer Normalization
• Instance Normalization
• Weight Normalization
• Spectral Normalization
Activation Function: SELU
ReLU
• Rectified Linear Unit (ReLU)
Reasons for using it:
1. Fast to compute
2. Biological motivation
3. Equivalent to an infinite number of sigmoids with different biases
4. Alleviates the vanishing gradient problem
(Plot: $a = z$ for $z > 0$ and $a = 0$ for $z \le 0$, compared with the sigmoid $\sigma(z)$.)
ReLU - variants

Leaky ReLU: $a = z$ for $z > 0$, $a = 0.01z$ for $z \le 0$.

Parametric ReLU: $a = z$ for $z > 0$, $a = \alpha z$ for $z \le 0$, where $\alpha$ is also learned by gradient descent.
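These variants are one-liners in NumPy (a sketch; in practice the $\alpha$ of parametric ReLU is a trainable tensor updated alongside the weights):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # alpha is learned by gradient descent like any other parameter
    return np.where(z > 0, z, alpha * z)
```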
ReLU - variants

Exponential Linear Unit (ELU): $a = z$ for $z > 0$, $a = \alpha\left(e^z - 1\right)$ for $z \le 0$.

Scaled ELU (SELU): both branches are multiplied by $\lambda$:

$$a = \begin{cases} \lambda z & z > 0 \\ \lambda \alpha \left(e^z - 1\right) & z \le 0 \end{cases}$$

$$\alpha = 1.6732632423543772848170429916717$$
$$\lambda = 1.0507009873554804934193349852946$$

https://github.com/bioinf-jku/SNNs
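A direct NumPy sketch of SELU with the constants above (the function name is mine; the reference implementation is in the SNNs repository linked above):

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(z):
    """lambda * z for z > 0, lambda * alpha * (exp(z) - 1) for z <= 0.

    expm1 on the clamped input avoids overflow for large positive z.
    """
    return LAMBDA * np.where(z > 0, z, ALPHA * np.expm1(np.minimum(z, 0.0)))
```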
SELU
SELU ($a = \lambda z$ for $z > 0$, $a = \lambda\alpha\left(e^z - 1\right)$ for $z \le 0$, with $\alpha = 1.673263242\ldots$ and $\lambda = 1.050700987\ldots$) has three properties:

• Both positive and negative values: the whole ReLU family has this property except the original ReLU.
• A saturation region: ELU also has this property.
• A slope larger than 1 in the linear region (since $\lambda > 1$): only SELU has this property.
SELU
Consider a neuron computing $z = a_1 w_1 + \cdots + a_k w_k + \cdots + a_K w_K$ followed by $a = f(z)$. Assume the inputs $a_k$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$; they do not have to be Gaussian. Take $\mu = 0$ and $\sigma^2 = 1$. Then

$$\mu_z = E[z] = \sum_{k=1}^{K} E[a_k w_k] = \mu \sum_{k=1}^{K} w_k = \mu \cdot K\mu_w = 0$$

where $\mu_w$ is the mean of the weights.
SELU
With the same setup (inputs i.i.d. with $\mu = 0$ and $\sigma^2 = 1$) and weights with mean $\mu_w = 0$, we have $\mu_z = 0$ and

$$\sigma_z^2 = E\left[\left(z - \mu_z\right)^2\right] = E\left[z^2\right] = E\left[\left(a_1 w_1 + a_2 w_2 + \cdots\right)^2\right]$$

The cross terms vanish, since $E[a_i a_j w_i w_j] = w_i w_j E[a_i] E[a_j] = 0$, while each squared term gives $E[a_k^2 w_k^2] = w_k^2 E[a_k^2] = w_k^2 \sigma^2$, so

$$\sigma_z^2 = \sum_{k=1}^{K} w_k^2 \sigma^2 = \sigma^2 \cdot K\sigma_w^2 = 1$$

when the weights are chosen with $K\sigma_w^2 = 1$ (the target). Assuming $z$ is then approximately Gaussian with mean 0 and variance 1, SELU keeps the output at mean 0 and variance 1 as well.
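A quick numerical check of this self-normalizing behavior (a simulation added for illustration, not from the slides): feed standard-normal inputs through several layers whose weights have mean 0 and variance $1/K$, apply SELU, and watch the activations keep mean ≈ 0 and variance ≈ 1:

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * np.expm1(np.minimum(z, 0.0)))

rng = np.random.default_rng(0)
K = 512                                 # width of every layer
a = rng.standard_normal((10000, K))     # inputs: mean 0, variance 1
for layer in range(10):
    W = rng.normal(0.0, np.sqrt(1.0 / K), size=(K, K))  # mu_w = 0, K * sigma_w^2 = 1
    a = selu(a @ W)
    print(layer, round(float(a.mean()), 3), round(float(a.var()), 3))
```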
Demo
Source of joke: https://zhuanlan.zhihu.com/p/27336839
(The paper's proof runs to 93 pages.)
SELU is actually more general.
• The latest activation function: Self-Normalizing Neural Networks (SELU)
(Experimental results on MNIST and CIFAR-10.)
Demo
Highway Network & Grid LSTM
Feedforward network: $x \to f_1 \to a^1 \to f_2 \to a^2 \to f_3 \to a^3 \to f_4 \to y$, where $t$ is the layer index:

$$a^t = f_t\left(a^{t-1}\right) = \sigma\left(W^t a^{t-1} + b^t\right)$$

Recurrent network: the same $f$ is applied at every time step $t$, taking a new input $x^t$ each step:

$$h^t = f\left(h^{t-1}, x^t\right) = \sigma\left(W^h h^{t-1} + W^i x^t + b^i\right)$$
Idea: apply the gated structure of recurrent units in a feedforward network.

Feedforward vs. recurrent (contrasted in the sketch below):
1. A feedforward network does not have an input at each step.
2. A feedforward network has different parameters for each layer.
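The contrast in code (shapes and names are my own scaffolding):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Feedforward: a different (W, b) per layer, no new input at each step.
def feedforward(x, layers):                 # layers = [(W1, b1), (W2, b2), ...]
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

# Recurrent: the same (Wh, Wi, b) reused at every step, new input x_t each step.
def recurrent(xs, Wh, Wi, b, h0):
    h = h0
    for x_t in xs:
        h = sigmoid(Wh @ h + Wi @ x_t + b)
    return h
```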
GRU → Highway Network
(Diagram: a GRU cell, with reset gate $r$ and update gate $z$, takes $x^t$ and $h^{t-1}$ and produces $h^t$ and $y^t$ through element-wise products and the $z$ / $(1-z)$ mixing.)

To turn a GRU into a highway layer:
• No input $x^t$ at each step: $a^{t-1}$ is the output of the $(t-1)$-th layer and $a^t$ is the output of the $t$-th layer.
• No output $y^t$ at each step.
• No reset gate.
Highway Network
• Highway Network: Training Very Deep Networks, https://arxiv.org/pdf/1507.06228v2.pdf
• Residual Network: Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385
Residual network: the input is copied across a skip connection and added to the transformed output, $a^t = h' + a^{t-1}$.

Highway network: a gate controller $z$ decides how much of the input to copy:

$$h' = \sigma\left(W a^{t-1}\right) \qquad z = \sigma\left(W' a^{t-1}\right) \qquad a^t = z \odot a^{t-1} + (1 - z) \odot h'$$
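A minimal sketch of one highway layer following these equations (the weight shapes and the sigmoid helper are my own scaffolding):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(a_prev, W, W_gate):
    """One highway layer: the gate z mixes a copy of the input a_prev
    with the transformed h'. With z close to 1 the layer is skipped."""
    h = sigmoid(W @ a_prev)             # candidate output h'
    z = sigmoid(W_gate @ a_prev)        # gate controller
    return z * a_prev + (1.0 - z) * h   # element-wise mix of copy and transform
```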
(Figure: networks drawn from input layer to output layer with different effective depths.)

The highway network automatically determines the number of layers needed!
Highway Network
Grid LSTM
A standard LSTM maps $(c, h)$ plus the input $x$ to $(c', h^t)$ and the output $y$: memory along time only. A Grid LSTM maps $(c, h)$ and $(a, b)$ to $(c', h')$ and $(a', b')$: memory for both time ($c$, $h$) and depth ($a$, $b$).
(Diagram: Grid LSTM blocks unrolled over time steps $t-1, t, t+1$ and stacked in depth. A block at layer $l$ carries $c^t, h^t$ along the time direction and receives $a^{l-1}, b^{l-1}$ from the layer below, emitting $a^l, b^l$; the block at layer $l+1$ takes $a^l, b^l$ and emits $a^{l+1}, b^{l+1}$.)
Grid LSTM

(Diagram: inside a Grid LSTM block, $h$ and $a$ together drive the block input $z$ and the gates $z^i$, $z^f$, $z^o$; the time memory is updated as $c' = z^f \odot c + z^i \odot z$ and $h' = z^o \odot \tanh(c')$, and the depth memory $(a, b)$ is updated to $(a', b')$ in the same manner.)
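A simplified sketch of one 2D Grid LSTM block using the standard LSTM update (the parameter packing is my own, and the exact gate placement in the lecture's diagram may differ slightly):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_transform(m, c, W):
    """Standard LSTM update driven by the shared vector m.
    W stacks the block-input and three gate projections: shape (4*d, len(m))."""
    z, zi, zf, zo = np.split(W @ m, 4)
    c_new = sigmoid(zf) * c + sigmoid(zi) * np.tanh(z)
    h_new = sigmoid(zo) * np.tanh(c_new)
    return c_new, h_new

def grid_lstm_step(c, h, b, a, W_time, W_depth):
    """(c, h) is the memory along time, (b, a) the memory along depth;
    both dimensions are updated from the concatenated hidden vectors."""
    m = np.concatenate([h, a])                      # shared input to both dimensions
    c_new, h_new = lstm_transform(m, c, W_time)     # time dimension
    b_new, a_new = lstm_transform(m, b, W_depth)    # depth dimension
    return c_new, h_new, b_new, a_new
```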
3D Grid LSTM

(Diagram: a 3D Grid LSTM block with memory along three dimensions: $(c, h) \to (c', h')$, $(a, b) \to (a', b')$, and $(e, f) \to (e', f')$.)
3D Grid LSTM
• Images are composed of pixels (e.g., a 3 × 3 image).