Tips for Training Deep Network
• Training Strategy: Batch Normalization
• Activation Function: SELU
• Network Structure: Highway Network
Batch Normalization
Feature Scaling
Given training examples $x^1, x^2, \ldots, x^r, \ldots, x^R$, for each dimension $i$ compute the mean $m_i$ and the standard deviation $\sigma_i$ over the dataset, then normalize:

$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$

After scaling, the means of all dimensions are 0 and the variances are all 1.
In general, gradient descent converges much faster with feature scaling than without it.
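A minimal NumPy sketch of this scaling step (the function name and the epsilon guard against constant dimensions are my own additions, not from the slides):

```python
import numpy as np

def feature_scaling(X, eps=1e-8):
    """Normalize each dimension (column) of X to zero mean and unit variance.

    X has shape (R, d): R training examples x^1 ... x^R with d dimensions.
    """
    m = X.mean(axis=0)        # mean m_i of each dimension
    sigma = X.std(axis=0)     # standard deviation sigma_i of each dimension
    return (X - m) / (sigma + eps)
```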
What about the Hidden Layers?
(Diagram: $x \to$ Layer 1 $\to a^1 \to$ Layer 2 $\to a^2 \to \cdots$. Feature scaling is applied to the input $x$; can it also be applied to the hidden-layer outputs $a^1, a^2, \ldots$?)
A smaller learning rate can help, but training becomes slower.
Difficulty: the statistics of the hidden-layer outputs change during training.
Batch normalization addresses this problem, known as internal covariate shift.
With a batch of examples $x^1, x^2, x^3$, each example goes through the same layers in parallel: $z^i = W^1 x^i$, $a^i = \mathrm{sigmoid}(z^i)$, then $W^2$ is applied to $a^i$, and so on. In practice the whole batch is computed as a single matrix multiplication:

$$\begin{bmatrix} z^1 & z^2 & z^3 \end{bmatrix} = W^1 \begin{bmatrix} x^1 & x^2 & x^3 \end{bmatrix}$$
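A small NumPy illustration of this batch-as-matrix view (the shapes are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer weights: 4 neurons, 3 inputs
X = rng.standard_normal((3, 3))    # batch: columns are x^1, x^2, x^3
Z = W1 @ X                         # one matmul; columns of Z are z^1, z^2, z^3
A = 1.0 / (1.0 + np.exp(-Z))       # element-wise sigmoid gives a^1, a^2, a^3
```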
Batch normalization
Compute $z^i = W^1 x^i$ for each example in the batch, then the batch mean and standard deviation (element-wise, over the three examples here):

$$\mu = \frac{1}{3}\sum_{i=1}^{3} z^i \qquad \sigma = \sqrt{\frac{1}{3}\sum_{i=1}^{3} \left(z^i - \mu\right)^2}$$

Note that $\mu$ and $\sigma$ depend on the $z^i$.
Note: batch normalization cannot be applied with small batch sizes, because $\mu$ and $\sigma$ estimated from only a few examples are too noisy.
Batch normalization
Each $z^i$ is then normalized with the batch statistics and passed through the activation function:

$$\tilde{z}^i = \frac{z^i - \mu}{\sigma} \qquad a^i = \mathrm{sigmoid}\left(\tilde{z}^i\right)$$
How do we do backpropagation through this? Since $\mu$ and $\sigma$ depend on the $z^i$, the gradients must flow through them as well.
Batch normalization
Batch normalization adds a learned element-wise scale and shift after the normalization:

$$\tilde{z}^i = \frac{z^i - \mu}{\sigma} \qquad \hat{z}^i = \gamma \odot \tilde{z}^i + \beta$$

Again, $\mu$ and $\sigma$ depend on the $z^i$, while $\gamma$ and $\beta$ are parameters learned by gradient descent.
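Putting the steps together, here is a minimal sketch of the batch-normalization forward pass at training time (the names and the epsilon term are assumptions; real frameworks fuse this into optimized kernels):

```python
import numpy as np

def batch_norm_train(Z, gamma, beta, eps=1e-5):
    """Batch-normalize Z of shape (batch, features) at training time.

    Returns z_hat plus the batch statistics (mu, sigma), which the caller
    can fold into a moving average for use at test time.
    """
    mu = Z.mean(axis=0)                   # batch mean of each feature
    sigma = Z.std(axis=0)                 # batch std of each feature
    z_tilde = (Z - mu) / (sigma + eps)    # normalize to zero mean, unit variance
    z_hat = gamma * z_tilde + beta        # learned element-wise scale and shift
    return z_hat, mu, sigma
```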
Batch normalization
• At testing stage:
$$\tilde{z} = \frac{z - \mu}{\sigma} \qquad \hat{z} = \gamma \odot \tilde{z} + \beta$$

$\gamma$ and $\beta$ are network parameters, but $\mu$ and $\sigma$ come from a batch, and we do not have batches at the testing stage.
Ideal solution: compute $\mu$ and $\sigma$ over the whole training dataset.
Practical solution: keep a moving average of the $\mu$ and $\sigma$ of the batches seen during training.
(Diagram: $\mu$ recorded across training updates, e.g. $\mu^1$, $\mu^{100}$, $\mu^{300}$, accumulated into the moving average.)
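A sketch of the practical solution, assuming an exponential moving average (the momentum value 0.99 is an illustrative choice, not from the slides):

```python
def update_running_stats(running_mu, running_sigma, mu, sigma, momentum=0.99):
    """Fold the current batch statistics into the running averages."""
    running_mu = momentum * running_mu + (1.0 - momentum) * mu
    running_sigma = momentum * running_sigma + (1.0 - momentum) * sigma
    return running_mu, running_sigma

def batch_norm_test(Z, gamma, beta, running_mu, running_sigma, eps=1e-5):
    """Test-time batch norm: normalize with the stored running statistics
    instead of the per-batch mu and sigma."""
    z_tilde = (Z - running_mu) / (running_sigma + eps)
    return gamma * z_tilde + beta
```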
Batch normalization - Benefits
• BN reduces training time and makes very deep networks trainable.
• Because internal covariate shift is reduced, we can use larger learning rates.
• Gradients are less prone to exploding or vanishing.
• Especially effective for sigmoid, tanh, etc.
• Learning is less affected by initialization.
• BN reduces the demand for regularization.
Why initialization matters less: if $W^1$ is multiplied by a factor $k$, then $z^i$, $\mu$, and $\sigma$ are all multiplied by $k$ as well, so

$$\tilde{z}^i = \frac{z^i - \mu}{\sigma}$$

is kept unchanged, and so is $\hat{z}^i = \gamma \odot \tilde{z}^i + \beta$.
To learn more ……
• Batch Renormalization
• Layer Normalization
• Instance Normalization
• Weight Normalization
• Spectral Normalization
Activation Function: SELU
ReLU
• Rectified Linear Unit (ReLU)
Reasons for using it:
1. Fast to compute
2. Biological motivation
3. Equivalent to an infinite number of sigmoids with different biases
4. Alleviates the vanishing gradient problem
(Plot: $a = z$ for $z > 0$ and $a = 0$ for $z \le 0$, compared with the sigmoid $\sigma(z)$.)
ReLU - variants

Leaky ReLU: $a = z$ for $z > 0$, $a = 0.01z$ for $z \le 0$.

Parametric ReLU: $a = z$ for $z > 0$, $a = \alpha z$ for $z \le 0$, where $\alpha$ is also learned by gradient descent.
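These variants are one-liners in NumPy (a sketch; in practice the $\alpha$ of parametric ReLU is a trainable tensor updated alongside the weights):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # alpha is learned by gradient descent like any other parameter
    return np.where(z > 0, z, alpha * z)
```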
ReLU - variants

Exponential Linear Unit (ELU): $a = z$ for $z > 0$, $a = \alpha\left(e^z - 1\right)$ for $z \le 0$.

Scaled ELU (SELU): both branches are multiplied by $\lambda$:

$$a = \begin{cases} \lambda z & z > 0 \\ \lambda \alpha \left(e^z - 1\right) & z \le 0 \end{cases}$$

$$\alpha = 1.6732632423543772848170429916717$$
$$\lambda = 1.0507009873554804934193349852946$$

https://github.com/bioinf-jku/SNNs
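A direct NumPy sketch of SELU with the constants above (the function name is mine; the reference implementation is in the SNNs repository linked above):

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(z):
    """lambda * z for z > 0, lambda * alpha * (exp(z) - 1) for z <= 0.

    expm1 on the clamped input avoids overflow for large positive z.
    """
    return LAMBDA * np.where(z > 0, z, ALPHA * np.expm1(np.minimum(z, 0.0)))
```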
SELU
SELU ($a = \lambda z$ for $z > 0$, $a = \lambda\alpha\left(e^z - 1\right)$ for $z \le 0$, with $\alpha = 1.673263242\ldots$ and $\lambda = 1.050700987\ldots$) has three properties:

• Both positive and negative values: the whole ReLU family has this property except the original ReLU.
• A saturation region: ELU also has this property.
• A slope larger than 1 in the linear region (since $\lambda > 1$): only SELU has this property.
SELU
Consider a neuron computing $z = a_1 w_1 + \cdots + a_k w_k + \cdots + a_K w_K$ followed by $a = f(z)$. Assume the inputs $a_k$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$; they do not have to be Gaussian. Take $\mu = 0$ and $\sigma^2 = 1$. Then

$$\mu_z = E[z] = \sum_{k=1}^{K} E[a_k w_k] = \mu \sum_{k=1}^{K} w_k = \mu \cdot K\mu_w = 0$$

where $\mu_w$ is the mean of the weights.
SELU
With the same setup (inputs i.i.d. with $\mu = 0$ and $\sigma^2 = 1$) and weights with mean $\mu_w = 0$, we have $\mu_z = 0$ and

$$\sigma_z^2 = E\left[\left(z - \mu_z\right)^2\right] = E\left[z^2\right] = E\left[\left(a_1 w_1 + a_2 w_2 + \cdots\right)^2\right]$$

The cross terms vanish, since $E[a_i a_j w_i w_j] = w_i w_j E[a_i] E[a_j] = 0$, while each squared term gives $E[a_k^2 w_k^2] = w_k^2 E[a_k^2] = w_k^2 \sigma^2$, so

$$\sigma_z^2 = \sum_{k=1}^{K} w_k^2 \sigma^2 = \sigma^2 \cdot K\sigma_w^2 = 1$$

when the weights are chosen with $K\sigma_w^2 = 1$ (the target). Assuming $z$ is then approximately Gaussian with mean 0 and variance 1, SELU keeps the output at mean 0 and variance 1 as well.
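A quick numerical check of this self-normalizing behavior (a simulation added for illustration, not from the slides): feed standard-normal inputs through several layers whose weights have mean 0 and variance $1/K$, apply SELU, and watch the activations keep mean ≈ 0 and variance ≈ 1:

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * np.expm1(np.minimum(z, 0.0)))

rng = np.random.default_rng(0)
K = 512                                 # width of every layer
a = rng.standard_normal((10000, K))     # inputs: mean 0, variance 1
for layer in range(10):
    W = rng.normal(0.0, np.sqrt(1.0 / K), size=(K, K))  # mu_w = 0, K * sigma_w^2 = 1
    a = selu(a @ W)
    print(layer, round(float(a.mean()), 3), round(float(a.var()), 3))
```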
Demo
Source of joke: https://zhuanlan.zhihu.com/p/27336839
(The paper's proof runs to 93 pages.)
SELU is actually more general.
• The latest activation function: Self-Normalizing Neural Networks (SELU)
(Experimental results on MNIST and CIFAR-10.)
Demo
Highway Network & Grid LSTM
Feedforward network: $x \to f_1 \to a^1 \to f_2 \to a^2 \to f_3 \to a^3 \to f_4 \to y$, where $t$ is the layer index:

$$a^t = f_t\left(a^{t-1}\right) = \sigma\left(W^t a^{t-1} + b^t\right)$$

Recurrent network: the same $f$ is applied at every time step $t$, taking a new input $x^t$ each step:

$$h^t = f\left(h^{t-1}, x^t\right) = \sigma\left(W^h h^{t-1} + W^i x^t + b^i\right)$$
Idea: apply the gated structure of recurrent units in a feedforward network.

Feedforward vs. recurrent (contrasted in the sketch below):
1. A feedforward network does not have an input at each step.
2. A feedforward network has different parameters for each layer.
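The contrast in code (shapes and names are my own scaffolding):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Feedforward: a different (W, b) per layer, no new input at each step.
def feedforward(x, layers):                 # layers = [(W1, b1), (W2, b2), ...]
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

# Recurrent: the same (Wh, Wi, b) reused at every step, new input x_t each step.
def recurrent(xs, Wh, Wi, b, h0):
    h = h0
    for x_t in xs:
        h = sigmoid(Wh @ h + Wi @ x_t + b)
    return h
```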
GRU → Highway Network
(Diagram: a GRU cell, with reset gate $r$ and update gate $z$, takes $x^t$ and $h^{t-1}$ and produces $h^t$ and $y^t$ through element-wise products and the $z$ / $(1-z)$ mixing.)

To turn a GRU into a highway layer:
• No input $x^t$ at each step: $a^{t-1}$ is the output of the $(t-1)$-th layer and $a^t$ is the output of the $t$-th layer.
• No output $y^t$ at each step.
• No reset gate.
Highway Network
• Highway Network: Training Very Deep Networks, https://arxiv.org/pdf/1507.06228v2.pdf
• Residual Network: Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385
Residual network: the input is copied across a skip connection and added to the transformed output, $a^t = h' + a^{t-1}$.

Highway network: a gate controller $z$ decides how much of the input to copy:

$$h' = \sigma\left(W a^{t-1}\right) \qquad z = \sigma\left(W' a^{t-1}\right) \qquad a^t = z \odot a^{t-1} + (1 - z) \odot h'$$
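A minimal sketch of one highway layer following these equations (the weight shapes and the sigmoid helper are my own scaffolding):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(a_prev, W, W_gate):
    """One highway layer: the gate z mixes a copy of the input a_prev
    with the transformed h'. With z close to 1 the layer is skipped."""
    h = sigmoid(W @ a_prev)             # candidate output h'
    z = sigmoid(W_gate @ a_prev)        # gate controller
    return z * a_prev + (1.0 - z) * h   # element-wise mix of copy and transform
```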
(Figure: networks drawn from input layer to output layer with different effective depths.)

The highway network automatically determines the number of layers needed!
Highway Network
Grid LSTM
A standard LSTM maps $(c, h)$ plus the input $x$ to $(c', h^t)$ and the output $y$: memory along time only. A Grid LSTM maps $(c, h)$ and $(a, b)$ to $(c', h')$ and $(a', b')$: memory for both time ($c$, $h$) and depth ($a$, $b$).
(Diagram: Grid LSTM blocks unrolled over time steps $t-1, t, t+1$ and stacked in depth. A block at layer $l$ carries $c^t, h^t$ along the time direction and receives $a^{l-1}, b^{l-1}$ from the layer below, emitting $a^l, b^l$; the block at layer $l+1$ takes $a^l, b^l$ and emits $a^{l+1}, b^{l+1}$.)
Grid LSTM

(Diagram: inside a Grid LSTM block, $h$ and $a$ together drive the block input $z$ and the gates $z^i$, $z^f$, $z^o$; the time memory is updated as $c' = z^f \odot c + z^i \odot z$ and $h' = z^o \odot \tanh(c')$, and the depth memory $(a, b)$ is updated to $(a', b')$ in the same manner.)
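A simplified sketch of one 2D Grid LSTM block using the standard LSTM update (the parameter packing is my own, and the exact gate placement in the lecture's diagram may differ slightly):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_transform(m, c, W):
    """Standard LSTM update driven by the shared vector m.
    W stacks the block-input and three gate projections: shape (4*d, len(m))."""
    z, zi, zf, zo = np.split(W @ m, 4)
    c_new = sigmoid(zf) * c + sigmoid(zi) * np.tanh(z)
    h_new = sigmoid(zo) * np.tanh(c_new)
    return c_new, h_new

def grid_lstm_step(c, h, b, a, W_time, W_depth):
    """(c, h) is the memory along time, (b, a) the memory along depth;
    both dimensions are updated from the concatenated hidden vectors."""
    m = np.concatenate([h, a])                      # shared input to both dimensions
    c_new, h_new = lstm_transform(m, c, W_time)     # time dimension
    b_new, a_new = lstm_transform(m, b, W_depth)    # depth dimension
    return c_new, h_new, b_new, a_new
```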
3D Grid LSTM

(Diagram: a 3D Grid LSTM block with memory along three dimensions: $(c, h) \to (c', h')$, $(a, b) \to (a', b')$, and $(e, f) \to (e', f')$.)
3D Grid LSTM
• Images are composed of pixels (e.g., a 3 × 3 image).