[Diagram: the learning algorithm (function F) unrolled as a recurrence. From the initial parameters φ: Init → θ⁰, then repeatedly Compute Gradient ∇_θ ℓ (on a batch of training data) and Update, producing θ¹, θ², … for a given network structure.]
Can we learn more than the initialization parameters?
The learning algorithm looks like an RNN.
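The recurrence above can be sketched in a few lines: the same update function is applied step after step, with θ as the carried state. The quadratic loss below is a hypothetical stand-in for a real network's training loss.

```python
def loss_grad(theta):
    # gradient of a stand-in loss: (theta - 3)^2
    return 2.0 * (theta - 3.0)

def update(theta, eta=0.1):
    # one recurrent step: theta_t = theta_{t-1} - eta * gradient
    return theta - eta * loss_grad(theta)

theta = 0.0            # the initialization (phi) -- itself something we could learn
for _ in range(100):   # unrolling the "RNN" for 100 steps
    theta = update(theta)
```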
Recurrent Neural Network
• Given a function f: (h′, y) = f(h, x)
[Diagram: the same f applied at every step: (h¹, y¹) = f(h⁰, x¹); (h², y²) = f(h¹, x²); (h³, y³) = f(h², x³); …]
No matter how long the input/output sequence is, we only need one function f.
h and h′ are vectors with the same dimension.
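A minimal numpy sketch of this property, with hypothetical dimensions and fixed random weights: one function f maps (h, x) to (h′, y), and chaining works for any sequence length precisely because h and h′ share a dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x = 4, 3                       # hypothetical dimensions
W_h = rng.normal(size=(d_h, d_h))
W_x = rng.normal(size=(d_h, d_x))
W_y = rng.normal(size=(2, d_h))

def f(h, x):
    # (h', y) = f(h, x); h and h' have the same dimension
    h_new = np.tanh(W_h @ h + W_x @ x)
    return h_new, W_y @ h_new

h = np.zeros(d_h)
for x in [rng.normal(size=d_x) for _ in range(5)]:   # any sequence length works
    h, y = f(h, x)
```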
LSTM
• c changes slowly: c_t is c_{t-1} plus something (an additive update)
• h changes faster: h_t and h_{t-1} can be very different
Naïve RNN
[Diagram: h_t, y_t = f(h_{t-1}, x_t)]
LSTM
[Diagram: the LSTM cell takes x_t, h_{t-1}, c_{t-1} and produces y_t, h_t, c_t; four transforms z, z^i, z^f, z^o are computed from x_t and h_{t-1}.]
z   = tanh(W [x_t; h_{t-1}])
z^i = σ(W^i [x_t; h_{t-1}])
z^f = σ(W^f [x_t; h_{t-1}])
z^o = σ(W^o [x_t; h_{t-1}])
Review: LSTM
[Diagram: one LSTM step; z^i is the input gate, z^f the forget gate, z^o the output gate.]
c_t = z^f ⊙ c_{t-1} + z^i ⊙ z
h_t = z^o ⊙ tanh(c_t)
y_t = σ(W′ h_t)
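The equations above can be sketched directly in numpy; the weight matrices and dimensions below are random placeholders, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_x = 4, 3                    # hypothetical dimensions
W, Wi, Wf, Wo = (rng.normal(size=(d_h, d_h + d_x)) for _ in range(4))
W_out = rng.normal(size=(2, d_h))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(c_prev, h_prev, x):
    xh = np.concatenate([x, h_prev])   # [x_t; h_{t-1}]
    z  = np.tanh(W @ xh)               # candidate cell input
    zi = sigmoid(Wi @ xh)              # input gate
    zf = sigmoid(Wf @ xh)              # forget gate
    zo = sigmoid(Wo @ xh)              # output gate
    c = zf * c_prev + zi * z           # c changes slowly (additive)
    h = zo * np.tanh(c)                # h can change quickly
    y = sigmoid(W_out @ h)             # y_t = sigma(W' h_t)
    return c, h, y

c, h = np.zeros(d_h), np.zeros(d_h)
c, h, y = lstm_step(c, h, rng.normal(size=d_x))
```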
Review: LSTM
[Diagram: the same cell unrolled over steps t and t+1; c_t and h_t produced at step t feed into the step that produces y_{t+1}, c_{t+1}.]
Similar to gradient descent based algorithms
Gradient descent: θ_t = θ_{t-1} − η ∇_θ ℓ
LSTM cell:        c_t = z^f ⊙ c_{t-1} + z^i ⊙ z
The two coincide if c plays the role of θ, with z^f = all "1", z^i = all η, and z = −∇_θ ℓ.
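This correspondence is easy to check numerically; the quadratic loss used below is a hypothetical example.

```python
import numpy as np

def cell_update(c_prev, z, zf, zi):
    # the LSTM cell update from the slide: c_t = zf * c_{t-1} + zi * z
    return zf * c_prev + zi * z

theta = np.array([2.0, -1.0])
grad = 2.0 * theta                     # gradient of ||theta||^2 (toy loss)
eta = 0.1

# fix the gates: forget = all 1, input = all eta, cell input = -gradient
lstm_like = cell_update(theta, z=-grad,
                        zf=np.ones_like(theta),
                        zi=eta * np.ones_like(theta))
sgd = theta - eta * grad               # ordinary gradient descent step
```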
Similar to gradient descent based algorithms
θ_t = θ_{t-1} − η ∇_θ ℓ is the special case of c_t = z^f ⊙ c_{t-1} + z^i ⊙ z with z^f = all "1", z^i = all η, z = −∇_θ ℓ.
• Letting z^i vary gives a dynamic learning rate.
• Letting z^f vary gives something like regularization.
• How about letting the machine learn to determine z^f and z^i from ∇_θ ℓ and other information?
LSTM for Gradient Descent
[Diagram: an "LSTM" replaces the update rule. θ⁰ → "LSTM" → θ¹ → "LSTM" → θ² → "LSTM" → θ³; at each of the 3 training steps the input is −∇_θ ℓ, computed on a batch from the training set; after 3 steps the loss ℓ(θ³) is evaluated on testing data. The "LSTM" parameters are learnable.]
Typical LSTM: c_t = z^f ⊙ c_{t-1} + z^i ⊙ z. Here c_t → θ_t, c_{t-1} → θ_{t-1}, and the input is −∇_θ ℓ; learn to minimize ℓ(θ³).
Independence assumption: the LSTM used has only one cell, shared across all parameters.
Real Implementation
[Diagram: the single shared cell is unrolled once per parameter: one chain carries θ₁ through steps 0–3, fed −∂ℓ/∂θ₁ at every step; another carries θ₂, fed −∂ℓ/∂θ₂; and so on.]
• Reasonable model size
• In typical gradient descent, all the parameters use the same update rule.
• Training and testing model architectures can be different.
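A sketch of the coordinatewise sharing: a single scalar update rule (here a hand-fixed SGD-like rule standing in for a trained LSTM cell) is applied to every parameter independently, so the optimizer's size does not grow with the model's.

```python
import numpy as np

def shared_rule(theta_i, grad_i, eta=0.05):
    # one scalar-to-scalar rule, the same for every coordinate;
    # a trained LSTM cell would replace this hand-fixed rule
    return theta_i - eta * grad_i

def apply_optimizer(theta, grad):
    # apply the shared rule to each parameter coordinate independently
    return np.array([shared_rule(t, g) for t, g in zip(theta, grad)])

theta = np.array([1.0, -2.0, 0.5])
grad = 2.0 * theta                 # gradient of ||theta||^2 (toy loss)
theta_new = apply_optimizer(theta, grad)
```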
Experimental Results
https://openreview.net/forum?id=rJY0-Kcll&noteId=ryq49XyLg
LSTM for Gradient Descent (v2)
[Diagram: the same 3-step unrolling (θ⁰ → θ¹ → θ² → θ³, a batch from the training set and −∇_θ ℓ at each step, ℓ(θ³) evaluated on testing data), but with a second layer of "LSTM" cells whose hidden state m is passed between steps.]
m can store previous gradients.
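One thing the extra memory m could learn to implement is classical momentum, sketched here with hand-picked (hypothetical) coefficients: m accumulates decayed past gradients and drives the update instead of the raw gradient.

```python
def momentum_step(theta, m, grad, eta=0.1, lam=0.9):
    # m summarizes previous gradients, decayed by lam
    m = lam * m - eta * grad
    return theta + m, m

theta, m = 5.0, 0.0
for _ in range(200):
    grad = 2.0 * (theta - 1.0)       # gradient of (theta - 1)^2
    theta, m = momentum_step(theta, m, grad)
```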
Even more crazy idea …
• Input:
  • Training data and their labels
  • Testing data
• Output:
  • Predicted label of testing data
[Diagram: the training data & labels (e.g. images labeled "cat", "dog") and the testing data all go into one network, which performs learning + prediction together.]
Face Verification
https://support.apple.com/zh-tw/HT208109
Training: registration (collecting training data)
Testing: unlock your phone by face
In each task: Few-shot Learning
[Diagram (Meta Learning): each task has a Train set (face images of one person) and a Test image, labeled Yes or No (same person or not). A network is meta-learned over many Training Tasks, then evaluated on Testing Tasks: given the Train and Test faces of a new person, output Yes or No.]
Same approach for Speaker Verification
Siamese Network
[Diagram: two copies of the same CNN (shared parameters) embed the Train image and the Test image; a similarity function compares the two embeddings and outputs a score.]
Same person → large score → Yes; different person → small score → No
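A sketch of the Siamese scoring, with a random linear map standing in for the shared CNN; the threshold that turns the score into Yes/No is an assumption, not a value from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 8))   # shared "CNN": a random linear embed (placeholder)

def embed(x):
    # both branches use the same weights -- that is the Siamese part
    return np.tanh(W @ x)

def score(x1, x2):
    # similarity: negative distance between the two embeddings
    return -np.linalg.norm(embed(x1) - embed(x2))

x1 = rng.normal(size=8)   # "Train" face
x2 = rng.normal(size=8)   # "Test" face (a different input here)
s_same = score(x1, x1)    # identical input: maximal score
s_diff = score(x1, x2)    # large score -> Yes, small score -> No
```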
Siamese Network - Intuitive Explanation
• Binary classification problem: "Are they the same?"
[Diagram: the training set consists of face pairs labeled "same" or "different"; the network takes Face 1 and Face 2 and answers Same or Different for an unseen test pair.]
Siamese Network - Intuitive Explanation
Learning an embedding for faces, e.g. learning to ignore the background
[Diagram: the CNN maps images of the same person as close as possible in embedding space, and images of different people far away.]
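One common way to train such an embedding is a contrastive loss (an assumption here; the slides do not commit to a specific loss): same-person pairs are pulled together, different pairs are pushed at least a margin apart.

```python
import numpy as np

def contrastive_loss(e1, e2, same, margin=1.0):
    d = np.linalg.norm(e1 - e2)
    if same:
        return d ** 2                   # pull same-person pairs as close as possible
    return max(0.0, margin - d) ** 2    # push different pairs at least margin apart

e_a = np.array([0.0, 0.0])   # toy 2-d embeddings
e_b = np.array([0.1, 0.0])   # close to e_a
e_c = np.array([2.0, 0.0])   # far from e_a
```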
To learn more โฆ
โข What kind of distance should we use?
โข SphereFace: Deep Hypersphere Embedding for Face Recognition
โข Additive Margin Softmax for Face Verification
โข ArcFace: Additive Angular Margin Loss for Deep Face Recognition
โข Triplet loss
โข Deep Metric Learning using Triplet Network
โข FaceNet: A Unified Embedding for Face Recognition and Clustering
N-way Few/One-shot Learning
• Example: 5-way 1-shot
Training data (each image represents a class): 一花 (Ichika), 二乃 (Nino), 三玖 (Miku), 四葉 (Yotsuba), 五月 (Itsuki)
Testing data: the network (learning + prediction) outputs the class, e.g. 三玖 (Miku)
Prototypical Network
https://arxiv.org/abs/1703.05175
[Diagram: a shared CNN embeds each of the 5 training images and the test image; comparing the test embedding with each class embedding gives similarities s1 s2 s3 s4 s5, and the most similar class is predicted.]
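A sketch of the prototypical prediction for 5-way 1-shot, with random vectors standing in for CNN embeddings; with one shot per class, each prototype is just that class's single embedding.

```python
import numpy as np

rng = np.random.default_rng(3)
support = rng.normal(size=(5, 16))   # one (stand-in) embedding per class

def predict(query_embedding):
    # s_i = negative squared distance to class i's prototype;
    # the predicted class is the most similar one
    s = -np.sum((support - query_embedding) ** 2, axis=1)
    return int(np.argmax(s))

# a query close to class 2's embedding should be classified as class 2
query = support[2] + 0.01 * rng.normal(size=16)
```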
Matching Network
https://arxiv.org/abs/1606.04080
[Diagram: like the prototypical network, but the training-example embeddings are first processed by a bidirectional LSTM before the similarities s1 … s5 are computed.]
Considering the relationship among the training examples
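The matching-network readout can be sketched as attention over the support set: softmax similarities weight the support labels. The bidirectional-LSTM context encoder is omitted here; raw embeddings stand in for context-aware ones.

```python
import numpy as np

def matching_predict(query, support, labels_onehot):
    sims = support @ query              # similarity to each support example
    a = np.exp(sims - sims.max())
    a = a / a.sum()                     # attention weights (softmax)
    return a @ labels_onehot            # weighted sum of support labels

support = np.eye(3)                     # 3 toy support embeddings
labels = np.eye(3)                      # example i belongs to class i
probs = matching_predict(np.array([5.0, 0.0, 0.0]), support, labels)
```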
Few-shot learning for imaginary data
https://arxiv.org/abs/1801.05401