

Meta Learning (Part 2): Gradient Descent as LSTM
Hung-yi Lee

๐œƒ0 ๐œƒ1

โˆ‡๐‘™

Init

Compute Gradient

Update

TrainingData

๐œƒ2

โˆ‡๐‘™

Compute Gradient

Update

TrainingData

๐œƒ

NetworkStructure

Learning Algorithm(Function ๐น)

๐œ™

Can we learn more than the initialization parameters?

The learning algorithm looks like an RNN.

Recurrent Neural Network

• Given a function f: (h′, y) = f(h, x)

[Figure: the RNN unrolled. h⁰ and x¹ go into f, producing h¹ and y¹; then h¹ and x² produce h² and y², h² and x³ produce h³ and y³, and so on.]

No matter how long the input/output sequence is, we only need one function f. h and h′ are vectors with the same dimension.
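A minimal NumPy sketch of this idea (the weight matrices W_h, W_x, W_y are illustrative): one function f, applied at every step, regardless of sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_h, dim_x, dim_y = 4, 3, 2
# Illustrative weights; in a real RNN these are learned.
W_h = rng.normal(size=(dim_h, dim_h))
W_x = rng.normal(size=(dim_h, dim_x))
W_y = rng.normal(size=(dim_y, dim_h))

def f(h, x):
    """One RNN step: (h, x) -> (h', y). h and h' have the same dimension."""
    h_new = np.tanh(W_h @ h + W_x @ x)
    y = W_y @ h_new
    return h_new, y

# The same f handles a sequence of any length.
h = np.zeros(dim_h)
for x in [rng.normal(size=dim_x) for _ in range(5)]:
    h, y = f(h, x)
```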

LSTM

• c changes slowly: cᵗ is cᵗ⁻¹ plus something.
• h changes faster: hᵗ and hᵗ⁻¹ can be very different.

[Figure: a naive RNN maps (hᵗ⁻¹, xᵗ) to (hᵗ, yᵗ); an LSTM additionally carries the cell state, mapping (cᵗ⁻¹, hᵗ⁻¹, xᵗ) to (cᵗ, hᵗ, yᵗ).]

The four signals z, zⁱ, zᶠ, zᵒ are computed from the concatenation of xᵗ and hᵗ⁻¹:

z = tanh(W · [xᵗ; hᵗ⁻¹])
zⁱ = σ(Wⁱ · [xᵗ; hᵗ⁻¹])
zᶠ = σ(Wᶠ · [xᵗ; hᵗ⁻¹])
zᵒ = σ(Wᵒ · [xᵗ; hᵗ⁻¹])

Review: LSTM

z is the candidate input, zⁱ the input gate, zᶠ the forget gate, zᵒ the output gate:

cᵗ = zᶠ ⊙ cᵗ⁻¹ + zⁱ ⊙ z
hᵗ = zᵒ ⊙ tanh(cᵗ)
yᵗ = σ(W′hᵗ)

[Figure: the same cell unrolled over steps t and t+1; the cᵗ and hᵗ computed at step t feed into step t+1.]


Similar to gradient descent based algorithms

A gradient-descent step maps θᵗ⁻¹ to θᵗ:

θᵗ = θᵗ⁻¹ − η∇θl

Compare the LSTM cell update cᵗ = zᶠ ⊙ cᵗ⁻¹ + zⁱ ⊙ z. Let the cell state c play the role of θ and feed in z = −∇θl; with zᶠ = all "1" and zⁱ = all η, the two updates are identical.
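A quick numeric check of this correspondence: with the forget gate frozen to all 1 and the input gate frozen to all η, the LSTM cell update is exactly a gradient-descent step (the numbers below are arbitrary).

```python
import numpy as np

eta = 0.1
theta = np.array([1.0, -2.0, 0.5])        # plays the role of the cell state c
grad = np.array([0.3, -0.1, 0.4])         # some gradient of the loss l

# LSTM cell update with frozen gates: z_f = all 1, z_i = all eta, z = -grad
z_f = np.ones_like(theta)
z_i = eta * np.ones_like(theta)
z = -grad
theta_lstm = z_f * theta + z_i * z

# Plain gradient descent
theta_gd = theta - eta * grad

assert np.allclose(theta_lstm, theta_gd)  # identical updates
```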

If the gates are learned rather than fixed:

• A learned zⁱ acts as a dynamic learning rate, chosen per parameter and per step.
• A learned zᶠ below 1 shrinks θᵗ⁻¹ before the update: something like regularization (weight decay).
• How about letting the machine learn to determine zᶠ and zⁱ from −∇θl and other information?

LSTM for Gradient Descent

[Figure: θ⁰ → "LSTM" → θ¹ → "LSTM" → θ² → "LSTM" → θ³. At each of the 3 training steps, the "LSTM" takes −∇θl computed on a batch from the training set. After the 3 steps, the loss l(θ³) on the testing data is the quantity to minimize; the "LSTM" itself is learnable.]

Typical LSTM: cᵗ = zᶠ ⊙ cᵗ⁻¹ + zⁱ ⊙ z, with cᵗ ↔ θᵗ and z ↔ −∇θl. The whole chain is trained to minimize l(θ³).

Real Implementation

Independence assumption: the "LSTM" used has only one cell, shared across all parameters.

[Figure: the same "LSTM" cell separately updates each parameter: θ₁⁰ → θ₁¹ → θ₁² → θ₁³ driven by −∇θ₁l, and θ₂⁰ → θ₂¹ → θ₂² → θ₂³ driven by −∇θ₂l.]

➢ Reasonable model size.
➢ In typical gradient descent, all the parameters use the same update rule.
➢ Training and testing model architectures can be different, as the sketch below shows.

Experimental Results

https://openreview.net/forum?id=rJY0-Kcll&noteId=ryq49XyLg

๐‘๐‘ก = ๐‘ง๐‘“โจ€๐‘๐‘กโˆ’1+๐‘ง๐‘–โจ€๐‘ง๐œƒ๐‘ก ๐œƒ๐‘กโˆ’1 โˆ’โˆ‡๐œƒ๐‘™

Parameter update depends on not only current gradient, but previous gradients.

LSTM for Gradient Descent (v2)

[Figure: as before, θ⁰ → "LSTM" → θ¹ → θ² → θ³ over 3 training steps, each step driven by −∇θl from a batch of training data, with the meta-objective l(θ³) on the testing data. A second chain m⁰ → "LSTM" → m¹ → m² runs underneath and feeds into the first.]

m can store previous gradients.
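This is reminiscent of momentum: classic momentum keeps exactly such an accumulator of previous gradients, but with a fixed hand-written rule. A minimal sketch (eta and lam are illustrative constants; v2 instead learns how m is maintained and used):

```python
import numpy as np

eta, lam = 0.1, 0.9
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)           # m accumulates previous gradients

for grad in [np.array([0.3, -0.1]), np.array([0.2, 0.0])]:
    m = lam * m + grad             # fixed rule; v2 would learn this
    theta = theta - eta * m        # update uses the gradient history
```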

Experimental Results

https://arxiv.org/abs/1606.04474

Meta Learning (Part 3)
Hung-yi Lee

Even more crazy idea …

• Input:
  • Training data and their labels
  • Testing data
• Output:
  • Predicted label of testing data

[Figure: the training data with their labels (cat, cat, dog) and a testing image all go into a single network that performs learning + prediction in one pass.]

Face Verification

https://support.apple.com/zh-tw/HT208109

Training: registration (collecting training data).
Testing: unlock your phone by face.

In each task: few-shot learning.

[Figure: each training task is a verification task for one person, with a Train set (the registered face) and a Test set; the label is Yes if the test face is the same person, No otherwise. Meta learning runs across many such training tasks; in a testing task, the network sees a new person's Train/Test pair and must answer Yes or No.]

Same approach for Speaker Verification.

Siamese Network

[Figure: the Train (registered) face and the Test face each pass through a CNN; the two CNNs share parameters, and each outputs an embedding. A similarity score between the two embeddings decides the answer: a large score for a pair of the same person (Yes), a small score for a pair of different people (No).]


Siamese Network - Intuitive Explanation

• Binary classification problem: "Are they the same?"

[Figure: the training set is face pairs labeled same / different; the testing set is a new pair labeled "?". A network takes Face1 and Face2 and outputs Same or Different.]

Siamese Network - Intuitive Explanation

Learning an embedding for faces, e.g. learning to ignore the background: the CNN maps faces of the same person as close as possible in the embedding space, and faces of different people far away.
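A minimal sketch of a Siamese verification network in PyTorch, assuming an illustrative toy encoder and cosine similarity as the score (the ×5 scaling that turns the score into a logit is an arbitrary choice for the sketch):

```python
import torch
import torch.nn.functional as F

# Shared encoder: the same weights embed both faces (illustrative sizes).
encoder = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 32, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())

def score(face1, face2):
    """Similarity between two face batches; large = same person."""
    e1, e2 = encoder(face1), encoder(face2)   # shared weights -> Siamese
    return F.cosine_similarity(e1, e2)

# Training pairs: label 1 for same person, 0 for different.
face1, face2 = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
labels = torch.tensor([1., 0., 1., 0.])
loss = F.binary_cross_entropy_with_logits(5 * score(face1, face2), labels)
```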

To learn more …

• What kind of distance should we use?
  • SphereFace: Deep Hypersphere Embedding for Face Recognition
  • Additive Margin Softmax for Face Verification
  • ArcFace: Additive Angular Margin Loss for Deep Face Recognition
• Triplet loss
  • Deep Metric Learning using Triplet Network
  • FaceNet: A Unified Embedding for Face Recognition and Clustering

N-way Few/One-shot Learning

• Example: 5-way 1-shot

[Figure: the training data are five images, one per class (the quintuplets Ichika, Nino, Miku, Yotsuba, Itsuki); the testing data is one image. The network performs learning + prediction in one pass and outputs "Miku".]

Prototypical Network

https://arxiv.org/abs/1703.05175

[Figure: each of the five training images is embedded by the same CNN; the testing image is embedded too, and its similarity to each training embedding gives the scores s¹ s² s³ s⁴ s⁵.]

Matching Network

https://arxiv.org/abs/1606.04080

[Figure: the training images are embedded jointly by a bidirectional LSTM; the testing image is embedded by a CNN, and similarities give the scores s¹ … s⁵.]

Unlike the Prototypical Network, this considers the relationships among the training examples.
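A minimal sketch of the matching idea: attention weights from query-support similarities, combined with the support labels. This omits the bidirectional-LSTM context encoding that the paper uses, which is exactly what lets the support embeddings depend on each other.

```python
import torch
import torch.nn.functional as F

def matching_predict(emb_support, support_labels, emb_query, n_way):
    """Attention over the support set: softmax over similarities."""
    sims = F.cosine_similarity(emb_query.unsqueeze(1),
                               emb_support.unsqueeze(0), dim=-1)  # [Q, N]
    attn = sims.softmax(dim=1)
    one_hot = F.one_hot(support_labels, n_way).float()            # [N, C]
    return attn @ one_hot          # class probabilities per query

emb_support, emb_query = torch.randn(5, 64), torch.randn(3, 64)
probs = matching_predict(emb_support, torch.arange(5), emb_query, n_way=5)
```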

Relation Network

• ???

https://arxiv.org/abs/1711.06025

Few-shot learning for imaginary data

https://arxiv.org/abs/1801.05401
