Restrict learning to a model-dependent “easy” set of samples

General form of objective:

Introduce indicator of “easiness” v_i:

K determines the threshold for a sample to be considered easy; it is annealed over successive iterations until all samples are used

Self-Paced Learning for Latent Variable Models
M. Pawan Kumar, Ben Packer, and Daphne Koller

Motivation

Learning Latent Variable Models

Experiments

Intuitions from Human Learning:
• all information at once may be confusing => bad local minima
• start with “easy” examples the learner is prepared to handle

Maximize the log-likelihood: max_w Σ_i log P(x_i, y_i; w)
Iterate:
• Find the expected value of the hidden variables using the current w
• Update w to maximize the log-likelihood given this expectation
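The poster describes EM only at this level of generality; the following minimal sketch runs EM for a two-component 1-D Gaussian mixture, purely to illustrate the E-step/M-step alternation. The mixture model and all names (em_gmm, gamma, etc.) are illustrative assumptions, not the models used in the experiments.

# Minimal EM sketch for a 1-D, two-component Gaussian mixture.
# Illustrative only; not the latent variable models from the poster.
import numpy as np

def em_gmm(x, n_iter=50):
    # Initialize parameters w = (means, variances, mixing weight).
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = 0.5
    for _ in range(n_iter):
        # E-step: expected value of the hidden component assignment h given current w.
        p0 = pi * np.exp(-(x - mu[0])**2 / (2 * var[0])) / np.sqrt(2 * np.pi * var[0])
        p1 = (1 - pi) * np.exp(-(x - mu[1])**2 / (2 * var[1])) / np.sqrt(2 * np.pi * var[1])
        gamma = p0 / (p0 + p1)          # P(h = 0 | x, w)
        # M-step: update parameters to maximize the expected log-likelihood.
        mu[0] = np.sum(gamma * x) / np.sum(gamma)
        mu[1] = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
        var[0] = np.sum(gamma * (x - mu[0])**2) / np.sum(gamma)
        var[1] = np.sum((1 - gamma) * (x - mu[1])**2) / np.sum(1 - gamma)
        pi = gamma.mean()
    return mu, var, pi

For example, em_gmm(np.concatenate([np.random.randn(100) - 3, np.random.randn(100) + 3])) should recover means near −3 and 3.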

Self-Paced Learning


(Figure panels: Large K, Medium K, Small K)

Optimization

Object Classification – Mammals Dataset
Image label y is the object class only; h is the bounding box. Ψ(x_i, y_i, h_i) is HOG features in the bounding box (offset by class).

Motif Finding – UniProbe Dataset

x is DNA sequence, h is motif position, y is binding affinity

Handwriting Recognition – MNIST
x is the raw image, y is the digit, h is the image rotation; a linear kernel is used.

(Results shown for digit pairs: 1 vs. 7, 2 vs. 7, 3 vs. 8, 8 vs. 9)

Noun Phrase Coreference – MUC6
x consists of pairwise features between pairs of nouns; y is a clustering of the nouns; h specifies a forest over the nouns such that each tree is a cluster of nouns.

Aim: To learn an accurate set of parameters for latent variable models

(Cartoon: Standard Learning “??” vs. Self-Paced Learning “Okay… Got it!”)

Latent Variable Models

x : input or observed variables
y : output or observed variables
h : hidden/latent variables

(Example: image x, bounding box h, label y = “Deer”)

Goal: Given D = {(x_1, y_1), …, (x_n, y_n)}, learn parameters w.

Expectation-Maximization for Maximum Likelihood

Latent Struct SVM [2]
Minimize an upper bound on the risk:
min_w ||w||^2 + C·Σ_i max_{y',h'} [w·Ψ(x_i, y', h') + Δ(y_i, y', h')] − C·Σ_i max_h [w·Ψ(x_i, y_i, h)]
Iterate:
• Impute the hidden variables
• Update the weights to minimize the upper bound on the risk given these hidden variables
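As a concrete toy illustration of this alternation, here is a minimal sketch in which y and h range over small discrete sets so that imputation and loss-augmented inference can be done by enumeration, and the convex upper bound is minimized by plain subgradient descent. The feature map psi, the sets Y and H, and all other names are assumptions for illustration, not the formulation used in the poster's experiments.

# Toy CCCP-style alternation for the latent structural SVM objective above.
import numpy as np

Y = [0, 1]       # possible labels
H = [0, 1, 2]    # possible latent values

def psi(x, y, h):
    # Joint feature map: place x in the block indexed by (y, h).
    f = np.zeros(len(Y) * len(H) * len(x))
    block = (y * len(H) + h) * len(x)
    f[block:block + len(x)] = x
    return f

def delta(y_true, y_pred):
    return 0.0 if y_true == y_pred else 1.0

def cccp_latent_svm(X, Y_true, C=1.0, outer_iters=10, inner_iters=100, lr=0.01):
    w = np.zeros(len(Y) * len(H) * len(X[0]))
    for _ in range(outer_iters):
        # Step 1: impute the hidden variables using the current w.
        H_imputed = [max(H, key=lambda h: w @ psi(x, y, h))
                     for x, y in zip(X, Y_true)]
        # Step 2: with h fixed, minimize the now-convex upper bound by subgradient descent.
        for _ in range(inner_iters):
            grad = w.copy()   # subgradient of the ||w||^2 / 2 regularizer
            for x, y, h in zip(X, Y_true, H_imputed):
                # Loss-augmented inference over (y', h') by enumeration.
                y_hat, h_hat = max(((yp, hp) for yp in Y for hp in H),
                                   key=lambda yh: w @ psi(x, yh[0], yh[1]) + delta(y, yh[0]))
                grad += C * (psi(x, y_hat, h_hat) - psi(x, y, h))
            w -= lr * grad
    return w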

Initialize K to be large
Iterate:
  Run inference over h
  Alternately update w and v, until convergence:
    v is set by sorting l_i(w) and comparing to the threshold 1/K
    Perform the normal update for w over this subset of the data
  Anneal K ← K/μ
Until all v_i = 1 and the objective cannot be reduced within tolerance
(A minimal sketch of this loop is given below.)
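A minimal sketch of the self-paced loop, using a plain squared loss l_i(w) with no latent variables and omitting the regularizer r(w), purely to show the closed-form v update, the restricted w update, and the annealing of K. All names and default values are illustrative assumptions, not the poster's solver.

# Self-paced loop sketch: alternate v and w updates, then anneal K.
import numpy as np

def self_paced_regression(X, y, K=100.0, mu=1.3, inner_iters=200, lr=0.01, tol=1e-6):
    n, d = X.shape
    w = np.zeros(d)
    while True:
        prev_obj = np.inf
        # Alternate between updating v (closed form) and w (gradient steps).
        for _ in range(inner_iters):
            losses = (X @ w - y) ** 2                 # l_i(w)
            v = (losses < 1.0 / K).astype(float)      # v_i = 1 iff l_i(w) < 1/K
            if v.sum() > 0:
                # Average gradient of the selected samples' squared losses.
                grad = 2 * X.T @ (v * (X @ w - y)) / v.sum()
                w -= lr * grad
            obj = np.sum(v * losses) - np.sum(v) / K  # objective with r(w) omitted
            if prev_obj - obj < tol:
                break
            prev_obj = obj
        if np.all(v == 1):   # all samples included and converged: done
            return w
        K /= mu              # anneal K so that the threshold 1/K grows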

Easier subsets in early iterations avoid learning from samples whose hidden variables are imputed incorrectly.

(Figure panels: Iteration 1, Iteration 3, Iteration 5, Iteration 7)

min_w r(w) + Σ_i l_i(w)

min_{w,v} r(w) + Σ_i v_i l_i(w) − (1/K)·Σ_i v_i
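For fixed w, this objective is linear in each v_i: including sample i changes it by l_i(w) − 1/K, so the optimal v_i is 1 exactly when l_i(w) < 1/K (the sample is “easy”) and 0 otherwise. A one-line check, with made-up numbers:

import numpy as np
losses = np.array([0.02, 0.4, 1.7])    # per-sample losses l_i(w), illustrative values
K = 2.0                                # threshold 1/K = 0.5
v = (losses < 1.0 / K).astype(int)     # -> array([1, 1, 0])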

h = Bounding Box

Bengio et al. [1]: user-specified ordering

“Self-paced” schedule of examples is automatically set by learner

• task-specific
• onerous on user

• “easy for human” ≠ “easy for computer”
• “easy for Learner A” ≠ “easy for Learner B”

(Figure: Object Classification results. Bar charts of Training Error (%), Objective, and Test Error (%) for CCCP vs. SPL on Folds 1–5.)

Discussion

Compare Self-Paced Learning to standard CCCP as in [2]

• Self-paced strategy outperforms state of the art
• Global solvers for biconvex optimization may improve accuracy
• Method is ideally suited to handle multiple levels of annotations

[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[2] C.-N. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.

(Figure: Motif Finding results. Bar charts of Training Error (%), Objective, and Test Error (%) for CCCP vs. SPL on Proteins 1–5.)