
Learning to Warm-Start Bayesian Hyperparameter Optimization

and Task-Adaptive Ensemble of Meta-Learners for Few-Shot Classification

Jungtaek Kim (jtkim@postech.ac.kr)

Machine Learning Group, Department of Computer Science and Engineering, POSTECH,

77 Cheongam-ro, Nam-gu, Pohang 37673, Gyeongsangbuk-do, Republic of Korea

September 11, 2018


Table of Contents

Learning to Warm-Start Bayesian Hyperparameter Optimization
  Motivation
  Main Architecture
  Experiments

Task-Adaptive Ensemble of Meta-Learners for Few-Shot Classification
  Motivation
  Main Architecture
  Experiments


Learning to Warm-Start Bayesian Hyperparameter Optimization


Motivation

- Bayesian hyperparameter optimization usually starts from random initial points.

- Better initializations might help to speed up Bayesian hyperparameter optimization.

- A mapping from hyperparameters to validation error can be learned.

- We attempt to transfer prior knowledge about initializations to a new task, as sketched below.
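The idea can be summarized in a few lines of Python. This is a minimal sketch, not the method from the slides: `objective` stands in for the unknown validation error of a model trained with hyperparameters x, the GP-EI loop uses scikit-learn, and the "transferred" initial points are hard-coded here rather than predicted from dataset meta-features.

```python
# Minimal sketch: warm-starting Bayesian optimization with transferred
# initial points instead of random ones. `objective` is a stand-in for
# the true validation error (hypothetical, for illustration only).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # placeholder for the validation error of a model trained with x
    return np.sin(3 * x[0]) + 0.1 * x[0] ** 2

def expected_improvement(gp, X_cand, y_best):
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    gamma = (y_best - mu) / sigma          # minimization form
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def bayes_opt(initial_points, n_iter=20, bounds=(-2.0, 2.0)):
    X = [np.atleast_1d(x) for x in initial_points]
    y = [objective(x) for x in X]
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iter):
        gp.fit(np.vstack(X), np.array(y))
        X_cand = np.random.uniform(*bounds, size=(1000, 1))
        ei = expected_improvement(gp, X_cand, min(y))
        x_next = X_cand[np.argmax(ei)]
        X.append(x_next)
        y.append(objective(x_next))
    return min(y)

# Cold start: random initial hyperparameters.
random_init = np.random.uniform(-2.0, 2.0, size=(3, 1))
# Warm start: initial points transferred from similar previous tasks
# (hard-coded here; the slides learn them from dataset meta-features).
transferred_init = np.array([[0.5], [-1.5], [1.0]])

print("random init:", bayes_opt(random_init))
print("warm start :", bayes_opt(transferred_init))
```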


Main Architecture

[Architecture diagram: two datasets are each passed through a deep feature extractor followed by a meta-feature extractor (fc layer, fc layer); all weights are shared between the two branches, and the output is the meta-feature distance between the two datasets.]
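A rough PyTorch sketch of how such a Siamese meta-feature extractor could look, under my reading of the diagram above; the layer sizes, the mean-pooling over examples, and the Euclidean distance are assumptions for illustration, not details given on the slide.

```python
# Sketch of a Siamese meta-feature extractor: both branches share weights,
# and the output is a distance between the meta-features of two datasets.
import torch
import torch.nn as nn

class MetaFeatureExtractor(nn.Module):
    def __init__(self, in_dim=512, meta_dim=64):
        super().__init__()
        # "Deep feature extractor": a CNN backbone in practice; a single
        # linear layer here for brevity (assumption).
        self.deep = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        # "fc layer / fc layer" from the diagram.
        self.fc = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                nn.Linear(128, meta_dim))

    def forward(self, dataset):
        # dataset: (num_examples, in_dim); mean-pool over examples so the
        # meta-feature is permutation-invariant (assumption).
        per_example = self.deep(dataset)
        return self.fc(per_example.mean(dim=0))

extractor = MetaFeatureExtractor()   # all weights are shared across branches

def meta_feature_distance(dataset_a, dataset_b):
    # Euclidean distance between the two datasets' meta-features.
    return torch.norm(extractor(dataset_a) - extractor(dataset_b))

d = meta_feature_distance(torch.randn(100, 512), torch.randn(80, 512))
print(d.item())
```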


Experiments (EI)
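For reference, the standard forms of the two acquisition functions compared in the following experiments, assuming a GP posterior with mean mu(x), standard deviation sigma(x), and incumbent best validation error y*; the exact settings used in the experiments, such as the value of kappa, are not given here.

```latex
% Expected improvement for minimization, with
% gamma(x) = (y^* - mu(x)) / sigma(x):
\mathrm{EI}(\mathbf{x}) = \sigma(\mathbf{x})
  \left[ \gamma(\mathbf{x}) \, \Phi(\gamma(\mathbf{x})) + \phi(\gamma(\mathbf{x})) \right]

% GP-UCB, written here in its lower-confidence-bound form for minimization,
% with exploration parameter kappa > 0:
\mathrm{UCB}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa \, \sigma(\mathbf{x})
```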

[Figure: minimum validation error vs. iteration (0-20) with the EI acquisition function on eight datasets: (a) AwA2, (b) Caltech-101, (c) Caltech-256, (d) CIFAR-10, (e) CIFAR-100, (f) CUB200-2011, (g) MNIST, (h) VOC2012. Compared initializations: Random init. (Uniform), Random init. (Latin), Random init. (Halton), Nearest best init. (ADF), Nearest best init. (Bi-LSTM).]
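The Uniform, Latin, and Halton baselines in the legend are standard space-filling initial designs. A small sketch of how such initial hyperparameter points can be drawn, here with scipy.stats.qmc (my choice of library; the dimensions and bounds are hypothetical):

```python
# Three ways of drawing initial hyperparameter points compared in the plots:
# plain uniform sampling, Latin-hypercube sampling, and a Halton sequence.
import numpy as np
from scipy.stats import qmc

n_init, dim = 5, 3                       # e.g., 3 hyperparameters (assumption)
lower = np.array([1e-5, 0.0, 16.0])      # hypothetical bounds
upper = np.array([1e-1, 0.9, 512.0])

uniform = np.random.uniform(size=(n_init, dim))      # i.i.d. uniform in [0,1]^d
latin = qmc.LatinHypercube(d=dim).random(n_init)     # Latin-hypercube design
halton = qmc.Halton(d=dim).random(n_init)            # quasi-random Halton points

for name, unit in [("uniform", uniform), ("latin", latin), ("halton", halton)]:
    points = qmc.scale(unit, lower, upper)   # map [0,1]^d to the search bounds
    print(name, points.round(4))
```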


Experiments (UCB)

[Figure: minimum validation error vs. iteration (0-20) with the UCB acquisition function on the same eight datasets: (j) AwA2, (k) Caltech-101, (l) Caltech-256, (m) CIFAR-10, (n) CIFAR-100, (o) CUB200-2011, (p) MNIST, (q) VOC2012. Compared initializations: Random init. (Uniform), Random init. (Latin), Random init. (Halton), Nearest best init. (ADF), Nearest best init. (Bi-LSTM).]


Task-Adaptive Ensemble of Meta-Learners for Few-Shot Classification

Motivation

- Few-shot classification needs to generalize from training episodes and perform well on test episodes.

- The domain distribution seen by a meta-learner for few-shot classification is usually assumed to be fixed.

- In practice, the domain distribution can vary.

- We build an ensemble of several meta-learners, each of which is trained on episodes from a single dataset; see the sketch below.
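A minimal PyTorch sketch of a task-adaptive ensemble along these lines; the way the weights are computed here (a softmax over negative support-set losses) is an assumption for illustration, not necessarily the weighting used in the actual work.

```python
# Sketch of a task-adaptive ensemble: each meta-learner was trained on
# episodes from a single source dataset; for a new episode, their
# predictions on the query set are combined with task-dependent weights.
import torch
import torch.nn.functional as F

def task_adaptive_ensemble(meta_learners, support_x, support_y, query_x):
    losses, query_logits = [], []
    for learner in meta_learners:
        losses.append(F.cross_entropy(learner(support_x), support_y))
        query_logits.append(learner(query_x))
    # Learners that fit this task's support set better receive more weight
    # (assumed weighting scheme).
    weights = F.softmax(-torch.stack(losses), dim=0)
    stacked = torch.stack(query_logits)        # (n_learners, n_query, n_classes)
    return (weights[:, None, None] * stacked).sum(dim=0)

# Toy usage: three linear "meta-learners" on a 5-way episode (hypothetical).
meta_learners = [torch.nn.Linear(64, 5) for _ in range(3)]
support_x, support_y = torch.randn(25, 64), torch.randint(0, 5, (25,))
query_x = torch.randn(15, 64)
print(task_adaptive_ensemble(meta_learners, support_x, support_y, query_x).shape)
```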

Main Architecture

Experiments