Optimization of Machine Learning Hyperparameters
Dr. Frank Hutter
Head of Emmy Noether Research Group on Learning, Optimization, and Automated Algorithm Design
Computer Science Institute University of Freiburg, Germany
July 2014
Motivation
• The machine learning algorithms you have learned about had several degrees of freedom
– E.g., in neural networks: regularization, momentum, learning rate, number of layers, number of units, …
• So far, how have you been setting these in practice?
– Changing one parameter at a time
– Grid search
• Was this tedious? Time-consuming?
– Imagine you have millions of data points and each evaluation takes hours or days…
2
High-level Learning Goals
• After this module, you can …
– Effectively use modern hyperparameter optimization methods
– Explain the concept of over-fitting
– Describe what measures can be taken to avoid over-fitting
– Describe the core mechanisms of several types of hyperparameter optimization methods
– Reason about the pros and cons of using a particular hyperparameter optimization method for a particular problem
– Derive the mechanisms behind Bayesian optimization
3
Outline of Today’s Class
• Generalization to previously unseen data
• Overview of hyperparameter optimization methods
• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes
4
Learning and Generalization
• Much of supervised machine learning is about selecting a model from a given hypothesis space that
– Explains the seen data well
– Is likely to also work well for new data
• Example: Which model will describe new data better? The polynomial or the line?
5
Image source: Wikipedia
Occam’s razor (or Ockham’s razor)
“Numquam ponenda est pluralitas sine necessitate”
[Plurality must never be posited without necessity.]
• General problem solving principle
– In the absence of evidence to the contrary, prefer the simplest explanation.
– Adapted to machine learning: all things being equal prefer the simplest model.
6
William of Ockham, 1287–1347, philosopher and theologian.
Image source: Wikipedia
Occam’s razor in practice
• We need to trade off model complexity and model fit
• Model fit
– E.g., likelihood of the data under the model: P(data|model)
– In general: some loss of the predictor on the training data
• Model complexity
– E.g., number of free parameters
– E.g., number of effective dimensions
– E.g., VC dimension [Vapnik–Chervonenkis, 1971]
• Use regularization to penalize complex models: minimize training loss + C * regularization cost
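In symbols (generic notation, not tied to any one model), the objective from the last bullet reads:

$$\min_{w}\;\; \sum_{i=1}^{n} L\big(y_i, f_w(x_i)\big) \;+\; C \cdot \Omega(w)$$

where L is the training loss, Ω(w) the complexity penalty (e.g., ‖w‖²), and C the regularization strength, which is itself a hyperparameter (see the next slide).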
7
Parameters vs. Hyperparameters
• Most machine learning algorithms optimize parameters under the hood
– E.g., weights in linear regression and neural networks
– E.g., deep learning: millions of parameters
• Standard approach: minimize training loss + C * regularization cost
– Using standard gradient-based optimizers
• Hyperparameters: decisions left to algorithm designer
– How complex a model to use?
– How to set C?
– How many layers/which structure of deep networks to use?
8
How to set the hyperparameters?
• We wish to achieve good generalization performance
• In practice, we need to try several values and empirically evaluate how well they generalize
– Train the model for a given hyperparameter setting
– Evaluate the model’s generalization performance
• Which data set should we use to evaluate the model’s generalization performance?
1. The same data set that we use all the time: all the data we have
2. We split the data we have available: use one part for training the model, another disjoint part for evaluating generalization
9
Interactive question
• Which data set should we use to evaluate a model’s generalization performance empirically?
– We split the data we have available: use one part for training the model, another disjoint part for evaluating performance
• Why?
– The assumption we make is that future data will come from the same "true" distribution as our current data.
– Then, using an unseen sample of that distribution gives us an unbiased estimate of generalization to future data
– If our assumption is false, then we must control for concept drift … a topic for another lecture ;-)
10
Overfitting & early stopping heuristic
• Too little data / too little regularization:
– The error on the training data keeps on decreasing
– After too much training, the error on separate validation data starts to increase
• Early stopping heuristic: stop training at that point
[Figure: training error and validation error as a function of training time]
11
Image source: Wikipedia
Generalization of performance
• The dark ages
– Student tweaks hyperparameters until it works
– Supervisor may not even know about the tuning
– Results get published without acknowledging the tuning
– Of course, the approach does not generalize
• A step further
– Optimize parameters on a training set
– Evaluate generalization on a test set
• Another step further: avoid “peeking” at the test set
– Put test set into a vault (i.e., never look at it)
– Split training set again into training and validation set
– Only use the test set in the end to generate results for publication
12
Cross-validation for model selection
• Problem: single split of training data into training/validation might not be representative
• Standard solution: average performance across k cross-validation folds (here: k=3)
13
[Diagram: k=3 cross-validation folds; in each fold a different third of the data serves as the validation set and the rest as training data]
Cross-validation for model selection
• Standard model selection using cross-validation (CV):
• A denotes a learning algorithm under consideration
• We apply A to the training portion D_train^(i) of fold i and evaluate the resulting model on the disjoint validation portion D_valid^(i)
• We call the resulting loss L(A, D_train^(i), D_valid^(i))
• We average these losses over the k cross-validation folds and pick the best-performing learning algorithm
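As a concrete sketch in the style of the code slides later in this lecture, k-fold CV for a set of candidate learning algorithms might look as follows. Here `train_fn` (a cell array of training function handles), `loss_fn`, `X`, and `y` are placeholders for this example, not objects defined anywhere in the lecture:

```matlab
% k-fold cross-validation for model selection (illustrative sketch)
k = 5;                                     % number of folds
N = size(X,1);
fold_id = mod(randperm(N), k) + 1;         % random fold assignment 1..k for each data point
cvloss = zeros(numel(train_fn), 1);        % average CV loss per candidate algorithm
for a = 1:numel(train_fn)                  % loop over candidate learning algorithms
  for fold = 1:k
    tr = (fold_id ~= fold); va = (fold_id == fold);
    model = train_fn{a}(X(tr,:), y(tr));                            % train on k-1 folds
    cvloss(a) = cvloss(a) + loss_fn(model, X(va,:), y(va)) / k;     % validate on held-out fold
  end
end
[~, best] = min(cvloss);                   % pick the algorithm with the lowest average CV loss
```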
14
Cross-validation for further tasks
• Standard model selection using cross-validation (CV):
• Standard hyperparameter optimization using CV:
• Combination of the two:
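The formulas on this slide did not survive the transcript; in generic notation (assumed here, not taken verbatim from the slide), the three objectives are roughly:

$$A^* \in \arg\min_{A \in \mathcal{A}} \; \frac{1}{k}\sum_{i=1}^{k} L\big(A,\, D_{\text{train}}^{(i)},\, D_{\text{valid}}^{(i)}\big)$$

$$\lambda^* \in \arg\min_{\lambda \in \Lambda} \; \frac{1}{k}\sum_{i=1}^{k} L\big(A_\lambda,\, D_{\text{train}}^{(i)},\, D_{\text{valid}}^{(i)}\big)$$

$$(A^*, \lambda^*) \in \arg\min_{A \in \mathcal{A},\, \lambda \in \Lambda_A} \; \frac{1}{k}\sum_{i=1}^{k} L\big(A_\lambda,\, D_{\text{train}}^{(i)},\, D_{\text{valid}}^{(i)}\big)$$

where L(A, D_train, D_valid) denotes the validation loss of the model that A produces when trained on D_train.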
15
Cross-validation Details
• How to choose the number of folds k?
– Too low: noisy approximations of generalization
→ poor generalization to test instances
– Too high: evaluating a configuration is expensive
→ optimization process is slow. Also, performance in the folds is not independent, so increasing k does not always improve generalization
• Theory is lacking
• In practice, typically choose k=5 or k=10 [Kohavi, 1995]
• Practical speedup trick [Hutter, Hoos & Leyton-Brown, 2011]
– We do not need to evaluate all folds for each configuration
– Example: the best configuration so far has an average CV error of 0.1 over 5 folds (i.e., a total of 0.5); a new configuration has error 0.6 in the first fold alone → it can already be discarded without evaluating the remaining folds
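One simple way to realize this trick (a sketch assuming non-negative fold losses; the actual procedure in the cited paper is more refined, and `eval_fold` is a hypothetical helper that trains and validates on one fold):

```matlab
% fold-by-fold evaluation of a new configuration, with early rejection
incumbent_avg = 0.1;                         % best average CV error found so far
total = 0;
for fold = 1:k
  total = total + eval_fold(config, fold);   % train + validate on this fold
  if total > incumbent_avg * k               % can no longer beat the incumbent's k-fold average
    break;                                   % e.g., 0.6 in the first fold already exceeds 0.1*5
  end
end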
16
Outline of Today’s Class
• Generalization to previously unseen data
• Overview of hyperparameter optimization methods
• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes
17
Manual Search
Start with some configuration
repeat
Modify a single parameter
if performance on a benchmark set degrades then
undo modification
until no more improvement possible (or "good enough")
(manually-executed hill climbing)
18
Aka “Optimization by Graduate Student”
Pros and cons of manual search
• Pros
– Student gains some intuition → helps understanding
– Student can notice irregularities, e.g.
• A configuration is worse than expected → find bugs
• E.g., aliasing in filters learned by a convolutional network [Zeiler & Fergus, 2013]
• A run dies because of temporary file system errors → repeat the run
• Cons
– “Blind” search: inefficient use of student’s time
– Sometimes “false intuition”: e.g., based on a different dataset and a different architecture a year ago
19
Simple Search Strategy: Grid Search
20
Image source: Bergstra et al, Random Search for Hyperparameter Optimization, JMLR 2012
• Select D values for each of N hyperparameters, try all D^N combinations
• Direct feedback:
– Which values work/don’t work for each setting
– Which parameters are important? Are there interactions?
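A minimal sketch of grid search for two hyperparameters; the names `C_grid`, `gamma_grid`, and `cv_error` are placeholders for this example, not part of the lecture:

```matlab
% grid search: try all D^N combinations of the chosen values
C_grid     = 10.^(-3:3);                 % D=7 values for hyperparameter C
gamma_grid = 10.^(-4:2);                 % D=7 values for hyperparameter gamma
best_err = inf;
for C = C_grid
  for gamma = gamma_grid
    err = cv_error(C, gamma);            % e.g., k-fold cross-validation error
    if err < best_err
      best_err = err; best = [C, gamma]; % keep the best combination seen so far
    end
  end
end
```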
Simple Search Strategy: Random Search
• Select configurations uniformly at random
– Completely uninformed
– Global search, won’t get stuck in a local region
– Better than grid search for low effective dimensionality:
21
Image source: Bergstra et al, Random Search for Hyperparameter Optimization, JMLR 2012
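For comparison, a random-search sketch over the same two placeholder hyperparameters; the evaluation budget is a fixed number of samples rather than a full grid (again, `cv_error` is a placeholder):

```matlab
% random search: sample each hyperparameter independently at random
budget = 50;
best_err = inf;
for i = 1:budget
  C     = 10^(-3 + 6*rand());            % log-uniform in [1e-3, 1e3]
  gamma = 10^(-4 + 6*rand());            % log-uniform in [1e-4, 1e2]
  err = cv_error(C, gamma);
  if err < best_err
    best_err = err; best = [C, gamma];
  end
end
```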
Further Benefits of Random Search
• Perfect parallelizability
– Simply start K runs in parallel on a compute cluster
• Fault tolerance
– In practice, some runs often die because of some problem:
• File system error
• Parameter combination not legal
• Code crashes
– In grid search, you need the entire grid
– In random search, a design with M < K runs is also valid
22
Disadvantages of Random Search
• Entirely uninformed
– Cannot follow an obvious gradient (e.g., bigger is better)
• Curse of dimensionality
– Example: only ½ of the values in each dimension are good
– Probability of randomly drawing a good configuration in N dimensions: 0.5^N
• In 1 dimension: 0.5
• In 2 dimensions: 0.25
• In 10 dimensions: < 0.001
• In 20 dimensions: < 0.000001
• Grid search has the same problems
– Random search is the better search method
– Grid search only gives better intuitions
23
Stochastic Local Search
• Balance intensification and diversification
– Intensification: gradient descent
– Diversification: restarts, random steps, perturbations, …
• Prominent general methods
– Tabu search [Glover, 1986]
– Simulated annealing [Kirkpatrick, Gelatt & Vecchi, 1983]
– Iterated local search [Lourenço, Martin & Stützle, 2003]
24
[e.g., Hoos and Stützle, 2005]
Population-based Methods
• Population of configurations
– Global + local search via population
– Maintain population fitness & diversity
• Examples
– Genetic algorithms [e.g., Barricelli, ’57, Goldberg, ’89]
– Evolution strategies [e.g., Beyer & Schwefel, ’02]
– Ant colony optimization [e.g., Dorigo & Stützle, ’04]
– Particle swarm optimization [e.g., Kennedy & Eberhart, ’95]
25
Bayesian Optimization
• Fit a (probabilistic) model of the function
• Use that model to trade off exploitation vs exploration
• Also known as sequential model-based optimization (SMBO)
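Schematically, the SMBO loop looks like this (a sketch with placeholder functions `initial_design`, `fit_model`, `maximize_acquisition`, and `evaluate_objective`; none of these are defined in the lecture):

```matlab
% sequential model-based optimization (SMBO) loop, schematic
D = initial_design();                        % a few (configuration, loss) pairs to start
for t = 1:budget
  model  = fit_model(D);                     % e.g., a Gaussian process over the loss
  x_next = maximize_acquisition(model, D);   % trade off exploration vs. exploitation
  y_next = evaluate_objective(x_next);       % expensive: train + validate the ML model
  D = [D; x_next, y_next];                   % add the new observation
end
```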
26
Bayesian Optimization
• Popular approach in statistics to minimize expensive blackbox functions [Mockus, '78]
– Efficient in the number of function evaluations
– Works when objective is nonconvex, noisy, has unknown derivatives, etc
• Recent progress in the machine learning literature: global convergence rates for continuous optimization [Srinivas et al, ICML 2010] [Bull, JMLR 2011] [Bubeck et al., JMLR 2011] [de Freitas, Smola, Zoghi, ICML 2012]
27
Estimation of Distribution (EDA)
• Also uses a probabilistic model
• Also uses that model to inform where to evaluate next
• But models promising configurations: P(x is “good”)
– In contrast to modeling the function: P(f|x)
28
Image source: Wikipedia
[e.g., Pelikan, Goldberg and Lobo, 2002]
Outline of Today’s Class
• Generalization to previously unseen data
• Overview of hyperparameter optimization methods
• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes
29
Reminder: Bayesian Optimization
30
Aside: why is it called “Bayesian” ?
• Often you have causal knowledge
– For example:
• P(symptom | disease)
• P(observed noisy function values | true function)
– This is the likelihood: P(evidence e | hypothesis h)
• … and you want to do evidential reasoning
– For example:
• P(disease | symptom)
• P(true function | observed noisy function values)
– This is the posterior: P(hypothesis h | evidence e)
• To compute this posterior, you also need
– the prior P(hypothesis h) and Bayes rule
31
Bayes rule (or Bayes’ rule)
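The formula itself is missing from the transcript; in the notation of the previous slide it reads:

$$P(h \mid e) = \frac{P(e \mid h)\, P(h)}{P(e)}$$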
32
Thomas Bayes, 1701-1761, English statistician and philosopher. Image source: Wikipedia
Bayes rule in Bayesian optimization
• Denote the observed data (the function evaluations gathered so far) by D
• Denote our prior over functions by p(f)
• Then the posterior over functions is:
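Written out (a reconstruction in standard notation; the exact symbols on the slide did not survive the transcript), with D = {(x₁, y₁), …, (xₙ, yₙ)}:

$$p(f \mid D) = \frac{p(D \mid f)\; p(f)}{p(D)} \;\propto\; p(D \mid f)\, p(f)$$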
33
posterior ∝ likelihood × prior
Two components of Bayesian optimization
• The probabilistic model
– Typically used: Gaussian process
– Today: Bayesian linear regression & Gaussian processes
– Next time: random forests
• The acquisition function
– Trades off exploration vs. exploitation
34
Bayesian linear regression & Gaussian processes
• Acknowledgement: The following slides are taken from Philipp Hennig’s tutorial on Gaussian processes at the Machine Learning Summer School 2013
• All of Philipp’s slides are online: http://mlss.tuebingen.mpg.de/hennig_slides1.pdf
• Philipp’s website also has video lectures and more slides: http://www.is.tuebingen.mpg.de/nc/employee/details/phennig.html
35
Carl Friedrich Gauss (1777–1855)
Paying Tolls with a Bell

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
The Gaussian distribution
Multivariate Form

$$\mathcal{N}(x;\mu,\Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right]$$

▸ x, µ ∈ R^N, Σ ∈ R^{N×N}
▸ Σ is positive semidefinite, i.e.
▸ v⊺Σv ≥ 0 for all v ∈ R^N
▸ Hermitian, all eigenvalues ≥ 0
Why Gaussian?
an experiment

▸ nothing in the real world is Gaussian (except sums of i.i.d. variables)
▸ but nothing in the real world is linear either!

Gaussians are for inference what linear maps are for algebra.
Closure Under Multiplication
multiple Gaussian factors form a Gaussian

$$\mathcal{N}(x;a,A)\,\mathcal{N}(x;b,B) = \mathcal{N}(x;c,C)\,\mathcal{N}(a;b,A+B)$$

$$C := (A^{-1} + B^{-1})^{-1} \qquad c := C\,(A^{-1}a + B^{-1}b)$$
Closure under Linear Maps
Linear Maps of Gaussians are Gaussians

$$p(z) = \mathcal{N}(z;\mu,\Sigma) \;\Rightarrow\; p(Az) = \mathcal{N}(Az;\, A\mu,\, A\Sigma A^\top)$$

Here: A = [1, −0.5]
Closure under Marginalization
projections of Gaussians are Gaussian

▸ projection with A = (1  0):

$$\int \mathcal{N}\!\left[\begin{pmatrix} x \\ y \end{pmatrix};\, \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix},\, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right] dy = \mathcal{N}(x;\, \mu_x,\, \Sigma_{xx})$$

▸ this is the sum rule: ∫ p(x, y) dy = ∫ p(y | x) p(x) dy = p(x)
▸ so every finite-dimensional Gaussian is a marginal of infinitely many more
Closure under Conditioning
cuts through Gaussians are Gaussians

$$p(x \mid y) = \frac{p(x,y)}{p(y)} = \mathcal{N}\!\left(x;\; \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\;\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)$$

▸ this is the product rule
▸ so Gaussians are closed under the rules of probability
Bayesian Inference
explaining away

$$p(x) = \mathcal{N}(x;\mu,\Sigma) = \mathcal{N}\!\left[\begin{pmatrix} x_1 \\ x_2 \end{pmatrix};\, \begin{pmatrix} 1 \\ 0.5 \end{pmatrix},\, \begin{pmatrix} 3^2 & 0 \\ 0 & 3^2 \end{pmatrix}\right]$$

$$p(y \mid x, \sigma) = \mathcal{N}(y;\, A^\top x,\, \sigma^2) = \mathcal{N}\!\left[\,6;\; (1 \;\; 0.6)\begin{pmatrix} x_1 \\ x_2 \end{pmatrix},\, \sigma^2\right]$$

$$p(x \mid \sigma^2, y) = \frac{p(x)\,p(y \mid x)}{p(y)} = \mathcal{N}\!\left(x;\; \mu + \Sigma A(A^\top\Sigma A + \sigma^2)^{-1}(y - A^\top\mu),\;\; \Sigma - \Sigma A(A^\top\Sigma A + \sigma^2)^{-1}A^\top\Sigma\right)$$

$$= \mathcal{N}\!\left[\begin{pmatrix} x_1 \\ x_2 \end{pmatrix};\, \begin{pmatrix} 3.9 \\ 2.3 \end{pmatrix},\, \begin{pmatrix} 3.4 & -3.4 \\ -3.4 & 7.0 \end{pmatrix}\right]$$
What can we do with this?
linear regression

Given y ∈ R^N and p(y | f), what is f?

[Figure: scatter plot of the observed data (x, y)]
A prior over linear functions

$$f(x) = w_1 + w_2 x = \phi_x^\top w \qquad p(w) = \mathcal{N}(w;\, \mu,\, \Sigma)$$

$$\phi_x = \begin{pmatrix} 1 \\ x \end{pmatrix} \qquad p(f) = \mathcal{N}(f;\, \phi_x^\top\mu,\, \phi_x^\top\Sigma\,\phi_x)$$
The posterior over linear functions

$$p(y \mid w, \phi_X) = \mathcal{N}(y;\, \phi_X^\top w,\, \sigma^2 I)$$

$$p(w \mid y, \phi_X) = \mathcal{N}\!\left(w;\; \mu + \Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\;\; \Sigma - \Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\right)$$
The posterior over linear functions

$$p(y \mid w, \phi_X) = \mathcal{N}(y;\, \phi_X^\top w,\, \sigma^2 I)$$

$$p(f_x \mid y, \phi_X) = \mathcal{N}\!\left(f_x;\; \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\;\; \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\phi_x\right)$$
```matlab
% prior on w
F = 2;                                % number of features
phi = @(a)(bsxfun(@power,a,0:F-1));   % φ(a) = [1; a]
mu = zeros(F,1); Sigma = eye(F);      % p(w) = N(µ, Σ)

% prior on f(x)
n = 100; x = linspace(-6,6,n)';       % 'test' points
phix = phi(x);                        % features of x
m = phix * mu;
kxx = phix * Sigma * phix';           % p(fx) = N(m, kxx)
s = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3));   % samples from prior
stdpi = sqrt(diag(kxx));              % marginal stddev, for plotting

load('data.mat'); N = length(Y);      % gives Y, X, sigma

% prior on Y = fX + ε
phiX = phi(X);                        % features of data
M = phiX * mu;
kXX = phiX * Sigma * phiX';           % p(fX) = N(M, kXX)
G = kXX + sigma^2 * eye(N);           % p(Y) = N(M, kXX + σ²I)
R = chol(G);                          % most expensive step: O(N³)
kxX = phix * Sigma * phiX';           % cov(fx, fX) = kxX
A = kxX / R;                          % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));         % p(fx | Y) = N(m + kxX (kXX + σ²I)⁻¹ (Y − M),
vpost = kxx - A * A';                 %              kxx − kxX (kXX + σ²I)⁻¹ kXx)
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3));  % samples
stdpo = sqrt(diag(vpost));            % marginal stddev, for plotting
```
A More Realistic Dataset
General Linear Regression

$$f(x) = \phi_x^\top w \;?$$

[Figure: scatter plot of the dataset (x, y)]
$$f(x) = w_1 + w_2 x = \phi_x^\top w \qquad \phi_x := \begin{pmatrix} 1 \\ x \end{pmatrix}$$
Cubic Regression

```matlab
phi = @(a)(bsxfun(@power,a,[0:3]));
```

$$f(x) = \phi(x)^\top w \qquad \phi(x) = (1 \;\; x \;\; x^2 \;\; x^3)^\top$$
Septic Regression?

```matlab
phi = @(a)(bsxfun(@power,a,[0:7]));
```

$$f(x) = \phi(x)^\top w \qquad \phi(x) = (1 \;\; x \;\; x^2 \;\cdots\; x^7)^\top$$
Fourier Regression

```matlab
phi = @(a)(2 * [cos(bsxfun(@times,a/8,[0:8])), sin(bsxfun(@times,a/8,[1:8]))]);
```

$$\phi(x) = (\cos(x) \;\; \cos(2x) \;\; \cos(3x) \;\ldots\; \sin(x) \;\; \sin(2x) \;\ldots)^\top$$
Step Regression

```matlab
phi = @(a)(-1 + 2 * bsxfun(@lt,a,linspace(-8,8,16)));
```

$$\phi(x) = -1 + 2\,(\theta(x-8) \;\; \theta(8-x) \;\; \theta(x-7) \;\; \theta(7-x) \;\ldots)^\top$$
V Regression

```matlab
phi = @(a)(bsxfun(@minus,abs(bsxfun(@minus,a,linspace(-8,8,16))),linspace(-8,8,16)));
```

$$\phi(x) = (\,|x-8|+8 \;\;\; |x-7|+7 \;\;\; |x-6|+6 \;\ldots)^\top$$
Eiffel Tower Regression

```matlab
phi = @(a)(exp(-abs(bsxfun(@minus,a,[-8:1:8]))));
```

$$\phi(x) = (e^{-|x-8|} \;\; e^{-|x-7|} \;\; e^{-|x-6|} \;\ldots)^\top$$
Bell Curve Regression

```matlab
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,[-8:1:8]).^2));
```

$$\phi(x) = (e^{-\frac{1}{2}(x-8)^2} \;\; e^{-\frac{1}{2}(x-7)^2} \;\; e^{-\frac{1}{2}(x-6)^2} \;\ldots)^\top$$
Multiple Inputs
all this works in multiple dimensions, too

φ : R^N → R,  f : R^N → R
How many features should we use?
Let's look at that algebra again.

$$p(f_x \mid y, \phi_X) = \mathcal{N}\!\left(f_x;\; \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\;\; \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\phi_x\right)$$

▸ there's no lonely φ in there
▸ all objects involving φ are of the form
▸ φ⊺µ: the mean function
▸ φ⊺Σφ: the kernel
▸ once these are known, the cost is independent of the number of features
▸ remember the code:

```matlab
M   = phiX * mu;            m   = phix * mu;
kXX = phiX * Sigma * phiX'; % p(fX) = N(M, kXX)
kxx = phix * Sigma * phix'; % p(fx) = N(m, kxx)
kxX = phix * Sigma * phiX'; % cov(fx, fX) = kxX
```
```matlab
% prior
F = 2;                                % number of features
phi = @(a)(bsxfun(@power,a,0:F-1));   % φ(a) = [1; a]
k  = @(a,b)(phi(a) * phi(b)');        % kernel
mu = @(a)(zeros(size(a,1),1));        % mean function

% belief on f(x)
n = 100; x = linspace(-6,6,n)';       % 'test' points
m = mu(x);
kxx = k(x,x);                         % p(fx) = N(m, kxx)
s = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3));   % samples from prior
stdpi = sqrt(diag(kxx));              % marginal stddev, for plotting

load('data.mat'); N = length(Y);      % gives Y, X, sigma

% prior on Y = fX + ε
M = mu(X);
kXX = k(X,X);                         % p(fX) = N(M, kXX)
G = kXX + sigma^2 * eye(N);           % p(Y) = N(M, kXX + σ²I)
R = chol(G);                          % most expensive step: O(N³)
kxX = k(x,X);                         % cov(fx, fX) = kxX
A = kxX / R;                          % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));         % p(fx | Y) = N(m + kxX (kXX + σ²I)⁻¹ (Y − M),
vpost = kxx - A * A';                 %              kxx − kxX (kXX + σ²I)⁻¹ kXx)
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3));  % samples
stdpo = sqrt(diag(vpost));            % marginal stddev, for plotting
```
Exponentiated Squares

```matlab
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,10)).^2 ./ell.^2));
```

▸ aka radial basis function, square(d)-exponential kernel
Exponentiated Squares

```matlab
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,30)).^2 ./ell.^2));
```

▸ aka radial basis function, square(d)-exponential kernel
Exponentiated Squares

```matlab
k = @(a,b)(5*exp(-0.25*bsxfun(@minus,a,b').^2));
```

▸ aka radial basis function, square(d)-exponential kernel
What just happened?
Kernelization to infinitely many features

Definition. A function k : X × X → R is a Mercer kernel if, for any finite collection X = [x₁, …, x_N], the matrix k_XX ∈ R^{N×N} with elements k_XX,(i,j) = k(x_i, x_j) is positive semidefinite.

Lemma. Any kernel that can be written as

$$k(x, x') = \int \phi_\ell(x)\,\phi_\ell(x')\, d\ell$$

is a Mercer kernel (assuming the integral, or sum, is over a positive set).

Proof: For all X ∈ X^N, v ∈ R^N:

$$v^\top k_{XX}\, v = \int \sum_i^N v_i\,\phi_\ell(x_i) \sum_j^N v_j\,\phi_\ell(x_j)\, d\ell = \int \Big[\sum_i v_i\,\phi_\ell(x_i)\Big]^2 d\ell \;\ge\; 0 \qquad \square$$
What just happened?
Gaussian process priors

Definition. A function k : X × X → R is a Mercer kernel if, for any finite collection X = [x₁, …, x_N], the matrix k_XX ∈ R^{N×N} with elements k_XX,(i,j) = k(x_i, x_j) is positive semidefinite.

Definition. Let µ : X → R be any function and k : X × X → R be a Mercer kernel. A Gaussian process p(f) = GP(f; µ, k) is a probability distribution over the function f : X → R such that every finite restriction to function values f_X := [f_{x₁}, …, f_{x_N}] is a Gaussian distribution p(f_X) = N(f_X; µ_X, k_XX).
The predictive posterior distribution
The posterior Gaussian process has a Gaussian predictive distribution at every test point x, with mean and variance given below.
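The formulas themselves are missing from the transcript; in the notation of the preceding slides (prior mean µ, kernel k, training inputs X with observed values y, test point x), the standard noise-free form is:

$$p(f_x \mid y) = \mathcal{N}\big(f_x;\; \mu_x + k_{xX}\,k_{XX}^{-1}(y - \mu_X),\;\; k_{xx} - k_{xX}\,k_{XX}^{-1}\,k_{Xx}\big)$$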
36
The predictive posterior under noise
Under observation noise, the posterior Gaussian process again has a Gaussian predictive distribution, with mean and variance given below.
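With observation noise y = f_X + ε, ε ~ N(0, σ²I), the same expressions hold with k_XX replaced by k_XX + σ²I (again a standard reconstruction, matching the code comments earlier in the deck):

$$p(f_x \mid y) = \mathcal{N}\big(f_x;\; \mu_x + k_{xX}(k_{XX} + \sigma^2 I)^{-1}(y - \mu_X),\;\; k_{xx} - k_{xX}(k_{XX} + \sigma^2 I)^{-1}k_{Xx}\big)$$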
37
Computational complexity of GPs
• Let t denote the number of data points in the GP
• Inverting the kernel matrix: O(t³)
• Predictions of the variance: O(t²)
• Predictions of the mean: O(t)
38
Two components of Bayesian optimization
• The probabilistic model
– Typically used: Gaussian process
– Later: other models are possible, e.g., random forests
• The acquisition function
– Trades off exploration vs. exploitation
– We’ll discuss this in detail
39
Probability of Improvement
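The formula on this slide is not in the transcript; for minimizing a function with GP posterior mean µ(x), standard deviation σ(x), and best observed value f_min, probability of improvement is usually defined as:

$$\mathrm{PI}(x) = P\big(f(x) < f_{\min}\big) = \Phi\!\left(\frac{f_{\min} - \mu(x)}{\sigma(x)}\right)$$

where Φ is the standard normal CDF.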
40
Expected Improvement
41
(the derivation of this integral’s closed-form solution will be an exercise)
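The formula is again missing from the transcript; with the same notation as above, expected improvement is the expected amount by which f(x) improves on the best observed value (its closed-form solution is the exercise mentioned above):

$$\mathrm{EI}(x) = \mathbb{E}\big[\max\{0,\, f_{\min} - f(x)\}\big] = \int_{-\infty}^{f_{\min}} (f_{\min} - f)\,\mathcal{N}\big(f;\, \mu(x),\, \sigma^2(x)\big)\, df$$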
Upper Confidence Bound (UCB)
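The formula is missing from the transcript; the name comes from the maximization setting, where one picks the point with the highest optimistic bound. For minimizing a loss, the analogous (lower) confidence bound with trade-off parameter κ is:

$$u(x) = \mu(x) - \kappa\,\sigma(x)$$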
42
Entropy Search
• Compute a probability distribution over which configuration is optimal
• Acquisition function: try to push this probability distribution as close to a delta distribution as possible
• One of the most powerful acquisition functions
– Can choose to actively evaluate in one region of the space to learn something about a different region of the space
43
Putting it all Together
• How to optimize the acquisition function?
– Subsidiary optimization method
– Important: in that subsidiary optimization, function evaluations are cheap (just evaluations of the GP).
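For instance (a sketch; `gp_predict` and `acquisition` are hypothetical helpers standing in for the GP prediction and whichever acquisition criterion is used, and `lower`, `upper`, `dim`, `model`, `f_min` are assumed to be defined):

```matlab
% subsidiary optimization of the acquisition function (cheap: only GP predictions)
n_cand = 10000;
cands = bsxfun(@plus, lower, bsxfun(@times, upper - lower, rand(n_cand, dim)));
[mu_c, var_c] = gp_predict(model, cands);   % GP posterior mean and variance at the candidates
a = acquisition(mu_c, sqrt(var_c), f_min);  % e.g., expected improvement w.r.t. best value f_min
[~, i] = max(a);
x_next = cands(i, :);                       % optionally refine with a local search from here
```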
44
Summary of Bayesian Optimization
• Bayesian optimization integrates
– prior information and
– the likelihood of the observed data
• It uses quite involved computation to select which function value to evaluate next
– Thus, it’s most useful for expensive blackbox functions
45
Overall summary
• Generalization: we need to safeguard against over-fitting
• Overview of hyperparameter optimization methods
• Bayesian optimization
– Based on Bayesian linear regression & Gaussian processes
• Next week:
– Bayesian optimization with random forests
– Extensions and applications
46