Optimization of Machine Learning Hyperparameters
Dr. Frank Hutter
Head of Emmy Noether Research Group on Learning, Optimization, and Automated Algorithm Design
Computer Science Institute University of Freiburg, Germany
July 2014
Motivation
• The machine learning algorithms you have learned about had several degrees of freedom
– E.g., in neural networks: regularization, momentum, learning rate, number of layers, number of units, …
• So far, how have you been setting these in practice?
– Changing one parameter at a time
– Grid search
• Was this tedious? Time-consuming?
– Imagine you have millions of data points and each evaluation takes hours or days…
2
High-level Learning Goals
• After this module, you can …
– Effectively use modern hyperparameter optimization methods
– Explain the concept of over-fitting
– Describe what measures can be taken to avoid over-fitting
– Describe the core mechanisms of several types of hyperparameter optimization methods
– Reason about the pros and cons of using a particular hyperparameter optimization method for a particular problem
– Derive the mechanisms behind Bayesian optimization
3
Outline of Today’s Class
• Generalization to previously unseen data
• Overview of hyperparameter optimization methods
• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes
4
Learning and Generalization
• Much of supervised machine learning is about selecting a model from a given hypothesis space that
– Explains the seen data well
– Is likely to also work well for new data
• Example: Which model will describe new data better? The polynomial or the line?
5
Image source: Wikipedia
Occam’s razor (or Ockham’s razor)
“Numquam ponenda est pluralitas sine necessitate”
[Plurality must never be posited without necessity.]
• General problem solving principle
– In the absence of evidence to the contrary, prefer the simplest explanation.
– Adapted to machine learning: all things being equal prefer the simplest model.
6
William of Ockham, 1287–1347, philosopher and theologian.
Image source: Wikipedia
Occam’s razor in practice
• We need to trade off model complexity and model fit
• Model fit
– E.g., likelihood of the data under the model: P(data|model)
– In general: some loss of the predictor on the training data
• Model complexity
– E.g., number of free parameters
– E.g., number of effective dimensions
– E.g., VC dimension [Vapnik–Chervonenkis, 1971]
• Use regularization to penalize complex models: minimize training loss + C * regularization cost
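In symbols (generic notation, not tied to any one model), the objective from the last bullet reads:

$$\min_{w}\;\; \sum_{i=1}^{n} L\big(y_i, f_w(x_i)\big) \;+\; C \cdot \Omega(w)$$

where L is the training loss, Ω(w) the complexity penalty (e.g., ‖w‖²), and C the regularization strength, which is itself a hyperparameter (see the next slide).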
7
Parameters vs. Hyperparameters
• Most machine learning algorithms optimize parameters under the hood
– E.g., weights in linear regression and neural networks
– E.g., deep learning: millions of parameters
• Standard approach: minimize training loss + C * regularization cost
– Using standard gradient-based optimizers
• Hyperparameters: decisions left to algorithm designer
– How complex a model to use?
– How to set C?
– How many layers/which structure of deep networks to use?
8
How to set the hyperparameters?
• We wish to achieve good generalization performance
• In practice, we need to try several values and empirically evaluate how well they generalize
– Train the model for a given hyperparameter setting
– Evaluate the model’s generalization performance
• Which data set should we use to evaluate the model’s generalization performance?
1. The same data set that we use all the time: all the data we have
2. We split the data we have available: use one part for training the model, another disjoint part for evaluating generalization
9
Interactive question
• Which data set should we use to evaluate a model’s generalization performance empirically?
– We split the data we have available: use one part for training the model, another disjoint part for evaluating performance
• Why?
– The assumption we make is that future data will come from the same "true" distribution as our current data.
– Then, using an unseen sample of that distribution gives us an unbiased estimate of generalization to future data
– If our assumption is false, then we must control for concept drift … a topic for another lecture ;-)
10
Overfitting & early stopping heuristic
• Too little data / too little regularization:
– The error on the training data keeps on decreasing
– After too much training, the error on separate validation data starts to increase
• Early stopping heuristic: stop training at that point
[Figure: training error and validation error as a function of training time]
11
Image source: Wikipedia
Generalization of performance
• The dark ages
– Student tweaks hyperparameters until it works
– Supervisor may not even know about the tuning
– Results get published without acknowledging the tuning
– Of course, the approach does not generalize
• A step further
– Optimize parameters on a training set
– Evaluate generalization on a test set
• Another step further: avoid “peeking” at the test set
– Put test set into a vault (i.e., never look at it)
– Split training set again into training and validation set
– Only use the test set in the end to generate results for publication
12
Cross-validation for model selection
• Problem: single split of training data into training/validation might not be representative
• Standard solution: average performance across k cross-validation folds (here: k=3)
13
[Diagram: k=3 cross-validation folds; in each fold a different third of the data serves as the validation set and the rest as training data]
Cross-validation for model selection
• Standard model selection using cross-validation (CV):
• A denotes a learning algorithm under consideration
• We apply A to the training portion D_train^(i) of fold i and evaluate the resulting model on the disjoint validation portion D_valid^(i)
• We call the resulting loss L(A, D_train^(i), D_valid^(i))
• We average these losses over the k cross-validation folds and pick the best-performing learning algorithm
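As a concrete sketch in the style of the code slides later in this lecture, k-fold CV for a set of candidate learning algorithms might look as follows. Here `train_fn` (a cell array of training function handles), `loss_fn`, `X`, and `y` are placeholders for this example, not objects defined anywhere in the lecture:

```matlab
% k-fold cross-validation for model selection (illustrative sketch)
k = 5;                                     % number of folds
N = size(X,1);
fold_id = mod(randperm(N), k) + 1;         % random fold assignment 1..k for each data point
cvloss = zeros(numel(train_fn), 1);        % average CV loss per candidate algorithm
for a = 1:numel(train_fn)                  % loop over candidate learning algorithms
  for fold = 1:k
    tr = (fold_id ~= fold); va = (fold_id == fold);
    model = train_fn{a}(X(tr,:), y(tr));                            % train on k-1 folds
    cvloss(a) = cvloss(a) + loss_fn(model, X(va,:), y(va)) / k;     % validate on held-out fold
  end
end
[~, best] = min(cvloss);                   % pick the algorithm with the lowest average CV loss
```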
14
Cross-validation for further tasks
• Standard model selection using cross-validation (CV):
• Standard hyperparameter optimization using CV:
• Combination of the two:
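The formulas on this slide did not survive the transcript; in generic notation (assumed here, not taken verbatim from the slide), the three objectives are roughly:

$$A^* \in \arg\min_{A \in \mathcal{A}} \; \frac{1}{k}\sum_{i=1}^{k} L\big(A,\, D_{\text{train}}^{(i)},\, D_{\text{valid}}^{(i)}\big)$$

$$\lambda^* \in \arg\min_{\lambda \in \Lambda} \; \frac{1}{k}\sum_{i=1}^{k} L\big(A_\lambda,\, D_{\text{train}}^{(i)},\, D_{\text{valid}}^{(i)}\big)$$

$$(A^*, \lambda^*) \in \arg\min_{A \in \mathcal{A},\, \lambda \in \Lambda_A} \; \frac{1}{k}\sum_{i=1}^{k} L\big(A_\lambda,\, D_{\text{train}}^{(i)},\, D_{\text{valid}}^{(i)}\big)$$

where L(A, D_train, D_valid) denotes the validation loss of the model that A produces when trained on D_train.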
15
Cross-validation Details
• How to choose the number of folds k?
– Too low: noisy approximations of generalization
→ poor generalization to test instances
– Too high: evaluating a configuration is expensive
→ optimization process is slow. Also, performance in the folds is not independent, so increasing k does not always improve generalization
• Theory is lacking
• In practice, typically choose k=5 or k=10 [Kohavi, 1995]
• Practical speedup trick [Hutter, Hoos & Leyton-Brown, 2011]
– We do not need to evaluate all folds for each configuration
– Example: the best configuration so far has an average CV error of 0.1 over 5 folds (i.e., a total of 0.5); a new configuration has error 0.6 in the first fold alone → it can already be discarded without evaluating the remaining folds
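One simple way to realize this trick (a sketch assuming non-negative fold losses; the actual procedure in the cited paper is more refined, and `eval_fold` is a hypothetical helper that trains and validates on one fold):

```matlab
% fold-by-fold evaluation of a new configuration, with early rejection
incumbent_avg = 0.1;                         % best average CV error found so far
total = 0;
for fold = 1:k
  total = total + eval_fold(config, fold);   % train + validate on this fold
  if total > incumbent_avg * k               % can no longer beat the incumbent's k-fold average
    break;                                   % e.g., 0.6 in the first fold already exceeds 0.1*5
  end
end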
16
Outline of Today’s Class
• Generalization to previously unseen data
• Overview of hyperparameter optimization methods
• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes
17
Manual Search
Start with some configuration
repeat
Modify a single parameter
if performance on a benchmark set degrades then
undo modification
until no more improvement possible (or "good enough")
(manually-executed hill climbing)
18
Aka “Optimization by Graduate Student”
Pros and cons of manual search
• Pros
– Student gains some intuition → helps understanding
– Student can notice irregularities, e.g.
• A configuration is worse than expected → find bugs
• E.g., aliasing in filters learned by a convolutional network [Zeiler & Fergus, 2013]
• A run dies because of temporary file system errors → repeat the run
• Cons
– “Blind” search: inefficient use of student’s time
– Sometimes “false intuition”: e.g., based on a different dataset and a different architecture a year ago
19
Simple Search Strategy: Grid Search
20
Image source: Bergstra et al, Random Search for Hyperparameter Optimization, JMLR 2012
• Select D values for each of N hyperparameters, try all D^N combinations
• Direct feedback:
– Which values work/don’t work for each setting
– Which parameters are important? Are there interactions?
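A minimal sketch of grid search for two hyperparameters; the names `C_grid`, `gamma_grid`, and `cv_error` are placeholders for this example, not part of the lecture:

```matlab
% grid search: try all D^N combinations of the chosen values
C_grid     = 10.^(-3:3);                 % D=7 values for hyperparameter C
gamma_grid = 10.^(-4:2);                 % D=7 values for hyperparameter gamma
best_err = inf;
for C = C_grid
  for gamma = gamma_grid
    err = cv_error(C, gamma);            % e.g., k-fold cross-validation error
    if err < best_err
      best_err = err; best = [C, gamma]; % keep the best combination seen so far
    end
  end
end
```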
Simple Search Strategy: Random Search
• Select configurations uniformly at random
– Completely uninformed
– Global search, won’t get stuck in a local region
– Better than grid search for low effective dimensionality:
21
Image source: Bergstra et al, Random Search for Hyperparameter Optimization, JMLR 2012
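For comparison, a random-search sketch over the same two placeholder hyperparameters; the evaluation budget is a fixed number of samples rather than a full grid (again, `cv_error` is a placeholder):

```matlab
% random search: sample each hyperparameter independently at random
budget = 50;
best_err = inf;
for i = 1:budget
  C     = 10^(-3 + 6*rand());            % log-uniform in [1e-3, 1e3]
  gamma = 10^(-4 + 6*rand());            % log-uniform in [1e-4, 1e2]
  err = cv_error(C, gamma);
  if err < best_err
    best_err = err; best = [C, gamma];
  end
end
```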
Further Benefits of Random Search
• Perfect parallelizability
– Simply start K runs in parallel on a compute cluster
• Fault tolerance
– In practice, some runs often die because of some problem:
• File system error
• Parameter combination not legal
• Code crashes
– In grid search, you need the entire grid
– In random search, a design with M < K runs is also valid
22
Disadvantages of Random Search
• Entirely uninformed
– Cannot follow an obvious gradient (e.g., bigger is better)
• Curse of dimensionality
– Example: only ½ of the values in each dimension are good
– Probability of randomly drawing a good configuration in N dimensions: 0.5^N
• In 1 dimension: 0.5
• In 2 dimensions: 0.25
• In 10 dimensions: < 0.001
• In 20 dimensions: < 0.000001
• Grid search has the same problems
– Random search is the better search method
– Grid search only gives better intuitions
23
Stochastic Local Search
• Balance intensification and diversification
– Intensification: gradient descent
– Diversification: restarts, random steps, perturbations, …
• Prominent general methods
– Tabu search [Glover, 1986]
– Simulated annealing [Kirkpatrick, Gelatt & Vecchi, 1983]
– Iterated local search [Lourenço, Martin & Stützle, 2003]
24
[e.g., Hoos and Stützle, 2005]
Population-based Methods
• Population of configurations
– Global + local search via population
– Maintain population fitness & diversity
• Examples
– Genetic algorithms [e.g., Barricelli, ’57, Goldberg, ’89]
– Evolution strategies [e.g., Beyer & Schwefel, ’02]
– Ant colony optimization [e.g., Dorigo & Stützle, ’04]
– Particle swarm optimization [e.g., Kennedy & Eberhart, ’95]
25
Bayesian Optimization
• Fit a (probabilistic) model of the function
• Use that model to trade off exploitation vs exploration
• Also known as sequential model-based optimization (SMBO)
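Schematically, the SMBO loop looks like this (a sketch with placeholder functions `initial_design`, `fit_model`, `maximize_acquisition`, and `evaluate_objective`; none of these are defined in the lecture):

```matlab
% sequential model-based optimization (SMBO) loop, schematic
D = initial_design();                        % a few (configuration, loss) pairs to start
for t = 1:budget
  model  = fit_model(D);                     % e.g., a Gaussian process over the loss
  x_next = maximize_acquisition(model, D);   % trade off exploration vs. exploitation
  y_next = evaluate_objective(x_next);       % expensive: train + validate the ML model
  D = [D; x_next, y_next];                   % add the new observation
end
```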
26
Bayesian Optimization
• Popular approach in statistics to minimize expensive blackbox functions [Mockus, '78]
– Efficient in the number of function evaluations
– Works when objective is nonconvex, noisy, has unknown derivatives, etc
• Recent progress in the machine learning literature: global convergence rates for continuous optimization [Srinivas et al, ICML 2010] [Bull, JMLR 2011] [Bubeck et al., JMLR 2011] [de Freitas, Smola, Zoghi, ICML 2012]
27
Estimation of Distribution (EDA)
• Also uses a probabilistic model
• Also uses that model to inform where to evaluate next
• But models promising configurations: P(x is “good”)
– In contrast to modeling the function: P(f|x)
28
Image source: Wikipedia
[e.g., Pelikan, Goldberg and Lobo, 2002]
Outline of Today’s Class
• Generalization to previously unseen data
• Overview of hyperparameter optimization methods
• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes
29
Reminder: Bayesian Optimization
30
Aside: why is it called “Bayesian” ?
• Often you have causal knowledge
– For example:
• P(symptom | disease)
• P(observed noisy function values | true function)
– This is the likelihood: P(evidence e | hypothesis h)
• … and you want to do evidential reasoning
– For example:
• P(disease | symptom)
• P(true function | observed noisy function values)
– This is the posterior: P(hypothesis h | evidence e)
• To compute this posterior, you also need
– the prior P(hypothesis h) and Bayes rule
31
Bayes rule (or Bayes’ rule)
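The formula itself is missing from the transcript; in the notation of the previous slide it reads:

$$P(h \mid e) = \frac{P(e \mid h)\, P(h)}{P(e)}$$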
32
Thomas Bayes, 1701-1761, English statistician and philosopher. Image source: Wikipedia
Bayes rule in Bayesian optimization
• Denote the observed data (the function evaluations gathered so far) by D
• Denote our prior over functions by p(f)
• Then the posterior over functions is:
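Written out (a reconstruction in standard notation; the exact symbols on the slide did not survive the transcript), with D = {(x₁, y₁), …, (xₙ, yₙ)}:

$$p(f \mid D) = \frac{p(D \mid f)\; p(f)}{p(D)} \;\propto\; p(D \mid f)\, p(f)$$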
33
posterior ∝ likelihood × prior
Two components of Bayesian optimization
• The probabilistic model
– Typically used: Gaussian process
– Today: Bayesian linear regression & Gaussian processes
– Next time: random forests
• The acquisition function
– Trades off exploration vs. exploitation
34
Bayesian linear regression & Gaussian processes
• Acknowledgement: The following slides are taken from Philipp Hennig’s tutorial on Gaussian processes at the Machine Learning Summer School 2013
• All of Philipp’s slides are online: http://mlss.tuebingen.mpg.de/hennig_slides1.pdf
• Philipp’s website also has video lectures and more slides: http://www.is.tuebingen.mpg.de/nc/employee/details/phennig.html
35
Carl Friedrich Gauss (1777–1855)
Paying Tolls with a Bell

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
The Gaussian distribution
Multivariate Form

$$\mathcal{N}(x;\mu,\Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right]$$

▸ x, µ ∈ R^N, Σ ∈ R^{N×N}
▸ Σ is positive semidefinite, i.e.
▸ v⊺Σv ≥ 0 for all v ∈ R^N
▸ Hermitian, all eigenvalues ≥ 0
Why Gaussian?
an experiment

▸ nothing in the real world is Gaussian (except sums of i.i.d. variables)
▸ but nothing in the real world is linear either!

Gaussians are for inference what linear maps are for algebra.
Closure Under Multiplication
multiple Gaussian factors form a Gaussian

$$\mathcal{N}(x;a,A)\,\mathcal{N}(x;b,B) = \mathcal{N}(x;c,C)\,\mathcal{N}(a;b,A+B)$$

$$C := (A^{-1} + B^{-1})^{-1} \qquad c := C\,(A^{-1}a + B^{-1}b)$$
Closure under Linear Maps
Linear Maps of Gaussians are Gaussians

$$p(z) = \mathcal{N}(z;\mu,\Sigma) \;\Rightarrow\; p(Az) = \mathcal{N}(Az;\, A\mu,\, A\Sigma A^\top)$$

Here: A = [1, −0.5]
Closure under Marginalization
projections of Gaussians are Gaussian

▸ projection with A = (1  0):

$$\int \mathcal{N}\!\left[\begin{pmatrix} x \\ y \end{pmatrix};\, \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix},\, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right] dy = \mathcal{N}(x;\, \mu_x,\, \Sigma_{xx})$$

▸ this is the sum rule: ∫ p(x, y) dy = ∫ p(y | x) p(x) dy = p(x)
▸ so every finite-dimensional Gaussian is a marginal of infinitely many more
Closure under Conditioning
cuts through Gaussians are Gaussians

$$p(x \mid y) = \frac{p(x,y)}{p(y)} = \mathcal{N}\!\left(x;\; \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\;\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)$$

▸ this is the product rule
▸ so Gaussians are closed under the rules of probability
Bayesian Inference
explaining away

$$p(x) = \mathcal{N}(x;\mu,\Sigma) = \mathcal{N}\!\left[\begin{pmatrix} x_1 \\ x_2 \end{pmatrix};\, \begin{pmatrix} 1 \\ 0.5 \end{pmatrix},\, \begin{pmatrix} 3^2 & 0 \\ 0 & 3^2 \end{pmatrix}\right]$$

$$p(y \mid x, \sigma) = \mathcal{N}(y;\, A^\top x,\, \sigma^2) = \mathcal{N}\!\left[\,6;\; (1 \;\; 0.6)\begin{pmatrix} x_1 \\ x_2 \end{pmatrix},\, \sigma^2\right]$$

$$p(x \mid \sigma^2, y) = \frac{p(x)\,p(y \mid x)}{p(y)} = \mathcal{N}\!\left(x;\; \mu + \Sigma A(A^\top\Sigma A + \sigma^2)^{-1}(y - A^\top\mu),\;\; \Sigma - \Sigma A(A^\top\Sigma A + \sigma^2)^{-1}A^\top\Sigma\right)$$

$$= \mathcal{N}\!\left[\begin{pmatrix} x_1 \\ x_2 \end{pmatrix};\, \begin{pmatrix} 3.9 \\ 2.3 \end{pmatrix},\, \begin{pmatrix} 3.4 & -3.4 \\ -3.4 & 7.0 \end{pmatrix}\right]$$
What can we do with this?
linear regression

Given y ∈ R^N and p(y | f), what is f?

[Figure: scatter plot of the observed data (x, y)]
A prior over linear functions

$$f(x) = w_1 + w_2 x = \phi_x^\top w \qquad p(w) = \mathcal{N}(w;\, \mu,\, \Sigma)$$

$$\phi_x = \begin{pmatrix} 1 \\ x \end{pmatrix} \qquad p(f) = \mathcal{N}(f;\, \phi_x^\top\mu,\, \phi_x^\top\Sigma\,\phi_x)$$
The posterior over linear functions

$$p(y \mid w, \phi_X) = \mathcal{N}(y;\, \phi_X^\top w,\, \sigma^2 I)$$

$$p(w \mid y, \phi_X) = \mathcal{N}\!\left(w;\; \mu + \Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\;\; \Sigma - \Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\right)$$
The posterior over linear functions

$$p(y \mid w, \phi_X) = \mathcal{N}(y;\, \phi_X^\top w,\, \sigma^2 I)$$

$$p(f_x \mid y, \phi_X) = \mathcal{N}\!\left(f_x;\; \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\;\; \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\phi_x\right)$$
```matlab
% prior on w
F = 2;                                % number of features
phi = @(a)(bsxfun(@power,a,0:F-1));   % φ(a) = [1; a]
mu = zeros(F,1); Sigma = eye(F);      % p(w) = N(µ, Σ)

% prior on f(x)
n = 100; x = linspace(-6,6,n)';       % 'test' points
phix = phi(x);                        % features of x
m = phix * mu;
kxx = phix * Sigma * phix';           % p(fx) = N(m, kxx)
s = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3));   % samples from prior
stdpi = sqrt(diag(kxx));              % marginal stddev, for plotting

load('data.mat'); N = length(Y);      % gives Y, X, sigma

% prior on Y = fX + ε
phiX = phi(X);                        % features of data
M = phiX * mu;
kXX = phiX * Sigma * phiX';           % p(fX) = N(M, kXX)
G = kXX + sigma^2 * eye(N);           % p(Y) = N(M, kXX + σ²I)
R = chol(G);                          % most expensive step: O(N³)
kxX = phix * Sigma * phiX';           % cov(fx, fX) = kxX
A = kxX / R;                          % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));         % p(fx | Y) = N(m + kxX (kXX + σ²I)⁻¹ (Y − M),
vpost = kxx - A * A';                 %              kxx − kxX (kXX + σ²I)⁻¹ kXx)
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3));  % samples
stdpo = sqrt(diag(vpost));            % marginal stddev, for plotting
```
A More Realistic Dataset
General Linear Regression

$$f(x) = \phi_x^\top w \;?$$

[Figure: scatter plot of the dataset (x, y)]
$$f(x) = w_1 + w_2 x = \phi_x^\top w \qquad \phi_x := \begin{pmatrix} 1 \\ x \end{pmatrix}$$
Cubic Regression

```matlab
phi = @(a)(bsxfun(@power,a,[0:3]));
```

$$f(x) = \phi(x)^\top w \qquad \phi(x) = (1 \;\; x \;\; x^2 \;\; x^3)^\top$$
Septic Regression?

```matlab
phi = @(a)(bsxfun(@power,a,[0:7]));
```

$$f(x) = \phi(x)^\top w \qquad \phi(x) = (1 \;\; x \;\; x^2 \;\cdots\; x^7)^\top$$
Fourier Regression

```matlab
phi = @(a)(2 * [cos(bsxfun(@times,a/8,[0:8])), sin(bsxfun(@times,a/8,[1:8]))]);
```

$$\phi(x) = (\cos(x) \;\; \cos(2x) \;\; \cos(3x) \;\ldots\; \sin(x) \;\; \sin(2x) \;\ldots)^\top$$
Step Regression

```matlab
phi = @(a)(-1 + 2 * bsxfun(@lt,a,linspace(-8,8,16)));
```

$$\phi(x) = -1 + 2\,(\theta(x-8) \;\; \theta(8-x) \;\; \theta(x-7) \;\; \theta(7-x) \;\ldots)^\top$$
V Regression

```matlab
phi = @(a)(bsxfun(@minus,abs(bsxfun(@minus,a,linspace(-8,8,16))),linspace(-8,8,16)));
```

$$\phi(x) = (\,|x-8|+8 \;\;\; |x-7|+7 \;\;\; |x-6|+6 \;\ldots)^\top$$
Eiffel Tower Regression

```matlab
phi = @(a)(exp(-abs(bsxfun(@minus,a,[-8:1:8]))));
```

$$\phi(x) = (e^{-|x-8|} \;\; e^{-|x-7|} \;\; e^{-|x-6|} \;\ldots)^\top$$
Bell Curve Regression

```matlab
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,[-8:1:8]).^2));
```

$$\phi(x) = (e^{-\frac{1}{2}(x-8)^2} \;\; e^{-\frac{1}{2}(x-7)^2} \;\; e^{-\frac{1}{2}(x-6)^2} \;\ldots)^\top$$
Multiple Inputs
all this works in multiple dimensions, too

φ : R^N → R,  f : R^N → R
How many features should we use?
Let's look at that algebra again.

$$p(f_x \mid y, \phi_X) = \mathcal{N}\!\left(f_x;\; \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\;\; \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\phi_x\right)$$

▸ there's no lonely φ in there
▸ all objects involving φ are of the form
▸ φ⊺µ: the mean function
▸ φ⊺Σφ: the kernel
▸ once these are known, the cost is independent of the number of features
▸ remember the code:

```matlab
M   = phiX * mu;            m   = phix * mu;
kXX = phiX * Sigma * phiX'; % p(fX) = N(M, kXX)
kxx = phix * Sigma * phix'; % p(fx) = N(m, kxx)
kxX = phix * Sigma * phiX'; % cov(fx, fX) = kxX
```
```matlab
% prior
F = 2;                                % number of features
phi = @(a)(bsxfun(@power,a,0:F-1));   % φ(a) = [1; a]
k  = @(a,b)(phi(a) * phi(b)');        % kernel
mu = @(a)(zeros(size(a,1),1));        % mean function

% belief on f(x)
n = 100; x = linspace(-6,6,n)';       % 'test' points
m = mu(x);
kxx = k(x,x);                         % p(fx) = N(m, kxx)
s = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3));   % samples from prior
stdpi = sqrt(diag(kxx));              % marginal stddev, for plotting

load('data.mat'); N = length(Y);      % gives Y, X, sigma

% prior on Y = fX + ε
M = mu(X);
kXX = k(X,X);                         % p(fX) = N(M, kXX)
G = kXX + sigma^2 * eye(N);           % p(Y) = N(M, kXX + σ²I)
R = chol(G);                          % most expensive step: O(N³)
kxX = k(x,X);                         % cov(fx, fX) = kxX
A = kxX / R;                          % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));         % p(fx | Y) = N(m + kxX (kXX + σ²I)⁻¹ (Y − M),
vpost = kxx - A * A';                 %              kxx − kxX (kXX + σ²I)⁻¹ kXx)
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3));  % samples
stdpo = sqrt(diag(vpost));            % marginal stddev, for plotting
```
Exponentiated Squares

```matlab
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,10)).^2 ./ell.^2));
```

▸ aka radial basis function, square(d)-exponential kernel
Exponentiated Squares

```matlab
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,30)).^2 ./ell.^2));
```

▸ aka radial basis function, square(d)-exponential kernel
Exponentiated Squares

```matlab
k = @(a,b)(5*exp(-0.25*bsxfun(@minus,a,b').^2));
```

▸ aka radial basis function, square(d)-exponential kernel
What just happened?
Kernelization to infinitely many features

Definition. A function k : X × X → R is a Mercer kernel if, for any finite collection X = [x₁, …, x_N], the matrix k_XX ∈ R^{N×N} with elements k_XX,(i,j) = k(x_i, x_j) is positive semidefinite.

Lemma. Any kernel that can be written as

$$k(x, x') = \int \phi_\ell(x)\,\phi_\ell(x')\, d\ell$$

is a Mercer kernel (assuming the integral, or sum, is over a positive set).

Proof: For all X ∈ X^N, v ∈ R^N:

$$v^\top k_{XX}\, v = \int \sum_i^N v_i\,\phi_\ell(x_i) \sum_j^N v_j\,\phi_\ell(x_j)\, d\ell = \int \Big[\sum_i v_i\,\phi_\ell(x_i)\Big]^2 d\ell \;\ge\; 0 \qquad \square$$
What just happened?
Gaussian process priors

Definition. A function k : X × X → R is a Mercer kernel if, for any finite collection X = [x₁, …, x_N], the matrix k_XX ∈ R^{N×N} with elements k_XX,(i,j) = k(x_i, x_j) is positive semidefinite.

Definition. Let µ : X → R be any function and k : X × X → R be a Mercer kernel. A Gaussian process p(f) = GP(f; µ, k) is a probability distribution over the function f : X → R such that every finite restriction to function values f_X := [f_{x₁}, …, f_{x_N}] is a Gaussian distribution p(f_X) = N(f_X; µ_X, k_XX).
The predictive posterior distribution
The posterior Gaussian process has a Gaussian predictive distribution at every test point x, with mean and variance given below.
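The formulas themselves are missing from the transcript; in the notation of the preceding slides (prior mean µ, kernel k, training inputs X with observed values y, test point x), the standard noise-free form is:

$$p(f_x \mid y) = \mathcal{N}\big(f_x;\; \mu_x + k_{xX}\,k_{XX}^{-1}(y - \mu_X),\;\; k_{xx} - k_{xX}\,k_{XX}^{-1}\,k_{Xx}\big)$$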
36
The predictive posterior under noise
Under observation noise, the posterior Gaussian process again has a Gaussian predictive distribution, with mean and variance given below.
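With observation noise y = f_X + ε, ε ~ N(0, σ²I), the same expressions hold with k_XX replaced by k_XX + σ²I (again a standard reconstruction, matching the code comments earlier in the deck):

$$p(f_x \mid y) = \mathcal{N}\big(f_x;\; \mu_x + k_{xX}(k_{XX} + \sigma^2 I)^{-1}(y - \mu_X),\;\; k_{xx} - k_{xX}(k_{XX} + \sigma^2 I)^{-1}k_{Xx}\big)$$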
37
Computational complexity of GPs
• Let t denote the number of data points in the GP
• Inverting the kernel matrix: O(t³)
• Predictions of the variance: O(t²)
• Predictions of the mean: O(t)
38
Two components of Bayesian optimization
• The probabilistic model
– Typically used: Gaussian process
– Later: other models are possible, e.g., random forests
• The acquisition function
– Trades off exploration vs. exploitation
– We’ll discuss this in detail
39
Probability of Improvement
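The formula on this slide is not in the transcript; for minimizing a function with GP posterior mean µ(x), standard deviation σ(x), and best observed value f_min, probability of improvement is usually defined as:

$$\mathrm{PI}(x) = P\big(f(x) < f_{\min}\big) = \Phi\!\left(\frac{f_{\min} - \mu(x)}{\sigma(x)}\right)$$

where Φ is the standard normal CDF.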
40
Expected Improvement
41
(the derivation of this integral’s closed-form solution will be an exercise)
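The formula is again missing from the transcript; with the same notation as above, expected improvement is the expected amount by which f(x) improves on the best observed value (its closed-form solution is the exercise mentioned above):

$$\mathrm{EI}(x) = \mathbb{E}\big[\max\{0,\, f_{\min} - f(x)\}\big] = \int_{-\infty}^{f_{\min}} (f_{\min} - f)\,\mathcal{N}\big(f;\, \mu(x),\, \sigma^2(x)\big)\, df$$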
Upper Confidence Bound (UCB)
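The formula is missing from the transcript; the name comes from the maximization setting, where one picks the point with the highest optimistic bound. For minimizing a loss, the analogous (lower) confidence bound with trade-off parameter κ is:

$$u(x) = \mu(x) - \kappa\,\sigma(x)$$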
42
Entropy Search
• Compute a probability distribution over which configuration is optimal
• Acquisition function: try to push this probability distribution as close to a delta distribution as possible
• One of the most powerful acquisition functions
– Can choose to actively evaluate in one region of the space to learn something about a different region of the space
43
Putting it all Together
• How to optimize the acquisition function?
– Subsidiary optimization method
– Important: in that subsidiary optimization, function evaluations are cheap (just evaluations of the GP).
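For instance (a sketch; `gp_predict` and `acquisition` are hypothetical helpers standing in for the GP prediction and whichever acquisition criterion is used, and `lower`, `upper`, `dim`, `model`, `f_min` are assumed to be defined):

```matlab
% subsidiary optimization of the acquisition function (cheap: only GP predictions)
n_cand = 10000;
cands = bsxfun(@plus, lower, bsxfun(@times, upper - lower, rand(n_cand, dim)));
[mu_c, var_c] = gp_predict(model, cands);   % GP posterior mean and variance at the candidates
a = acquisition(mu_c, sqrt(var_c), f_min);  % e.g., expected improvement w.r.t. best value f_min
[~, i] = max(a);
x_next = cands(i, :);                       % optionally refine with a local search from here
```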
44
Summary of Bayesian Optimization
• Bayesian optimization integrates
– prior information and
– the likelihood of the observed data
• It uses quite involved computation to select which function value to evaluate next
– Thus, it’s most useful for expensive blackbox functions
45
Overall summary
• Generalization: we need to safeguard against over-fitting
• Overview of hyperparameter optimization methods
• Bayesian optimization
– Based on Bayesian linear regression & Gaussian processes
• Next week:
– Bayesian optimization with random forests
– Extensions and applications
46