-
Course in Bayesian Optimization
Javier González (most of today's slides: Neil Lawrence)
University of Sheffield, Sheffield, UK
27th October 2015
-
Many thanks to
I Mauricio Álvarez.
I Cristian Guarizno.
I Neil Lawrence, University of Sheffield.
I Zhenwen Dai, University of Sheffield.
I Machine Learning group, University of Sheffield.
I Philipp Hennig, Max Planck Institute.
I Michael Osborne, University of Oxford.
-
Outline of the Course
I Lecture 1: Uncertainty and Gaussian Processes.
I Lecture 2: Introduction to Bayesian (probabilistic) optimization.
I Lecture 3: Advanced topics in Bayesian Optimization.
-
Outline of the Course
I Lab 1: Introduction to GPy.
I Lab 2: Introduction to GPyOpt.
I Lab 3: Advanced GPyOpt.
I Day 4: Projects + presentations.
-
Points of the day
What is machine learning?
What is uncertainty? What types of uncertainty are there?
How does uncertainty play a role in the learning process?
Gaussian processes as models to handle uncertainty.
-
What is to learn?
-
The human learning process?
Cancerous Tumor?
-
What is Machine Learning?
data + model = prediction
I data: observations, actively or passively acquired (meta-data).
I model: assumptions based on previous experience (other data! transfer learning etc.), or beliefs about the regularities of the universe. Inductive bias.
I prediction: an action to be taken, a categorization, or a quality score.
-
Historical Perspective
I A data-driven approach to Artificial Intelligence.
I Inspired by attempts to model the brain (the connectionists).
I A community that transcended traditional boundaries (psychology, statistical physics, signal processing).
I Led to an approach that dominates in the modern data-rich world.
-
Two Dominant Approaches
I Machine Learning as Optimization:
I Formulate your learning problem as an optimization problem.
I Typically intractable, so minimize a relaxed version of the cost function.
I Prove characteristics of the resulting solution.
I Machine Learning as Probabilistic Modelling:
I Formulate your learning problem as a probabilistic model.
I Relate variables through probability distributions.
I If Bayesian, treat parameters with probability distributions.
I Required integrals often intractable: use approximations (MCMC, variational etc.).
-
Modelling Assumptions
I Modelling assumptions are included either as:
I a regularizer (optimization) or
I in the probability distribution (probabilistic approach).
I Typical assumptions: sparsity, smoothness.
-
Applications of Machine Learning
Handwriting Recognition: recognising handwritten characters. For example LeNet http://bit.ly/d26fwK.
Friend Identification: suggesting friends on social networks https://www.facebook.com/help/501283333222485
Ranking: learning the relative skills of online game players, e.g. the TrueSkill system http://research.microsoft.com/en-us/projects/trueskill/ and http://www.netflixprize.com/.
Internet Search: for example Ad Click Through rate prediction http://bit.ly/a7XLH4.
News Personalisation: for example Zite http://www.zite.com/.
Game Play Learning: for example, learning to play Go http://bit.ly/cV77zM.
-
Learning is Optimization
-
y = mx + c
-
Figure: Data points in the (x, y) plane with the line y = mx + c, its intercept c and slope m marked; successive slides show candidate lines through the data.
-
y = mx + c
point 1: x = 1, y = 3
3 = m + c
point 2: x = 3, y = 1
1 = 3m + c
point 3: x = 2, y = 2.5
2.5 = 2m + c
-
y = mx + c + ε
point 1: x = 1, y = 3
3 = m + c + ε1
point 2: x = 3, y = 1
1 = 3m + c + ε2
point 3: x = 2, y = 2.5
2.5 = 2m + c + ε3
-
Regression Revisited
I We introduce an error function of the form
E(m, c) = ∑_{i=1}^{n} (yi − m xi − c)²
I Minimize the error function with respect to m and c.
-
Mathematical Interpretation
I What is the mathematical interpretation?
I There is a cost function.
I It expresses mismatch between your prediction and reality.
E(m, c) = ∑_{i=1}^{n} (yi − m xi − c)²
I This is known as the sum of squares error.
-
Learning is Optimization
I Learning is minimization of the cost function.
I At the minima the gradient is zero.
I Coordinate descent: find the gradient in each coordinate and set it to zero.
dE(m)/dm = −2 ∑_{i=1}^{n} xi (yi − m xi − c)
-
Learning is Optimization
I Learning is minimization of the cost function.
I At the minima the gradient is zero.
I Coordinate descent: find the gradient in each coordinate and set it to zero.
0 = −2 ∑_{i=1}^{n} xi (yi − m xi − c)
-
Learning is Optimization
I Learning is minimization of the cost function.
I At the minima the gradient is zero.
I Coordinate descent: find the gradient in each coordinate and set it to zero.
0 = −2 ∑_{i=1}^{n} xi yi + 2 ∑_{i=1}^{n} m xi² + 2 ∑_{i=1}^{n} c xi
-
Learning is Optimization
I Learning is minimization of the cost function.
I At the minima the gradient is zero.
I Coordinate descent: find the gradient in each coordinate and set it to zero.
m = ∑_{i=1}^{n} (yi − c) xi / ∑_{i=1}^{n} xi²
-
Learning is Optimization
I Learning is minimization of the cost function.
I At the minima the gradient is zero.
I Coordinate descent: find the gradient in each coordinate and set it to zero.
dE(c)/dc = −2 ∑_{i=1}^{n} (yi − m xi − c)
-
Learning is Optimization
I Learning is minimization of the cost function.
I At the minima the gradient is zero.
I Coordinate descent: find the gradient in each coordinate and set it to zero.
0 = −2 ∑_{i=1}^{n} (yi − m xi − c)
-
Learning is Optimization
I Learning is minimization of the cost function.
I At the minima the gradient is zero.
I Coordinate descent: find the gradient in each coordinate and set it to zero.
0 = −2 ∑_{i=1}^{n} yi + 2 ∑_{i=1}^{n} m xi + 2nc
-
Learning is Optimization
I Learning is minimization of the cost function.
I At the minima the gradient is zero.
I Coordinate descent: find the gradient in each coordinate and set it to zero.
c = ∑_{i=1}^{n} (yi − m xi) / n
-
Fixed Point Updates
Worked example.
c∗ = ∑_{i=1}^{n} (yi − m∗ xi) / n,
m∗ = ∑_{i=1}^{n} xi (yi − c∗) / ∑_{i=1}^{n} xi²,
σ²∗ = ∑_{i=1}^{n} (yi − m∗ xi − c∗)² / n
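The fixed point updates above can be sketched in a few lines of Python (a minimal illustration, using the three data points from the earlier slides; the starting values are arbitrary):

```python
import numpy as np

# The three points from the earlier slides: (1, 3), (3, 1), (2, 2.5).
x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

m, c = 0.0, 0.0  # arbitrary initialisation
for _ in range(500):
    # Update c with m held fixed: c = sum(y_i - m x_i) / n.
    c = np.mean(y - m * x)
    # Update m with c held fixed: m = sum(x_i (y_i - c)) / sum(x_i^2).
    m = np.sum(x * (y - c)) / np.sum(x ** 2)

# Noise variance estimate at the optimum.
sigma2 = np.mean((y - m * x - c) ** 2)
```

Each sweep decreases E(m, c), and the iterates converge to the same (m, c) that a direct least squares solve would give.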
-
Coordinate Descent
Figure: Contours of E(m, c) in the (m, c) plane with the coordinate descent path overlaid; successive slides show iterations 1 through 30. Each iteration updates one of m or c while holding the other fixed, so the path zig-zags along the coordinate axes towards the minimum.
-
Important Concepts Not Covered
I Optimization methods.
I Second order methods, conjugate gradient, quasi-Newton and Newton.
I Effective heuristics such as momentum, CMA, etc.
I Local vs global solutions (Bayesian optimization!).
-
Learning is probabilistic modelling
-
Machine Learning and Probability
I The world is an uncertain place.
Epistemic uncertainty: uncertainty arising through lack of knowledge. (What colour socks is that person wearing?)
Aleatoric uncertainty: uncertainty arising through an underlying stochastic system. (Where will a sheet of paper fall if I drop it?)
-
Probability: A Framework to Characterise Uncertainty
I We need a framework to characterise the uncertainty.
I In this course we make use of probability theory to characterise uncertainty.
-
Richard Price
I Welsh philosopher and essay writer.
I Edited Thomas Bayes's essay, which contained the foundations of Bayesian philosophy.
Figure: Richard Price, 1723–1791. (source Wikipedia)
-
Laplace
I French mathematician and astronomer.
Figure: Pierre-Simon Laplace, 1749–1827. (source Wikipedia)
-
Probabilistic Interpretation
I Quadratic error functions can be seen as Gaussian noise models [1, 2].
I Imagine we are seeing data given by
y(xi) = m xi + c + ε
where ε is Gaussian noise with standard deviation σ,
ε ∼ N(0, σ²).
-
Noise Corrupted Mapping
I This implies that
yi ∼ N(m xi + c, σ²)
I Which we also write
p(yi | m, c, σ²) = N(yi | m xi + c, σ²)
-
Gaussian Likelihood
I If the noise is sampled independently for each data point from the same density we have
p(y | m, c, σ²) = ∏_{i=1}^{n} N(yi | m xi + c, σ²)
I This is an i.i.d. assumption about the noise.
I Writing the functional form we have
p(y | m, c, σ²) = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(yi − m xi − c)² / (2σ²))
-
Gaussian Likelihood
I If the noise is sampled independently for each data point from the same density we have
p(y | m, c, σ²) = ∏_{i=1}^{n} N(yi | m xi + c, σ²)
I This is an i.i.d. assumption about the noise.
I Writing the functional form we have
p(y | m, c, σ²) ∝ ∏_{i=1}^{n} exp(−(yi − m xi − c)² / (2σ²))
-
Gaussian Likelihood
I If the noise is sampled independently for each data point from the same density we have
p(y | m, c, σ²) = ∏_{i=1}^{n} N(yi | m xi + c, σ²)
I This is an i.i.d. assumption about the noise.
I Writing the functional form we have
p(y | m, c, σ²) ∝ exp(−∑_{i=1}^{n} (yi − m xi − c)² / (2σ²))
-
Gaussian Log Likelihood
I If the noise is sampled independently for each data point from the same density we have
p(y | m, c, σ²) = ∏_{i=1}^{n} N(yi | m xi + c, σ²)
I This is an i.i.d. assumption about the noise.
I Writing the functional form we have
log p(y | m, c, σ²) = −(1/(2σ²)) ∑_{i=1}^{n} (yi − m xi − c)² + const
-
Gaussian Log Likelihood
I If the noise is sampled independently for each data point from the same density we have
p(y | m, c, σ²) = ∏_{i=1}^{n} N(yi | m xi + c, σ²)
I This is an i.i.d. assumption about the noise.
I Writing the functional form we have
− log p(y | m, c, σ²) = (1/(2σ²)) ∑_{i=1}^{n} (yi − m xi − c)² + const
-
Gaussian Log Likelihood
I If the noise is sampled independently for each data point from the same density we have
p(y | m, c, σ²) = ∏_{i=1}^{n} N(yi | m xi + c, σ²)
I This is an i.i.d. assumption about the noise.
I Writing the functional form we have
− log p(y | m, c, σ²) = (1/(2σ²)) E(m, c) + const
-
Probabilistic Interpretation of the Error Function
I Probabilistic interpretation for the error function: it is the negative log likelihood.
I Minimizing the error function is equivalent to maximizing the log likelihood.
I Maximizing the log likelihood is equivalent to maximizing the likelihood, because log is monotonic.
I Probabilistic interpretation: minimizing the error function is equivalent to maximum likelihood with respect to the parameters.
-
Sample Based Approximation implies i.i.d
I The log likelihood is
L(θ) = log P(y | θ)
I If the likelihood is independent over the individual data points,
P(y | θ) = ∏_{i=1}^{n} P(yi | θ)
I This is equivalent to the assumption that the data is independent and identically distributed, known as i.i.d.
I Now the log likelihood is
L(θ) = ∑_{i=1}^{n} log P(yi | θ)
I We take the negative log likelihood to recover the sum of squares error.
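The equivalence between the negative log likelihood and the sum of squares error can be checked numerically (a small sketch; the data here is synthetic and the noise variance is held fixed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=20)
y = 1.5 * x + 0.5 + 0.3 * rng.normal(size=20)  # synthetic linear data

sigma2 = 0.09  # fixed noise variance

def neg_log_likelihood(m, c):
    # -log p(y | m, c, sigma2) under i.i.d. Gaussian noise.
    n = len(y)
    residuals = y - m * x - c
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum(residuals ** 2) / (2 * sigma2)

def sum_of_squares(m, c):
    # E(m, c), the sum of squares error from the earlier slides.
    return np.sum((y - m * x - c) ** 2)

# For fixed sigma2 the two differ only by a constant,
# so they are minimized by the same (m, c).
const = 0.5 * len(y) * np.log(2 * np.pi * sigma2)
```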
-
Bayesian perspective
-
Underdetermined System
What about two unknowns and one observation?
y1 = m x1 + c
-
Underdetermined System
Can compute m given c:
m = (y1 − c) / x1
With the observed point x1 = 1, y1 = 3 (successive slides sample different values of c):
c = 1.75 ⇒ m = 1.25
c = −0.777 ⇒ m = 3.78
c = −4.01 ⇒ m = 7.01
c = −0.718 ⇒ m = 3.72
c = 2.45 ⇒ m = 0.545
c = −0.657 ⇒ m = 3.66
c = −3.13 ⇒ m = 6.13
c = −1.47 ⇒ m = 4.47
-
Underdetermined System
Can compute m given c. Assume
c ∼ N(0, 4),
and we find a distribution of solutions.
Figure: Each sampled c gives a line y = mx + c through the single data point.
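The construction above is easy to simulate: draw c from its prior and compute the implied m (a sketch, using the point x1 = 1, y1 = 3 implied by the numbers on the slides):

```python
import numpy as np

x1, y1 = 1.0, 3.0  # the single observation

rng = np.random.default_rng(0)
c_samples = rng.normal(0.0, 2.0, size=10000)  # c ~ N(0, 4), i.e. std dev 2
m_samples = (y1 - c_samples) / x1             # each c determines one m

# m inherits a Gaussian distribution: m ~ N(y1/x1, 4/x1^2).
```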
-
Bayesian Approach
I The likelihood for the regression example has the form
p(y | w, σ²) = ∏_{i=1}^{n} N(yi | w⊤φi, σ²).
I The suggestion was to maximize this likelihood with respect to w.
I This can be done with gradient based optimization of the log likelihood.
I Alternative approach: integration across w.
I Consider the expected value of the likelihood under a range of potential ws.
I This is known as the Bayesian approach.
-
Note on the Term Bayesian
I We will use Bayes' rule to invert probabilities in the Bayesian approach.
I Bayesian is not named after Bayes' rule (a very common confusion).
I The term Bayesian refers to the treatment of the parameters as stochastic variables.
I This approach was proposed by Laplace and Bayes independently.
I For early statisticians this was very controversial (Fisher et al.).
-
Bayesian Controversy
I The Bayesian controversy relates to treating epistemic uncertainty as aleatoric uncertainty.
I Another analogy:
I Before a football match the uncertainty about the result is aleatoric.
I If I watch a recorded match without knowing the result, the uncertainty is epistemic.
-
Simple Bayesian Inference
posterior = (likelihood × prior) / marginal likelihood
I Four components:
1. Prior distribution: represents belief about parameter values before seeing data.
2. Likelihood: gives the relation between parameters and data.
3. Posterior distribution: represents updated belief about parameters after data is observed.
4. Marginal likelihood: represents an assessment of the quality of the model. Can be compared with other models (likelihood/prior combinations). Ratios of marginal likelihoods are known as Bayes factors.
-
Recap
I We can see the learning process as an optimization problem.
I We can also see the learning process as probabilistic modelling (which also depends on parameters that need to be optimised).
I The Bayesian framework allows us to handle 'epistemic' uncertainty in systems.
I Examples only for linear functions, so far.
-
Generalizations: What if we want to use a more expressive model (a non-linear function)?
I ML as optimization: regularization in RKHSs (kernel methods).
I Bayesian perspective: Gaussian processes.
Both approaches are related, just as standard regression and Bayesian regression are.
More important for us: Gaussian processes.
-
Basis Functions: Nonlinear Regression
I Problem with linear regression: x may not be linearly related to y.
I Potential solution: create a feature space: define φ(x) where φ(·) is a nonlinear function of x.
I The model for the target is a linear combination of these nonlinear functions
f(x) = ∑_{j=1}^{K} wj φj(x) (1)
-
Quadratic Basis
I Basis functions can be global. E.g. the quadratic basis:
[1, x, x²]
φ1(x) = 1, φ2(x) = x, φ3(x) = x²
Figure: A quadratic basis.
-
Functions Derived from the Quadratic Basis
f(x) = w1 + w2 x + w3 x²
Figure: Functions from the quadratic basis with weights (w1, w2, w3) = (0.87466, −0.38835, −2.0058), (−0.35908, 1.2274, −0.32825) and (−1.5638, −0.73577, 1.6861).
-
Radial Basis Functions
I Or they can be local. E.g. the radial (or Gaussian) basis
φj(x) = exp(−(x − μj)² / ℓ²)
φ1(x) = e^{−2(x+1)²}, φ2(x) = e^{−2x²}, φ3(x) = e^{−2(x−1)²}
Figure: Radial basis functions.
-
Functions Derived from the Radial Basis
f(x) = w1 e^{−2(x+1)²} + w2 e^{−2x²} + w3 e^{−2(x−1)²}
Figure: Functions from the radial basis with weights (w1, w2, w3) = (−0.47518, −0.18924, −1.8183), (0.50596, −0.046315, 0.26813) and (0.07179, 1.3591, 0.50604).
-
Probabilistic Model with Basis Functions
I Define a general function:
f(xi) = w⊤φ(xi)
I Corrupt with independent noise:
y(xi) = f(xi) + εi, where εi ∼ N(0, σ²)
I This implies the following likelihood:
p(y | w, σ²) = ∏_{i=1}^{n} N(yi | w⊤φ(xi), σ²)
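A sketch of building the design matrix for the radial basis shown earlier and evaluating f(x) = w⊤φ(x) on a grid (the weight values are arbitrary illustration choices):

```python
import numpy as np

def radial_basis(x, centres, ell2=0.5):
    # Column j holds phi_j(x) = exp(-(x - mu_j)^2 / ell^2); ell^2 = 0.5
    # reproduces the e^{-2(x - mu_j)^2} basis from the figures.
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / ell2)

x = np.linspace(-2.0, 2.0, 50)
centres = np.array([-1.0, 0.0, 1.0])
Phi = radial_basis(x, centres)      # design matrix, shape (50, 3)

w = np.array([0.5, -1.0, 0.3])      # arbitrary weights for illustration
f = Phi @ w                         # f(x_i) = w^T phi(x_i) at every input
```

Each basis function peaks at its centre μj and decays away from it, which is what makes this basis local.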
-
Multivariate Regression Likelihood
I Noise corrupted data point:
yi = w⊤xi,: + εi
I Multivariate regression likelihood:
p(y | X, w) = (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) ∑_{i=1}^{n} (yi − w⊤xi,:)²)
I Now use a multivariate Gaussian prior:
p(w) = (1/(2πα)^{p/2}) exp(−(1/(2α)) w⊤w)
-
Posterior Density
I Once again we want to know the posterior:
p(w | y, X) ∝ p(y | X, w) p(w)
I And we can compute it by completing the square.
log p(w | y, X) = −(1/(2σ²)) ∑_{i=1}^{n} yi² + (1/σ²) ∑_{i=1}^{n} yi xi,:⊤w − (1/(2σ²)) ∑_{i=1}^{n} w⊤xi,: xi,:⊤w − (1/(2α)) w⊤w + const.
-
Computing the Posterior

I By inspection we extract the inverse covariance
  log p(w|y, X) = −(1/2σ²) Σ_{i=1}^{n} y_i² + (1/σ²) Σ_{i=1}^{n} y_i x_{i,:}⊤w
                  −(1/2σ²) Σ_{i=1}^{n} w⊤x_{i,:}x_{i,:}⊤w − (1/2α) w⊤w + const.
I Completing the square allows us to compute the mean.
-
Computing the Posterior

I By inspection we extract the inverse covariance
  log p(w|y, X) = −(1/2σ²) y⊤y + (1/σ²) y⊤Xw − (1/2σ²) w⊤X⊤Xw − (1/2α) w⊤w + const.
I Completing the square allows us to compute the mean.
-
Computing the Posterior

I By inspection we extract the inverse covariance
  log p(w|y, X) = −(1/2σ²) y⊤y + (1/σ²) y⊤Xw − (1/2) w⊤[σ^{−2}X⊤X + α^{−1}I]w + const.
I Completing the square allows us to compute the mean.
-
Making Predictions

I Giving a Gaussian density
  p(w|y, X) = N(w|µ_w, C_w)
  C_w = [σ^{−2}X⊤X + α^{−1}I]^{−1},  µ_w = C_w σ^{−2}X⊤y
I Posterior is combined with the 'test data' likelihood to make future predictions:
  p(y*|x*, X, y) = ∫ p(y*|x*, w) p(w|X, y) dw
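A minimal NumPy sketch of these formulas on synthetic data; the true weights, sample size, and the σ², α values here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
sigma2, alpha = 0.01, 1.0
y = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), size=n)

# Posterior covariance and mean from completing the square
Cw = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / alpha)
mu_w = Cw @ (X.T @ y) / sigma2

# Predictive mean and variance at a test point x_star
x_star = rng.normal(size=p)
mean_star = x_star @ mu_w
var_star = x_star @ Cw @ x_star + sigma2
```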
-
Bayesian vs Maximum Likelihood

I Note the similarity between the posterior mean
  µ_w = (σ^{−2}X⊤X + α^{−1}I)^{−1} σ^{−2}X⊤y
I and the maximum likelihood solution
  ŵ = (X⊤X)^{−1}X⊤y
-
Marginal Likelihood

I In some sense the real model is now the marginal likelihood.
I Marginalization of w follows the sum rule:
  p(y|X) = ∫ p(y|X, w) p(w) dw
  giving
  p(y|X) = N(y|0, αXX⊤ + σ²I)
I Often the integral is intractable.
I Leads to variational approximations, MCMC (Michael Betancourt, Mark Girolami), Laplace approximation (Håvard Rue).
I For the case of Gaussians it's trivial!
-
Marginal Likelihood

I Can compute the marginal likelihood as:
  p(y|X, α, σ) = N(y|0, αXX⊤ + σ²I)
I Or if we use a basis set we have
  p(y|X, α, σ) = N(y|0, αΦΦ⊤ + σ²I)
I This Gaussian is no longer i.i.d. across data and this is where things get interesting.
-
Marginal Likelihood

I The marginal likelihood can also be computed; it has the form:
  p(y|X, σ², α) = (2π)^{−n/2} |K|^{−1/2} exp(−(1/2) y⊤K^{−1}y),  where K = αΦΦ⊤ + σ²I.
I So it is a zero mean n-dimensional Gaussian with covariance matrix K.
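This density can be evaluated in a few lines of NumPy; the Cholesky-based evaluation below is a standard numerically stable sketch rather than code from the course.

```python
import numpy as np

def log_marginal(y, Phi, alpha, sigma2):
    """log N(y | 0, alpha * Phi Phi^T + sigma^2 I)."""
    n = y.shape[0]
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
    # Cholesky factorization rather than a direct inverse, for stability
    L = np.linalg.cholesky(K)
    beta = np.linalg.solve(L, y)          # solves L beta = y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (beta @ beta + logdet + n * np.log(2.0 * np.pi))
```

For example, with Φ = I, α = 1, σ² = 1 and y = 0 in two dimensions, K = 2I and the log density is −(log 2 + log 2π).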
-
Sampling a Function

Multi-variate Gaussians

I We will consider a Gaussian with a particular structure of covariance matrix.
I Generate a single sample from this 25 dimensional Gaussian distribution, f = [f₁, f₂, ..., f₂₅].
I We will plot these points against their index.
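One way to do this in NumPy; the exponentiated quadratic covariance and the length scale 0.3 are assumed choices, made so that the sample is smoothly correlated across the index.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(-1, 1, 25)                       # locations behind the 25 indices
# Exponentiated quadratic covariance between the 25 dimensions
K = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2 * 0.3 ** 2))
K += 1e-6 * np.eye(25)                           # jitter for numerical stability
L = np.linalg.cholesky(K)
f = L @ rng.standard_normal(25)                  # one correlated 25-d sample
```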
-
Gaussian Distribution Sample

Figure: A sample from a 25 dimensional Gaussian distribution. (a) A 25 dimensional correlated random variable, values f_i plotted against index i. (b) Colormap showing correlations between dimensions i and j.
-
Computing the Expected Output

I Given the posterior for the parameters, how can we compute the expected output at a given location?
I Output of model at location x_i is given by
  f(x_i; w) = φ_i⊤w
I We want the expected output under the posterior density, p(w|y, X, σ², α).
I Mean of mapping function will be given by
  ⟨f(x_i; w)⟩_{p(w|y,X,σ²,α)} = φ_i⊤⟨w⟩_{p(w|y,X,σ²,α)} = φ_i⊤µ_w
-
Variance of Expected Output

I Variance of model at location x_i is given by
  var(f(x_i; w)) = ⟨(f(x_i; w))²⟩ − ⟨f(x_i; w)⟩²
                 = φ_i⊤⟨ww⊤⟩φ_i − φ_i⊤⟨w⟩⟨w⟩⊤φ_i
                 = φ_i⊤C_w φ_i
  where all these expectations are taken under the posterior density, p(w|y, X, σ², α).
-
Book
-
Olympic Marathon Data

I Gold medal times for Olympic Marathon since 1896.
I Marathons before 1924 didn't have a standardised distance.
I Present results using pace per km.
I In 1904 the Marathon was badly organised, leading to very slow times.

Image from Wikimedia Commons http://bit.ly/16kMKHQ
-
Olympic Marathon Data

Figure: Olympic Marathon data; y, pace in min/km against x, year (1900–2020).
-
Olympics Data Analysis

I Use a Bayesian approach on the Olympics data with polynomials.
I Choose a prior w ∼ N(0, αI) with α = 1.
I Choose noise variance σ² = 0.01.
-
Sampling the Prior

I Always useful to perform a 'sanity check' and sample from the prior before observing the data.
I Since y = Φw + ε we just need to sample
  w ∼ N(0, αI)
  ε ∼ N(0, σ²)
  with α = 1 and σ² = 0.01.
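As a sketch, one prior draw for the marathon model could look like the following. The degree-3 polynomial basis and the rescaling of the years are illustrative assumptions (rescaling keeps the basis values numerically small).

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(1892, 2012, 40)
# Polynomial basis (degree 3), inputs standardised to avoid huge powers
z = (x - x.mean()) / x.std()
Phi = np.vander(z, 4, increasing=True)

w = rng.normal(0.0, np.sqrt(alpha), size=Phi.shape[1])  # w ~ N(0, alpha I)
eps = rng.normal(0.0, np.sqrt(sigma2), size=x.shape)    # noise
y_sample = Phi @ w + eps                                # one draw from the prior
```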
-
Recall: Validation Set for Maximum Likelihood
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 0, training error −1.8774, validation error −0.13132, σ² = 0.302, σ = 0.549.
-
Recall: Validation Set for Maximum Likelihood
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 1, training error −15.325, validation error 2.5863, σ² = 0.0733, σ = 0.271.
-
Recall: Validation Set for Maximum Likelihood
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 2, training error −17.579, validation error −8.4831, σ² = 0.0578, σ = 0.240.
-
Recall: Validation Set for Maximum Likelihood
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 3, training error −18.064, validation error 11.27, σ² = 0.0549, σ = 0.234.
-
Recall: Validation Set for Maximum Likelihood
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 4, training error −18.245, validation error 232.92, σ² = 0.0539, σ = 0.232.
-
Recall: Validation Set for Maximum Likelihood
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 5, training error −20.471, validation error 9898.1, σ² = 0.0426, σ = 0.207.
-
Recall: Validation Set for Maximum Likelihood
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 6, training error −22.881, validation error 67775, σ² = 0.0331, σ = 0.182.
-
Validation Set
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 0, training error 29.757, validation error −0.29243, σ² = 0.302, σ = 0.550.
-
Validation Set
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 1, training error 14.942, validation error 4.4027, σ² = 0.0762, σ = 0.276.
-
Validation Set
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 2, training error 9.7206, validation error −8.6623, σ² = 0.0580, σ = 0.241.
-
Validation Set
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 3, training error 10.416, validation error −6.4726, σ² = 0.0555, σ = 0.236.
-
Validation Set
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 4, training error 11.34, validation error −8.431, σ² = 0.0555, σ = 0.236.
-
Validation Set
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 5, training error 11.986, validation error −10.483, σ² = 0.0551, σ = 0.235.
-
Validation Set
Figure: Left, fit to data (1892–2012); right, model error against polynomial order (0–7). Polynomial order 6, training error 12.369, validation error −3.3823, σ² = 0.0537, σ = 0.232.
-
Regularized Mean

I Validation fit here based on the mean solution for w only.
I For the Bayesian solution
  µ_w = [σ^{−2}Φ⊤Φ + α^{−1}I]^{−1} σ^{−2}Φ⊤y
  instead of
  w* = [Φ⊤Φ]^{−1}Φ⊤y
I The two are equivalent when α → ∞.
I Equivalent to a prior for w with infinite variance.
I In other cases α^{−1}I regularizes the system (keeps the parameters smaller).
-
Sampling the Posterior

I Now check the samples by extracting w from the posterior.
I Now for y = Φw + ε we need
  w ∼ N(µ_w, C_w)
  with C_w = [σ^{−2}Φ⊤Φ + α^{−1}I]^{−1} and µ_w = C_w σ^{−2}Φ⊤y
  ε ∼ N(0, σ²)
  with α = 1 and σ² = 0.01.
-
Direct Construction of Covariance Matrix

Use matrix notation to write the function,

f(x_i; w) = Σ_{k=1}^{m} w_k φ_k(x_i)

computed at the training data, giving a vector

f = Φw.

w ∼ N(0, αI)

w and f are only related by an inner product.

Φ ∈ ℝ^{n×m} is the design matrix.
-
Expectations

I We have ⟨f⟩ = Φ⟨w⟩.
I Prior mean of w was zero, giving ⟨f⟩ = 0.
I Prior covariance of f is
  K = ⟨ff⊤⟩ − ⟨f⟩⟨f⟩⊤
  ⟨ff⊤⟩ = Φ⟨ww⊤⟩Φ⊤,
  giving K = αΦΦ⊤.

We use ⟨·⟩ to denote expectations under prior distributions.
-
Covariance between Two Points

I The prior covariance between two points x_i and x_j is
  k(x_i, x_j) = α φ(x_i)⊤φ(x_j),
  or in sum notation
  k(x_i, x_j) = α Σ_{k=1}^{m} φ_k(x_i)φ_k(x_j)
I For the radial basis used this gives
  k(x_i, x_j) = α Σ_{k=1}^{m} exp(−(|x_i − µ_k|² + |x_j − µ_k|²) / (2ℓ²)).
-
Covariance Functions and Mercer Kernels

I Mercer kernels and covariance functions are similar.
I The kernel perspective does not make a probabilistic interpretation of the covariance function.
I Algorithms can be simpler, but the probabilistic interpretation is crucial for kernel parameter optimization.
-
More on Mercer Kernels

Let X be a metric space and K : X × X → ℝ a continuous and symmetric function. If we assume that K is positive definite, that is, for any set x = {x₁, ..., x_n} ⊂ X the n × n matrix K with components

K_{ij} = K(x_i, x_j),

is positive semi-definite, then K is a Mercer kernel.
-
More on Mercer Kernels

Mercer's Theorem (1909): Let K : X × X → ℝ be a Mercer kernel. Let λ_j be the j-th eigenvalue of L_K and {φ_j}_{j≥1} the corresponding eigenfunctions. Then, for all x, x′ ∈ X,

K(x, x′) = Σ_{j=1}^{∞} λ_j φ_j(x)φ_j(x′)

where the convergence is absolute (for each (x, x′) ∈ X × X) and uniform (on X × X).

By using a kernel directly we are using a basis of functions implicitly (possibly with infinitely many elements: Bayesian non-parametrics).
-
Prediction with Correlated Gaussians

I Prediction of f* from f requires the multivariate conditional density.
I The multivariate conditional density is also Gaussian:
  p(f*|f) = N(f* | K_{*,f}K_{f,f}^{−1}f, K_{*,*} − K_{*,f}K_{f,f}^{−1}K_{f,*})
I Here the covariance of the joint density is given by
  K = [ K_{f,f}  K_{f,*} ; K_{*,f}  K_{*,*} ]
-
Prediction with Correlated Gaussians

I Prediction of f* from f requires the multivariate conditional density.
I The multivariate conditional density is also Gaussian:
  p(f*|f) = N(f* | µ, Σ)
  µ = K_{*,f}K_{f,f}^{−1}f
  Σ = K_{*,*} − K_{*,f}K_{f,f}^{−1}K_{f,*}
I Here the covariance of the joint density is given by
  K = [ K_{f,f}  K_{f,*} ; K_{*,f}  K_{*,*} ]
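These conditional formulas are the heart of GP prediction and fit in a few lines of NumPy. The exponentiated quadratic kernel, the training inputs, and the noise-free sin targets below are illustrative assumptions.

```python
import numpy as np

def k_rbf(A, B, alpha=1.0, ell=0.5):
    """Exponentiated quadratic covariance for 1-D inputs."""
    return alpha * np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell ** 2))

x = np.array([-1.5, -0.5, 0.3, 1.2])           # training inputs
f = np.sin(2 * x)                               # observed (noise-free) values
x_star = np.linspace(-2, 2, 50)                 # test inputs

Kff = k_rbf(x, x) + 1e-8 * np.eye(len(x))       # small jitter for stability
Ksf = k_rbf(x_star, x)
Kss = k_rbf(x_star, x_star)

mu = Ksf @ np.linalg.solve(Kff, f)              # K_{*,f} K_{f,f}^{-1} f
Sigma = Kss - Ksf @ np.linalg.solve(Kff, Ksf.T)
```

At the training inputs the posterior mean interpolates the observations exactly, as the conditional formulas require for noise-free data.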
-
Constructing Covariance Functions
I Sum of two covariances is also a covariance function.
k(x, x′) = k1(x, x′) + k2(x, x′)
-
Constructing Covariance Functions
I Product of two covariances is also a covariance function.
k(x, x′) = k1(x, x′)k2(x, x′)
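Both closure properties are easy to check numerically: build two valid Gram matrices, add and multiply them elementwise, and confirm the results stay positive semi-definite. The specific RBF and linear kernels and the input grid are illustrative choices.

```python
import numpy as np

def rbf(X, Y, ell=1.0):
    """Exponentiated quadratic kernel for 1-D inputs."""
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2 * ell ** 2))

def linear(X, Y, alpha=1.0):
    """Linear kernel k(x, x') = alpha * x * x' for 1-D inputs."""
    return alpha * X[:, None] * Y[None, :]

x = np.linspace(-2, 2, 20)
K_sum = rbf(x, x) + linear(x, x)     # sum of covariances: still a covariance
K_prod = rbf(x, x) * linear(x, x)    # elementwise (Schur) product: also valid

# Both should remain positive semi-definite (up to numerical error)
eig_sum = np.linalg.eigvalsh(K_sum)
eig_prod = np.linalg.eigvalsh(K_prod)
```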
-
Multiply by Deterministic Function

I If f(x) is a Gaussian process.
I g(x) is a deterministic function.
I h(x) = f(x)g(x)
I Then
  k_h(x, x′) = g(x) k_f(x, x′) g(x′)
  where k_h is the covariance for h(·) and k_f is the covariance for f(·).
-
Covariance Functions

MLP Covariance Function

k(x, x′) = α asin( (w x⊤x′ + b) / (√(w x⊤x + b + 1) √(w x′⊤x′ + b + 1)) )

I Based on infinite neural network model.
I w = 40, b = 4
-
Covariance Functions

Linear Covariance Function

k(x, x′) = α x⊤x′

I Bayesian linear regression.
I α = 1
-
Covariance Functions
Where did this covariance matrix come from?

Ornstein-Uhlenbeck (stationary Gauss-Markov) covariance function

k(x, x′) = α exp(−|x − x′| / (2ℓ²))

I In one dimension, arises from a stochastic differential equation: Brownian motion in a parabolic tube.
I In higher dimensions, a Fourier filter of the form 1/(π(1 + x²)).
-
Covariance Functions
Where did this covariance matrix come from?

Markov Process

k(t, t′) = α min(t, t′)

I Covariance matrix is built using the inputs to the function, t.
-
Covariance Functions
Where did this covariance matrix come from?

Matérn 5/2 Covariance Function

k(x, x′) = α (1 + √5 r + (5/3) r²) exp(−√5 r),  where r = ‖x − x′‖₂ / ℓ

I The Matérn 5/2 is a twice differentiable covariance.
I The Matérn family is constructed with Student-t filters in Fourier space.
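A sketch of this covariance in NumPy for one-dimensional inputs (the input grid is arbitrary):

```python
import numpy as np

def matern52(X, Y, alpha=1.0, ell=1.0):
    """Matern 5/2: alpha * (1 + sqrt(5) r + 5 r^2 / 3) exp(-sqrt(5) r)."""
    r = np.abs(X[:, None] - Y[None, :]) / ell
    return alpha * (1 + np.sqrt(5) * r + 5 * r ** 2 / 3) * np.exp(-np.sqrt(5) * r)

x = np.linspace(0, 5, 30)
K = matern52(x, x)   # symmetric, unit diagonal, positive semi-definite
```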
-
Covariance Functions

RBF Basis Functions

k(x, x′) = α φ(x)⊤φ(x′)

φ_k(x) = exp(−‖x − µ_k‖₂² / ℓ²)

µ = (−1, 0, 1)⊤
-
Covariance Functions
Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

k(x, x′) = α exp(−‖x − x′‖₂² / (2ℓ²))

I Covariance matrix is built using the inputs to the function, x.
I For the example above it was based on Euclidean distance.
I The covariance function is also known as a kernel.
-
Gaussian Process Interpolation

Figure: Gaussian process interpolation of f(x). Real example: BACCO (see e.g. [3]). Interpolation through outputs from slow computer simulations (e.g. atmospheric carbon levels).
-
Gaussian Noise

I Gaussian noise model,
  p(y_i|f_i) = N(y_i|f_i, σ²)
  where σ² is the variance of the noise.
I Equivalent to a covariance function of the form
  k(x_i, x_j) = δ_{i,j} σ²
  where δ_{i,j} is the Kronecker delta function.
I The additive nature of Gaussians means we can simply add this term to existing covariance matrices.
-
Gaussian Process Regression

Figure: Gaussian process regression of y(x). Examples include WiFi localization and the C14 calibration curve.
-
I Learning with Gaussian processes allows us to characterize the problem uncertainty.
I We can choose a basis of functions or directly select a kernel (equivalent, but better to choose the kernel: non-parametric models).
I Given a covariance (prior), how do we select the right parameters?
I Back to optimization...
-
Learning Covariance Parameters
Can we determine covariance parameters from the data?

N(y|0, K) = (2π)^{−n/2} |K|^{−1/2} exp(−y⊤K^{−1}y / 2)

The parameters are inside the covariance function (matrix).

k_{i,j} = k(x_i, x_j; θ)
-
Learning Covariance Parameters
Can we determine covariance parameters from the data?

log N(y|0, K) = −(1/2) log|K| − y⊤K^{−1}y / 2 − (n/2) log 2π

The parameters are inside the covariance function (matrix).

k_{i,j} = k(x_i, x_j; θ)
-
Learning Covariance Parameters
Can we determine covariance parameters from the data?

E(θ) = (1/2) log|K| + y⊤K^{−1}y / 2

The parameters are inside the covariance function (matrix).

k_{i,j} = k(x_i, x_j; θ)
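E(θ) can be minimized with any off-the-shelf optimizer. Below is a sketch using SciPy to fit an RBF length scale on synthetic data; the kernel choice, the data, and the fixed α, σ² values are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def objective(log_ell, x, y, alpha=1.0, sigma2=0.01):
    """E(theta) = 0.5 log|K| + 0.5 y^T K^{-1} y, RBF kernel with length scale ell."""
    ell = np.exp(log_ell)   # optimize in log space to keep ell positive
    K = alpha * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))
    K += sigma2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    beta = np.linalg.solve(L, y)
    return np.sum(np.log(np.diag(L))) + 0.5 * beta @ beta

rng = np.random.default_rng(6)
x = np.linspace(-2, 2, 25)
y = np.sin(3 * x) + rng.normal(0, 0.1, size=x.shape)

res = minimize_scalar(objective, bounds=(-3, 2), args=(x, y), method="bounded")
ell_hat = np.exp(res.x)   # fitted length scale
```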
-
Learning Covariance Parameters
Can we determine length scales and noise levels from the data?

Figure: Left, GP fit y(x) to the data; right, the objective E(θ) = (1/2) log|K| + y⊤K^{−1}y / 2 as a function of the length scale ℓ (log scale, 10⁻¹ to 10¹).
-
Limitations of Gaussian Processes

I Inference is O(n³) due to the matrix inverse (in practice, use the Cholesky decomposition).
I Gaussian processes don't deal well with discontinuities (financial crises, phosphorylation, collisions, edges in images).
I The widely used exponentiated quadratic covariance (RBF) can be too smooth in practice (but there are many alternatives!).
-
Conclusions

I Machine learning has focussed on prediction.
I Two main approaches: optimize an objective, or model probabilistically.
I Both approaches require optimizing parameters.
I Gaussian processes: fundamental models to deal with uncertainty in complex scenarios.
-
Tomorrow

I Global optimization.
I Parameter tuning in machine learning as a global optimization problem.
I Can we automate the parameter choice of machine learning algorithms?
I Yes! Bayesian Optimization.
-
References I

[1] Carl Friedrich Gauss. Theoria motus corporum coelestium. Perthes et Besser, Hamburg.

[2] Pierre Simon Laplace. Mémoire sur la probabilité des causes par les évènemens. In Mémoires de mathématique et de physique, présentés à l'Académie Royale des Sciences, par divers savans, & lûs dans ses assemblées 6, pages 621–656, 1774. Translated in [5].

[3] Jeremy Oakley and Anthony O'Hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 2002.

[4] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

[5] Stephen M. Stigler. Laplace's 1774 memoir on inverse probability. Statistical Science, 1:359–378, 1986.