Tutorial on Bayesian learning and related methods: A pre-seminar for Simon Godsill’s talk
Simon Wilson
Trinity College Dublin
Simon Wilson (Trinity College Dublin) Tutorial on Bayesian learning and related methods A pre-seminar for Simon Godsill’s talk1 / 58
Probability: the basics (1)
Probability: the mathematics of describing uncertain quantities;
$P(A)$ is the probability that the event A occurs;
Rules of probability are:
1. $0 \le P(A) \le 1$;
2. If A and B are mutually exclusive then $P(A \text{ or } B) = P(A) + P(B)$;
3. $P(A \text{ and } B) = P(A)\,P(B \mid A)$, where $P(B \mid A)$ means the probability of B given that A has occurred.
Two events A and B are called independent if $P(B \mid A) = P(B)$, and so $P(A \text{ and } B) = P(A)\,P(B)$.
Probability: the basics (2)
A random variable is real-valued and its value is uncertain. It is denoted by a capital letter, e.g. $X$, and its value by a small letter $x$;
Random variables can be discrete or continuous;
If discrete, $X$ is described by its probability mass function, $p_X(x) = P(X = x)$:
$$p_X(x) \ge 0, \qquad \sum_{\forall x} p_X(x) = 1.$$
If continuous, the probability density function $p_X(x)$ is used:
$p_X(x) \ge 0$ and has the property that $P(a < X < b) = \int_a^b p_X(x)\,dx$;
The cumulative distribution function $F_X(x)$ is $P(X \le x)$:
$$F_X(x) = \begin{cases} \sum_{s \le x} p_X(s), & \text{if } X \text{ discrete}, \\ \int_{s \le x} p_X(s)\,ds, & \text{if } X \text{ continuous}. \end{cases}$$
Probability: the basics (3)
The expected value or mean of a random variable is:
$$E(X) = \begin{cases} \sum_{\forall x} x\,p_X(x), & \text{if } X \text{ is discrete}; \\ \int_{\forall x} x\,p_X(x)\,dx, & \text{if } X \text{ is continuous}. \end{cases}$$
It’s the 'average' value of $X$, the 'centre of gravity' of the distribution;
The variance of $X$ is $E((X - E(X))^2)$:
$$\mathrm{Var}(X) = \begin{cases} \sum_{\forall x} (x - E(X))^2\,p_X(x), & \text{if } X \text{ is discrete}; \\ \int_{\forall x} (x - E(X))^2\,p_X(x)\,dx, & \text{if } X \text{ is continuous}. \end{cases}$$
It’s a measure of how variable the value of $X$ can be;
The standard deviation is $\sqrt{\mathrm{Var}(X)}$; it has the same units of measurement as $X$ and $E(X)$.
Probability: the basics (4)
Examples of discrete random variable distributions are the Bernoulli, binomial and Poisson:
$$p_X(x \mid p) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\};$$
$$p_X(x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x \in \{0, 1, \ldots, n\};$$
$$p_X(x \mid \lambda) = \frac{\lambda^x}{x!}\,e^{-\lambda}, \quad x = 0, 1, 2, \ldots$$
Are these familiar?
Good models for many physical phenomena;
Note that they are all defined in terms of parameters ($p$, $n$, $\lambda$); we think of these as conditional distributions of $X$ given the parameter.
Probability: the basics (5)
Examples of continuous random variable distributions are the exponential and normal (or Gaussian):
$$p_X(x \mid \mu) = \frac{1}{\mu}\,e^{-x/\mu}, \quad x \ge 0;$$
$$p_X(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right), \quad x \in \mathbb{R}.$$
Are these familiar?
The normal distribution occurs in many places (and will in the seminar, repeatedly);
$\mu$ is its mean and $\sigma^2$ is its variance;
Some normal pdf plots
Probability: the basics (6)
If we have two random variables $X$ and $Y$ then we can define the joint pmf/pdf $p(x, y)$:
For discrete $X$, $Y$: $p(x, y) = P(X = x \text{ and } Y = y)$;
For continuous $X$, $Y$: $\int_a^b \int_c^d p(x, y)\,dx\,dy = P(c < X < d,\ a < Y < b)$;
The laws of probability show that $p_X(x) = \int_{\forall y} p(x, y)\,dy$;
For discrete $X$ and $Y$, the conditional distribution of $X$ given $Y = y$ is
$$P(X = x \mid Y = y) = \frac{p(x, y)}{p_Y(y)}.$$
$X$ and $Y$ are called independent if $P(X = x \mid Y = y) = P(X = x)$ and $P(Y = y \mid X = x) = P(Y = y)$;
In this case, $p(x, y) = p_X(x)\,p_Y(y)$.
Probability: two important laws (1)
Two urns: urn I has 3 red and 3 blue balls, urn II has 2 red and 4 green balls;
I flip a fair coin (so $P(H) = P(T) = 1/2$). If H then pick a ball from urn I, else pick one from urn II.
What is $P(R) = P(\text{red ball picked})$? By laws of probability:
$$\begin{aligned} P(R) &= P((H \text{ and } R) \text{ or } (T \text{ and } R)) \\ &= P(H \text{ and } R) + P(T \text{ and } R) \\ &= P(H)\,P(R \mid H) + P(T)\,P(R \mid T) \\ &= \sum_{y = H, T} P(y)\,P(R \mid y) \\ &= 1/2 \times 3/6 + 1/2 \times 2/6 = 5/12. \end{aligned}$$
Probability: two important laws (2)
Two urns: urn I has 3 red and 3 blue balls, urn II has 2 red and 4 green balls;
I flip a fair coin. If H then pick a ball from urn I, else pick one from urn II.
Now I tell you that I picked a red ball. What is the chance that I flipped a H? This is $P(H \mid R)$:
$$\begin{aligned} P(H \mid R) &= P(H \text{ and } R) / P(R) \\ &= \frac{P(H)\,P(R \mid H)}{P(R)} \\ &= (1/2 \times 3/6) / (5/12) = 3/5. \end{aligned}$$
Note that if I do not tell you the ball colour, the chance of a H is $P(H) = 1/2$.
Observing R has allowed you to learn about how likely H is.
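The urn calculation can be checked with exact fractions. The following is a minimal sketch; the variable names are illustrative and not part of the slides.

```python
from fractions import Fraction

# Prior probabilities of the coin flip (fair coin).
prior = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
# P(red | flip): urn I holds 3 red out of 6 balls, urn II holds 2 red out of 6.
p_red_given = {"H": Fraction(3, 6), "T": Fraction(2, 6)}

# P(R) = sum over y of P(y) P(R | y).
p_red = sum(prior[y] * p_red_given[y] for y in prior)

# P(H | R) = P(H) P(R | H) / P(R).
p_heads_given_red = prior["H"] * p_red_given["H"] / p_red

print(p_red)              # 5/12
print(p_heads_given_red)  # 3/5
```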
Probability: two important laws (3)
The first equation is an example of the partition law.
In terms of random variables, we write that for any two random variables $X$ and $Y$:
$$p_X(x) = \sum_{\forall y} p_{X|Y}(x \mid Y = y)\,p_Y(y),$$
where $p_{X|Y}(x \mid Y = y)$ is the conditional pmf of $X$ given $Y$.
Probability: two important laws (4)
The second equation is an example of Bayes’ law.
It can be written as:
$$p_{Y|X}(y \mid X = x) = \frac{p_Y(y)\,p_{X|Y}(x \mid Y = y)}{p_X(x)},$$
which is often written (by the partition law):
$$p_{Y|X}(y \mid X = x) = \frac{p_Y(y)\,p_{X|Y}(x \mid Y = y)}{\sum_{\forall y} p_Y(y)\,p_{X|Y}(x \mid Y = y)}.$$
We also see Bayes’ law written as:
$$p_{Y|X}(y \mid X = x) \propto p_Y(y)\,p_{X|Y}(x \mid Y = y).$$
$\sum$ is replaced by $\int$ if $Y$ is continuous.
What is Monte Carlo simulation?
This is to generate a sequence of values from a probability distribution;
Usually done by computer;
To Monte Carlo simulate from a probability distribution $p_X(x)$ means to generate a sequence of values $x_1, x_2, \ldots, x_N$ such that:
The values are independent;
If $X$ is discrete, the proportion of values equal to $x$ converges to $p_X(x)$, $\forall x$, as $N \to \infty$;
If $X$ is continuous, the proportion of values in the interval $(a, b)$ converges to $\int_a^b p_X(x)\,dx$, $\forall a, b$, as $N \to \infty$;
E.g. simulation of a die: 4, 2, 5, 5, 1, 6, 3, 4, 2, 1, . . .
What about 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, . . .?
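The two die sequences can be compared in a few lines. This is a sketch only, using Python's standard `random` module; the function names are illustrative. Both sequences have proportions near 1/6, but only the first has independent values.

```python
import random
from collections import Counter

def simulate_die(n, seed=0):
    """Return n independent pseudo-random rolls of a fair six-sided die."""
    rng = random.Random(seed)
    return [rng.randint(1, 6) for _ in range(n)]

def proportions(values):
    """Proportion of the values equal to each face 1..6."""
    counts = Counter(values)
    return {face: counts[face] / len(values) for face in range(1, 7)}

rolls = simulate_die(60_000)
props = proportions(rolls)

# The cycling sequence 1, 2, ..., 6, 1, 2, ... also has proportions 1/6,
# but it is not a Monte Carlo sample: each value determines the next one,
# so the values are not independent.
cycle = [(i % 6) + 1 for i in range(60_000)]
cycle_props = proportions(cycle)
```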
Pseudo-random numbers
Computers are logic machines ⇒ should be no good at Monte Carlo simulation!
This is true: the best we can do is generate deterministic sequences of numbers;
These sequences have many of the properties of 'really' random sequences;
For most purposes they are indistinguishable from using 'the real thing';
They can also be generated very quickly ($\approx 10^7$ / second);
The basis of Monte Carlo methods is random numbers: these are uniformly distributed between 0 and 1 ($p_X(x) = 1$, $0 \le x \le 1$);
There are many algorithms for generating deterministic sequences that look like random numbers:
These are called pseudo-random numbers;
Methods of generating other distributions
There are many methods of generating values from other probability distributions: discrete, normal, exponential, etc;
All of these rely on a supply of pseudo-random numbers;
Many computer packages are able to Monte Carlo simulate from many distributions: MATLAB, R, even Excel!
Example: discrete probability distributions
Let $X$ be tomorrow’s weather, $X \in \{\text{sunny}, \text{cloudy}, \text{rainy}\}$;
Suppose $P(X = S) = 0.2$, $P(X = C) = 0.3$, $P(X = R) = 0.5$;
We can Monte Carlo simulate this distribution as follows:
Generate a (pseudo-) random number $u$;
If $u < 0.2$ then $X = S$;
if $0.2 \le u < 0.5$ then $X = C$;
if $u \ge 0.5$ then $X = R$.
This idea is called the inverse distribution method and works for all discrete distributions.
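The recipe above can be sketched as follows. The helper `sample_discrete` is a hypothetical name; it walks through the cumulative probabilities until they exceed $u$, which is the same comparison as the three-way rule for the weather example.

```python
import random

STATES = ["S", "C", "R"]
PROBS = [0.2, 0.3, 0.5]

def sample_discrete(u, states=STATES, probs=PROBS):
    """Map a uniform number u in [0, 1) to a state via cumulative probabilities."""
    cumulative = 0.0
    for state, p in zip(states, probs):
        cumulative += p
        if u < cumulative:
            return state
    return states[-1]  # guards against floating-point round-off in the sum

rng = random.Random(0)
draws = [sample_discrete(rng.random()) for _ in range(100_000)]
rainy_fraction = draws.count("R") / len(draws)  # should be close to 0.5
```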
Bayes law is the basis for learning
In the urn problem, observing R tells you something about the coin flip but does not tell you if it’s H or T with certainty;
The question is then: how “certain” can I be that the flip is a H? Or T?
Bayes’ law allowed us to compute how certain, as a probability, in terms of probabilities that we know.
This situation occurs everywhere in data analysis and is the basis of statistical inference (or statistical learning);
Bayesian statistical inference defines what we learn through a probability distribution on the quantity of interest;
Often this is defined through Bayes’ law.
Slightly more complicated example (1)
What is the temperature in this room?
For simplicity, let’s assume that it’s constant all over the room.
I have a thermometer and it measures 18.1 °C;
Is that the “real” temperature in the room?
Why not?
I have another identical make of thermometer. It measures 18.1 °C as well.
Should I be more certain about the value of the real temperature now?
If yes then by how much?
What if the second thermometer had read 18.4 °C?
Slightly more complicated example (2)
What does Bayesian inference say about how to answer this question?
Let $T$ be the true (and unknown) temperature in the room;
Let $x_1$ and $x_2$ be the temperature measurements;
Our state of knowledge about $T$ is defined by $p_T(t \mid x_1, x_2)$;
By Bayes’ law:
$$p_T(t \mid x_1) = \frac{p_T(t)\,p(x_1 \mid T = t)}{p(x_1)} \propto p_T(t)\,p(x_1 \mid T = t);$$
$$p_T(t \mid x_1, x_2) \propto p_T(t)\,p(x_1, x_2 \mid T = t).$$
Slightly more complicated example (3)
pT (t) represents what we think T is before we measure it;
This is known as the prior distribution;
For example, we are pretty sure that $0 \le T \le 40$; one possibility is a uniform distribution on this range:
$$p_T(t) = \frac{1}{40}, \quad 0 \le t \le 40.$$
Another is a normal distribution with mean as our best guess (say 20 °C) and a standard deviation of 10 (so that $0 \le T \le 40$ with high probability).
Two possible priors for T
Slightly more complicated example (4)
$p(x_1, x_2 \mid T = t)$ describes what we measure given the true temperature is $t$;
One reasonable model is that what we measure is normally distributed with mean $T$ and a variance $\sigma^2$;
Here we assume that we know $\sigma^2$ (it says how accurate our thermometer is; let’s say $\sigma^2 = 0.3^2$):
$$p(x_1 \mid T = t) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-0.5 (x_1 - t)^2 / \sigma^2}.$$
Also assume that the two measurements are independent given $T$, so that:
$$p(x_1, x_2 \mid T = t) = p(x_1 \mid T = t)\,p(x_2 \mid T = t) = \frac{1}{2\pi\sigma^2}\,e^{-\frac{1}{2\sigma^2}\left[(x_1 - t)^2 + (x_2 - t)^2\right]}.$$
Distribution of x1 when T = 18
Slightly more complicated example (5)
In Bayes’ law, the variable is $t$, so we should actually think of $p(x_1 \mid T = t)$ and $p(x_1, x_2 \mid T = t)$ as a function of $t$;
This is called the likelihood; with $\sigma^2 = 0.3^2 = 0.09$ (so $2\sigma^2 = 0.18$) we have:
$$p(x_1 \mid T = t) = \frac{1}{\sqrt{0.18\pi}}\,e^{-(x_1 - t)^2 / 0.18}.$$
As before, the two measurements are assumed independent given $T$, so that:
$$p(x_1, x_2 \mid T = t) = p(x_1 \mid T = t)\,p(x_2 \mid T = t) = \frac{1}{0.18\pi}\,e^{-\frac{1}{0.18}\left[(x_1 - t)^2 + (x_2 - t)^2\right]}.$$
Likelihood for x1 = 18.1
Likelihood for x1 = 18.1, x2 = 18.1
Likelihood for x1 = 18.1, x2 = 18.4
Slightly more complicated example (6)
Bayes’ law gave us:
$$p_T(t \mid x_1, x_2) \propto p_T(t) \times p(x_1, x_2 \mid T = t);$$
$p_T(t \mid x_1, x_2)$ is called the posterior distribution.
Plots of $p_T(t) \times p(x_1, x_2 \mid T = t)$ next.
$p_T(t) \times p(x_1 = 18.1 \mid T = t)$ and $p_T(t) \times p(x_1 = x_2 = 18.1 \mid T = t)$
$p_T(t) \times p(x_1 = x_2 = 18.1 \mid T = t)$ and $p_T(t) \times p(x_1 = 18.1, x_2 = 18.4 \mid T = t)$
Slightly more complicated example (7)
Now all that is missing is the constant that relates $p_T(t \mid x_1, x_2)$ to $p_T(t) \times p(x_1, x_2 \mid T = t)$;
Bayes’ law:
$$p_T(t \mid x_1, x_2) = \frac{p_T(t) \times p(x_1, x_2 \mid T = t)}{p(x_1, x_2)}$$
tells you that this is
$$1/p(x_1, x_2) = 1 \Big/ \int_0^\infty p_T(t) \times p(x_1, x_2 \mid T = t)\,dt;$$
It’s just one over the integral of the function plotted on the last slides;
It’s there to ensure that $\int_0^\infty p_T(t \mid x_1, x_2)\,dt = 1$;
$1/p(18.1, 18.1) = 55.6$, $1/p(18.1, 18.4) = 68.7$.
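The normalisation step can be sketched numerically by evaluating prior times likelihood on a fine grid. The code below is an illustration under stated assumptions: it uses the uniform prior on [0, 40], $\sigma = 0.3$, and a midpoint-rule sum for the integral; all function names are hypothetical. The constant it computes depends on the prior, so it is not meant to reproduce the figures quoted above. Instead it reports the posterior mean and standard deviation, which answer the earlier question of how much a second reading helps.

```python
import math

SIGMA = 0.3  # assumed thermometer standard deviation, so SIGMA**2 = 0.09

def likelihood(t, xs, sigma=SIGMA):
    """p(x_1, ..., x_n | T = t) for independent normal measurements."""
    norm = (2 * math.pi * sigma**2) ** (-len(xs) / 2)
    return norm * math.exp(-sum((x - t) ** 2 for x in xs) / (2 * sigma**2))

def uniform_prior(t, low=0.0, high=40.0):
    """Uniform prior density on [low, high]."""
    return 1.0 / (high - low) if low <= t <= high else 0.0

def grid_posterior(xs, low=0.0, high=40.0, n=40_000):
    """Return grid points, the normalised posterior density on them, and the step."""
    step = (high - low) / n
    grid = [low + (i + 0.5) * step for i in range(n)]  # midpoint rule
    unnorm = [uniform_prior(t) * likelihood(t, xs) for t in grid]
    evidence = sum(unnorm) * step  # numerical approximation to p(x_1, ..., x_n)
    return grid, [u / evidence for u in unnorm], step

def mean_and_sd(xs):
    """Posterior mean and standard deviation of T given measurements xs."""
    grid, post, step = grid_posterior(xs)
    mean = sum(t * p for t, p in zip(grid, post)) * step
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, post)) * step
    return mean, math.sqrt(var)

one_reading = mean_and_sd([18.1])
two_equal = mean_and_sd([18.1, 18.1])
two_different = mean_and_sd([18.1, 18.4])
```

Under these assumptions the posterior standard deviation shrinks from about 0.3 with one reading to about $0.3/\sqrt{2}$ with two, whether or not the readings agree; a differing second reading moves the posterior mean to the average of the two.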
p(t | x1 = x2 = 18.1) and p(t | x1 = 18.1, x2 = 18.4)
Stochastic processes
Many of the processes that we want to model occur over space and time:
Audio signals;
Rainfall;
Financial time series;
Image and video data;
...
These are modelled probabilistically by stochastic processes.
Stochastic processes in discrete time
A very common subset is processes that evolve at discrete points in time $t = 1, 2, 3, \ldots, T$;
Then $X_1, X_2, \ldots, X_T$ are the values of the process at these times;
In general, we then have to define a probability distribution on $(X_1, \ldots, X_T)$;
This is in general difficult because we have to define a $T$-dimensional function $p(x_1, \ldots, x_T)$;
We can exploit properties of the process to make the model simpler to define:
Many processes obey what is called the Markov property;
This means that, given the past, the distribution of $X_t$ only depends on the value of $X_{t-1}$;
If $X_1, X_2, \ldots$ obeys this property then it’s called a Markov chain.
Markov chains
For a Markov chain:
$$p(x_1, \ldots, x_T) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2) \cdots p(x_T \mid x_{T-1}).$$
So $p(x_1, \ldots, x_T)$ is defined in terms of simple one-dimensional distributions;
If $p(x_t \mid x_{t-1})$ is independent of $t$ then we just need to define $p(x_1)$ and $p(x_t \mid x_{t-1})$;
If $x_t$ is discrete-valued (say $x_t \in \{1, 2, \ldots, S\}$) then $p(x_t \mid x_{t-1})$ is defined in a matrix:
$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1S} \\ p_{21} & p_{22} & \cdots & p_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ p_{S1} & p_{S2} & \cdots & p_{SS} \end{pmatrix},$$
where $p_{ij} = P(X_t = j \mid X_{t-1} = i)$. Hence each row in $P$ sums to 1.
Example
The weather each day is: sunny (S), cloudy (C) or rainy (R);
Xt is the weather on day t;
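Suppose the weather on day $t$ depends on the weather on day $t - 1$, but is independent of earlier days;
It is then a Markov chain; suppose its transition matrix, with rows indexed by $x_{t-1}$ and columns by $x_t$ (both in the order S, C, R), is
$$P = \begin{pmatrix} 0.3 & 0.5 & 0.2 \\ 0.25 & 0.5 & 0.25 \\ 0.4 & 0.3 & 0.3 \end{pmatrix},$$
e.g. $P(x_t = S \mid x_{t-1} = R) = p_{31} = 0.4$;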
Monte Carlo simulation of our Markov chain
This is easy to do;
Define x1 (let’s make it x1 = R);
Simulate $x_2$ given $x_1$ (using the R row of $P$); suppose we generate $x_2 = C$;
Simulate x3 given x2 (using the C row of P);
and so on.
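The steps above can be sketched as a short loop. This is a minimal illustration using the transition matrix from the weather example; `simulate_chain` is a hypothetical name, and `random.Random.choices` from the standard library does the row-wise draw.

```python
import random

STATES = ["S", "C", "R"]
# Rows of the transition matrix P, keyed by yesterday's weather.
P = {
    "S": [0.3, 0.5, 0.2],
    "C": [0.25, 0.5, 0.25],
    "R": [0.4, 0.3, 0.3],
}

def simulate_chain(x1, n_days, seed=0):
    """Simulate n_days of weather starting from state x1."""
    rng = random.Random(seed)
    path = [x1]
    for _ in range(n_days - 1):
        row = P[path[-1]]  # distribution of today's weather given yesterday's
        path.append(rng.choices(STATES, weights=row)[0])
    return path

weather = simulate_chain("R", 40)
print("".join(weather))
```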
40 days of weather
A random walk
Markov chains can have Xt continuous;
Example: $X_t$ is normally distributed with mean $X_{t-1}$ and a variance $\sigma^2$;
This is known as a random walk;
On the next page is a simulation with $X_1 = 0$ and two values of $\sigma^2$.
A normal random walk
Autoregressive processes
Random walks have the property that they 'wander away' from 0 (they are non-stationary);
Many physical processes tend to stay around a mean value (they are stationary);
An autoregressive process is a simple case: $X_t$ is normally distributed with mean $\theta X_{t-1}$ and a variance $\sigma^2$, where $-1 < \theta < 1$;
Higher order autoregressive processes are also very common, e.g. $X_t$ has mean $\theta_1 X_{t-1} + \theta_2 X_{t-2}$, etc.;
These are a simple model for an audio signal.
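A first order autoregressive process and a random walk differ only in the value of $\theta$, which a short simulation makes visible. The sketch below is illustrative (the function name and the chosen parameter values are assumptions); it relies on the standard result that a stationary first order process has variance $\sigma^2 / (1 - \theta^2)$.

```python
import random

def simulate_ar1(theta, sigma, n, x1=0.0, seed=0):
    """Simulate X_t ~ N(theta * X_{t-1}, sigma^2) for n steps, starting at x1."""
    rng = random.Random(seed)
    xs = [x1]
    for _ in range(n - 1):
        xs.append(rng.gauss(theta * xs[-1], sigma))
    return xs

def sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

ar = simulate_ar1(theta=0.5, sigma=1.0, n=100_000)   # stays around 0
walk = simulate_ar1(theta=1.0, sigma=1.0, n=1_000)   # theta = 1 gives a random walk
ar_var = sample_variance(ar)  # should be near 1 / (1 - 0.5**2) = 1.33...
```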
First order autoregressive processes (with σ = 1)
Third order autoregressive process: $E(X_t) = 0.7 X_{t-1} + 0.1 X_{t-2} + 0.15 X_{t-3}$
Properties of Markov chains
Vast literature in probability theory on properties of Markov chains;
Here, we concentrate on one property;
Suppose I start our weather chain in day 1;
What is the distribution of the weather on day $t + 1$? This is $P(x_{t+1} \mid x_1)$;
It turns out that the matrix of these probabilities is
$$P^t = P \times P \times \cdots \times P \quad \text{($t$-fold matrix multiplication).}$$
Properties of Markov chains
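Here is $P^t$ for $t = 2, 20, 200$:
$$P^2 = \begin{pmatrix} 0.30 & 0.46 & 0.24 \\ 0.30 & 0.45 & 0.25 \\ 0.32 & 0.44 & 0.24 \end{pmatrix}, \quad P^{20} = \begin{pmatrix} 0.30 & 0.45 & 0.25 \\ 0.30 & 0.45 & 0.25 \\ 0.30 & 0.45 & 0.25 \end{pmatrix}, \quad P^{200} = \begin{pmatrix} 0.30 & 0.45 & 0.25 \\ 0.30 & 0.45 & 0.25 \\ 0.30 & 0.45 & 0.25 \end{pmatrix}.$$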
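The powers of the weather transition matrix can be recomputed with plain repeated multiplication. This is a small sketch in pure Python; `mat_mul` and `mat_pow` are illustrative helper names.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    return [
        [sum(a[i][r] * b[r][j] for r in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def mat_pow(p, t):
    """Return p multiplied by itself t times (t >= 1)."""
    result = p
    for _ in range(t - 1):
        result = mat_mul(result, p)
    return result

P = [
    [0.3, 0.5, 0.2],
    [0.25, 0.5, 0.25],
    [0.4, 0.3, 0.3],
]

P2 = mat_pow(P, 2)
P20 = mat_pow(P, 20)
for row in P20:
    print([round(v, 2) for v in row])  # each row is [0.3, 0.45, 0.25]
```

Every row of `P20` agrees to many decimal places, which is the 'forgetting' of the starting state seen in the matrices above.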
Stationary distributions
So as you look further into the future:
The probability that you are in each state converges to a value;
This probability is the same regardless of your current state (so the chain ‘forgets’ where it was at the start);
This is called the stationary distribution of the chain (in this case (0.30, 0.45, 0.25));
Not all Markov chains have such a distribution;
Lots of theory on conditions for which it does happen; this includes most chains that one uses in practice;
Simulating the weather Markov chain: proportion of sunny days
Markov chains for Monte Carlo simulation
Note that the proportion of sunny days in the simulation converges to the stationary probability of 0.302;
So there is a (very long-winded!) way to simulate from the distribution (0.30, 0.45, 0.25):
Start this Markov chain in any of the 3 states;
Monte Carlo simulate the chain for a 'long' time;
The state of the chain at the end of the long simulation will have (approximately) distribution (0.30, 0.45, 0.25);
Hold this thought for later!
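The convergence of the sunny-day proportion can be reproduced with a long run of the weather chain. The following sketch is illustrative: the function name, run length and seed are arbitrary choices, and the exact stationary probability of sunny, 55/182 ≈ 0.302, comes from solving $\pi P = \pi$ for the weather matrix.

```python
import random

STATES = ["S", "C", "R"]
P = {
    "S": [0.3, 0.5, 0.2],
    "C": [0.25, 0.5, 0.25],
    "R": [0.4, 0.3, 0.3],
}

def sunny_proportion(n_days, start="R", seed=0):
    """Simulate the chain for n_days and return the fraction of sunny days."""
    rng = random.Random(seed)
    state = start
    sunny = 0
    for _ in range(n_days):
        sunny += state == "S"  # count today's weather, then move to tomorrow
        state = rng.choices(STATES, weights=P[state])[0]
    return sunny / n_days

long_run = sunny_proportion(200_000)
print(round(long_run, 3))
```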
Audio reconstruction
Think of an audio signal as a discrete process in time $X_1, X_2, \ldots$;
We have some audio data that has been corrupted:
CD or vinyl record scratches, tape degradation;
Telephone call over a noisy line;
So what we observe are not the $X_t$ but a corrupted version $Y_1, \ldots, Y_T$;
We want to recover the original 'true' audio signal $X_1, \ldots, X_T$ from $Y_1, \ldots, Y_T$;
Hidden Markov models
A simple model for this process is as follows:
Our real audio signal is a stochastic process like the AR model;
What we actually observe, $Y_t$, is normally distributed with mean $X_t$;
The normal distribution models the noise or degradation of the signal;
The $Y_t$’s are independent of each other given the $X_t$’s;
Mathematically:
$$Y_t \mid X_t, \sigma_Y^2 \sim N(X_t, \sigma_Y^2),$$
$$X_t \mid X_{t-1}, \sigma_X^2 \sim N(\theta X_{t-1}, \sigma_X^2).$$
This is an example of a Hidden Markov Model (HMM):
There is a hidden Markov chain $X_1, X_2, \ldots$;
We observe $Y_t$ that are $X_t$ plus some noise, and are independent given the $X_t$;
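The two model equations translate directly into a simulation of a hidden signal and its noisy observations. This is a sketch under stated assumptions: the parameter values are arbitrary, the function name is hypothetical, and the chain is started with $X_1 \sim N(0, \sigma_X^2)$.

```python
import random

def simulate_hmm(theta, sigma_x, sigma_y, n, seed=0):
    """Return (xs, ys): a hidden AR(1) signal and its noisy observations."""
    rng = random.Random(seed)
    xs, ys = [], []
    x_prev = 0.0  # makes X_1 ~ N(0, sigma_x^2), an assumed starting rule
    for _ in range(n):
        x = rng.gauss(theta * x_prev, sigma_x)  # X_t | X_{t-1} ~ N(theta X_{t-1}, sigma_x^2)
        y = rng.gauss(x, sigma_y)               # Y_t | X_t ~ N(X_t, sigma_y^2)
        xs.append(x)
        ys.append(y)
        x_prev = x
    return xs, ys

xs, ys = simulate_hmm(theta=0.9, sigma_x=1.0, sigma_y=0.5, n=20_000)
# The average squared noise should be close to sigma_y^2 = 0.25.
noise_power = sum((y - x) ** 2 for x, y in zip(xs, ys)) / len(xs)
```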
Xt – an AR process
Xt and Yt
Yt alone
Bayesian inference (1)
Our data are the Yt ’s;
The likelihood is:
$$p(y_1, \ldots, y_T \mid x_1, \ldots, x_T, \sigma_Y^2) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi\sigma_Y^2}} \exp\left(-\frac{1}{2\sigma_Y^2}(y_t - x_t)^2\right).$$
We want to learn about the $X_t$ and the 3 parameters $(\sigma_Y^2, \theta, \sigma_X^2)$;
Bayesian inference (2)
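The prior distribution for the $X_t$ is (say) an AR process:
$$p(x_1, \ldots, x_T \mid \theta, \sigma_X^2) = \frac{1}{\sqrt{2\pi\sigma_X^2}}\,e^{-x_1^2 / 2\sigma_X^2} \prod_{t=2}^{T} \frac{1}{\sqrt{2\pi\sigma_X^2}}\,e^{-(x_t - \theta x_{t-1})^2 / 2\sigma_X^2}.$$
We have a prior for the 3 parameters, $p(\sigma_Y^2, \theta, \sigma_X^2)$;
Then Bayes’ law gives us:
$$p(x_1, \ldots, x_T, \sigma_X^2, \theta, \sigma_Y^2 \mid y_1, \ldots, y_T) = \frac{p(y_1, \ldots, y_T \mid x_1, \ldots, x_T, \sigma_Y^2)\;p(x_1, \ldots, x_T \mid \theta, \sigma_X^2)\;p(\sigma_Y^2, \theta, \sigma_X^2)}{p(y_1, \ldots, y_T)}.$$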
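The numerator of this posterior is easy to evaluate even though the denominator is not. The sketch below computes its logarithm as log-likelihood plus log AR prior plus log parameter prior. The slides do not specify the parameter prior, so `log_param_prior` is a hypothetical flat prior on the allowed region, used purely for illustration; all function names are assumptions.

```python
import math

def log_normal_pdf(value, mean, var):
    """Log density of N(mean, var) at value."""
    return -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)

def log_likelihood(ys, xs, var_y):
    """log p(y_1..y_T | x_1..x_T, var_y): independent normal noise around each x_t."""
    return sum(log_normal_pdf(y, x, var_y) for y, x in zip(ys, xs))

def log_ar_prior(xs, theta, var_x):
    """log p(x_1..x_T | theta, var_x): x_1 ~ N(0, var_x), then AR(1) steps."""
    total = log_normal_pdf(xs[0], 0.0, var_x)
    for prev, cur in zip(xs, xs[1:]):
        total += log_normal_pdf(cur, theta * prev, var_x)
    return total

def log_param_prior(var_y, theta, var_x):
    """Hypothetical flat prior on var_y > 0, var_x > 0, -1 < theta < 1 (not from the slides)."""
    if var_y > 0 and var_x > 0 and -1 < theta < 1:
        return 0.0
    return -math.inf

def log_unnormalised_posterior(xs, var_x, theta, var_y, ys):
    """Log of the numerator in Bayes' law; the constant log p(y_1..y_T) is omitted."""
    lp = log_param_prior(var_y, theta, var_x)
    if lp == -math.inf:
        return lp
    return lp + log_likelihood(ys, xs, var_y) + log_ar_prior(xs, theta, var_x)
```

Working on the log scale avoids numerical underflow when $T$ is large, since the product of many small densities becomes a sum.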
Bayesian inference (3)
What is the denominator?
$$p(y_1, \ldots, y_T) = \int p(y_1, \ldots, y_T \mid x_1, \ldots, x_T, \sigma_Y^2)\;p(x_1, \ldots, x_T \mid \theta, \sigma_X^2)\;p(\sigma_Y^2, \theta, \sigma_X^2)\;dx_1 \cdots dx_T\,d\sigma_Y^2\,d\theta\,d\sigma_X^2,$$
a $T + 3$ dimensional integral;
This is in general impossible to compute numerically: too big a problem!
What can we do?
For this problem, there are some algorithms that can compute the means of the $X_t$, and approximate the posterior;
These break down when we consider more realistic models for $X_t$ and $Y_t \mid X_t$;
Can we Monte Carlo simulate from $p(x_1, \ldots, x_T, \sigma_X^2, \theta, \sigma_Y^2 \mid y_1, \ldots, y_T)$?
Markov chain Monte Carlo simulation
MC simulation from high-dimensional distributions is also very difficult;
However, it is possible to simulate from a Markov chain with stationary distribution that is the posterior;
So we simulate values of $(x_1, \ldots, x_T, \sigma_X^2, \theta, \sigma_Y^2)$ according to a certain Markov chain;
After we simulate values for a 'long time', we are sampling from the posterior.
For many Bayesian learning problems, this is the only way that we know to approximate the solution;
This is known as MCMC (Markov chain Monte Carlo);
MCMC methods are a module in themselves!
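To make the idea concrete, here is a minimal sketch of one well-known MCMC scheme, the random-walk Metropolis algorithm, applied to the much simpler one-dimensional thermometer posterior rather than to the audio model. The slides do not name a particular algorithm, so this choice is an assumption, as are the uniform prior on [0, 40], $\sigma = 0.3$, the proposal standard deviation, the run length and the burn-in. Note that only the unnormalised posterior is needed; the constant $p(x_1, x_2)$ cancels in the acceptance ratio.

```python
import math
import random

SIGMA = 0.3          # assumed thermometer standard deviation
DATA = [18.1, 18.4]  # the two readings from the temperature example

def log_target(t):
    """Log of prior times likelihood, up to an additive constant."""
    if not 0.0 <= t <= 40.0:
        return -math.inf  # outside the support of the uniform prior
    return -sum((x - t) ** 2 for x in DATA) / (2 * SIGMA**2)

def metropolis(n_steps, start=20.0, step_sd=0.5, seed=0):
    """Random-walk Metropolis: a Markov chain whose stationary distribution is the posterior."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_steps):
        proposal = rng.gauss(current, step_sd)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

samples = metropolis(50_000)
kept = samples[5_000:]  # discard the early part of the 'long' simulation
post_mean = sum(kept) / len(kept)
print(round(post_mean, 2))
```

With these settings the retained values should average close to 18.25, the midpoint of the two readings.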