Chapter 3: Monte Carlo methods
Maarten Jansen
Overview
1. Aspects of Monte Carlo Methods
1.1 Monte Carlo integration and importance sampling
1.2 Random number generators (slide 29)
1.2.1 Quantile method (slide 30)
1.2.2 Rejection sampling (slide 37)
2. Markov Chain Monte Carlo Methods
2.1 Markov Chains
2.2 Models for multivariate RV (slide 60)
2.2.1 Markov Random Fields (MRF) (slide 61)
2.2.2 Gibbs Random Fields (GRF) (slide 65)
2.2.3 The Hammersley-Clifford Theorem (slide 68)
2.3 MCMC samplers for integration
2.3.1 Gibbs sampler (slide 79)
2.3.2 Metropolis-Hastings sampler (slide 90)
2.4 Simulated annealing - MCMC optimization
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.1
1. Aspects of Monte Carlo Methods
Monte Carlo simulation or stochastic simulation
• tries to reformulate a problem such that its solution is the unknown parameter of an artificial random variable
• generates instances (an artificial sample) from that random variable
• applies statistical techniques to
– find (estimate) the parameter from the artificial sample
– evaluate the quality of the numerical outcome
• but it is essentially a method from numerical analysis
• Many of the applications of this numerical method come from statistical problems:
statistical problem → numerical solution → statistical technique
Two main categories of problems
• Integration
• Optimization
1.1 Monte Carlo integration and importance sampling
Suppose we want to evaluate $I = \int_a^b y(x)\,dx$
• Suppose $X \sim \mathrm{uniform}[a, b]$, then $I = (b-a)\cdot E(y(X))$
• Generate $X_i$, with $i = 1, \ldots, n$
• Estimate $\hat{I} = \frac{b-a}{n}\sum_{i=1}^{n} y(X_i)$
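The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `mc_integrate`, the test integrand `sin` on `[0, pi]` and the seed are assumptions chosen for the example.

```python
import math
import random


def mc_integrate(y, a, b, n, rng):
    """Monte Carlo estimate of the integral of y over [a, b] from n uniform draws."""
    total = 0.0
    for _ in range(n):
        x = rng.uniform(a, b)  # X_i ~ uniform[a, b]
        total += y(x)
    return (b - a) * total / n  # (b - a) times the sample mean of y(X_i)


rng = random.Random(0)  # fixed seed, only to make the run reproducible
estimate = mc_integrate(math.sin, 0.0, math.pi, 100_000, rng)
print(estimate)  # close to the exact value 2, up to Monte Carlo error
```

The estimate changes with the seed; only its distribution around the exact value is controlled.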
Accuracy of the stochastic approximation
We use statistical measures to evaluate the approximation
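1. Bias
$$E(\hat{I}) = \frac{b-a}{n}\sum_{i=1}^{n} E(y(X_i)) = (b-a)\cdot E(y(X)) = (b-a)\int_a^b y(x)\cdot\frac{1}{b-a}\,dx = I$$
The estimator is unbiased
2. Variance
$$\mathrm{var}(\hat{I}) = \frac{(b-a)^2}{n}\,\mathrm{var}(y(X)) = \frac{(b-a)^2}{n}\int_a^b \left(y(x) - \frac{I}{b-a}\right)^2\cdot\frac{1}{b-a}\,dx$$
(the mean of $y(X)$ is $E(y(X)) = I/(b-a)$)
Variance has two components:
– Order of magnitude:
∗ $\sigma_{\hat{I}} = O(n^{-1/2})$
∗ typical result for variance of sample mean
∗ Independent from dimension
– Variance of one observation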
Two questions
• How does this compare to competitors?
• How can we improve? → not on order of magnitude
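The quality of the outcome can be assessed from the same artificial sample. The sketch below, under the assumption of the same hypothetical test integrand `sin` on `[0, pi]`, returns the estimate together with an estimated standard error; multiplying n by 100 should shrink that standard error by roughly a factor 10.

```python
import math
import random
import statistics


def mc_integrate_with_se(y, a, b, n, rng):
    """Return the Monte Carlo estimate of the integral and its estimated standard error."""
    values = [(b - a) * y(rng.uniform(a, b)) for _ in range(n)]
    estimate = statistics.fmean(values)
    std_error = statistics.stdev(values) / math.sqrt(n)  # sd of one term / sqrt(n)
    return estimate, std_error


rng = random.Random(1)
results = {n: mc_integrate_with_se(math.sin, 0.0, math.pi, n, rng) for n in (100, 10_000)}
for n, (estimate, std_error) in results.items():
    print(n, estimate, std_error)
```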
Competitors: numerical integration (= quadrature)
Numerical integration is based on the principle: approximate the integrand by
a function that is easy to integrate.
The approximation is based on a limited number of observations of the integrand only, and it is constructed using interpolation or smoothing.
The error of numerical integration methods depends on several factors
• The smoothness of the integrand, in particular: is the integrand easy to approximate (see figures below)
• The number of function evaluations or observations n
• The locations $x_i$ at which the integrand is observed or evaluated
• The dimension (curse of dimensionality)
Functions that are difficult to approximate
Functions with
1. infinite slope
2. singularities
3. heavy oscillations
These features require locally dense observations/function evaluations
A very brief overview of quadrature methods
• For given xi, y(xi), quadrature formulas are based on
– Approximation of the integrand by polynomials:
∗ Rectangular rule or Midpoint rule
∗ Trapezoid rule
∗ Simpson’s rule
– Breaking up the interval [a, b] into subintervals→ composite rules
• When the $x_i$ are free to choose, the order of approximation can be optimised by choosing the $x_i$ to be the zeros of orthogonal polynomials → Gauss quadrature
Accuracy of quadrature methods
Assuming that the integrand is “sufficiently smooth”, in one dimension the approximation $I_q$ for $I$ has the following accuracy
$|I - I_q| \le C\cdot n^{-1}$, and for many methods $|I - I_q| \le C\cdot n^{-\alpha}$, with $\alpha > 1$
Compare with the precision of the random sampler, $\left[E(\hat{I} - I)^2\right]^{1/2} \sim n^{-1/2}$
Random sampling has two drawbacks
• Slower decay of error
• No hard upper bound for error
Curse of dimensionality
Observation
If $n_1$ observations (function evaluations) are needed for given accuracy of a numerical integration technique in one dimension, then the same technique extended into $d$ dimensions requires $n_1^d$ observations; the error is of the order of magnitude $O(1/n_1^{1/d})$
Reason
• Accuracy of numerical integration is a deterministic thing: we must cover every area in the region of integration to be sure that accuracy is met.
– Accuracy is thus directly linked to interpoint distance
– High dimensions means many dimensions in which two points can be distant from each other.
– Many more observations are needed for the same interpoint distance
• Quadrature is based on clever approximations of functions. It’s hard to be clever in high dimensions: hard to find equally good approximations.
No curse of dimensionality for stochastic simulation
Applications in statistics
• Computation of expected values $E(h(X)) = \int_{-\infty}^{\infty} f_X(x)\,h(x)\,dx$
• Computation of probabilities $P(X \in A) = E(\chi_A(X)) = \int_A f_X(x)\,dx$
($\chi_A(X)$ is the characteristic or indicator function of $A$)
• Computation of quantiles $Q_X(p) = F_X^{-1}(p)$ with $F_X(u) = \int_{-\infty}^{u} f_X(x)\,dx$
These problems appear in
• Bootstrapping and simulation
• Bayesian analysis: computation of posterior means, medians
• . . .
Non-uniform sampling
We have above the general expression $\mu = E(h(X)) = \int_{-\infty}^{\infty} f_X(x)\,h(x)\,dx$
which we can estimate by $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} h(X_i)$
So, if we have an integral $I = \int_a^b y(x)\,dx$
then we can define $h(x)$ as $h(x) = \frac{y(x)\cdot\chi_{[a,b]}(x)}{f_X(x)}$ (if this ratio is bounded near zeros of $f_X(x)$)
and estimate
$$\hat{I} = \frac{1}{n}\sum_{i=1}^{n} h(X_i) = \frac{1}{n}\sum_{i=1}^{n} \frac{y(X_i)\cdot\chi_{[a,b]}(X_i)}{f_X(X_i)}$$
where all $X_i$ are IID and have density $f_X(x)$.
Examples
• $f_X(x) = y(x)/x$, i.e., $h(x) = x$
(only possible if $y(x)/x$ is positive with integral equal to 1)
Then $I = \mu_X = E(X) = \int_{-\infty}^{\infty} x\cdot f_X(x)\,dx$ and $\hat{I} = \frac{1}{n}\sum_{i=1}^{n} X_i$
• $h(x) = \chi_A(x)$,
Then $I = p = P(X \in A) = \int_A f_X(x)\,dx$ and $\hat{I} = \frac{1}{n}\sum_{i=1}^{n} \chi_A(X_i) = \frac{\#\{i \mid X_i \in A\}}{n}$
• $f_X(x) = \frac{1}{b-a}\cdot\chi_{[a,b]}(x)$ and take $h(x)$ such that $h(x)\cdot f_X(x) = y(x)$ (where we assume that $y(x)$ is zero outside $[a, b]$; note that $h(x)$ outside $[a, b]$ is free to choose)
Then $I = \int_a^b y(x)\,dx$ and $\hat{I} = \frac{1}{n}\sum_{i=1}^{n} h(X_i) = \frac{b-a}{n}\sum_{i=1}^{n} y(X_i)$
From these examples, it is clear that there are many ways to estimate an integral. We
formalise this problem.
The importance function
If $X$ has density function $f_X(x)$ and we want to estimate
$$\mu = E(h(X)) = \int_{-\infty}^{\infty} h(x)\cdot f_X(x)\,dx,$$
then we may estimate this from a sample $X_i$ as
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} h(X_i)$$
If it is easier to sample from $f_U(u)$ (for instance, uniform random variables are easy to generate), then we can write
$$E(h(X)) = \int_{-\infty}^{\infty} h(u)\cdot f_X(u)\,du = \int_{-\infty}^{\infty} h(u)\cdot\frac{f_X(u)}{f_U(u)}\cdot f_U(u)\,du$$
We call the new sampling distribution $f_U(u)$ the importance function
Importance sampling
With $f_U(u)$ an importance function, denote $w(u) = \frac{f_X(u)}{f_U(u)}$
As a result $\mu = E(h(X)) = \int_{-\infty}^{\infty} h(u)\cdot w(u)\cdot f_U(u)\,du$
We can now estimate $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} h(U_i)\cdot w(U_i)$
The question is now how to choose fU(u)
• It must be easy to generate samples from it
• The variance of the estimator must be as low as possible
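As a hedged illustration of the re-weighted estimator, the sketch below estimates the tail probability P(X > 3) for X ∼ N(0, 1) by drawing from a normal importance function centred at 3. The target event, the centre `shift` and all function names are assumptions made for this example; they do not come from the slides.

```python
import math
import random


def importance_estimate(h, weight, sample_u, n):
    """Average of h(U_i) * w(U_i) over n draws U_i from the importance function."""
    total = 0.0
    for _ in range(n):
        u = sample_u()
        total += h(u) * weight(u)
    return total / n


rng = random.Random(2)
shift = 3.0  # hypothetical choice: centre of the normal importance function


def h(u):
    return 1.0 if u > 3.0 else 0.0  # indicator of the tail event {X > 3}


def weight(u):
    # w(u) = f_X(u) / f_U(u) for f_X = N(0, 1) and f_U = N(shift, 1)
    return math.exp(-shift * u + 0.5 * shift * shift)


estimate = importance_estimate(h, weight, lambda: rng.gauss(shift, 1.0), 50_000)
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # P(X > 3) for X ~ N(0, 1)
print(estimate, exact)
```

Sampling directly from N(0, 1) would hit the event in only about 0.1% of the draws, which is why moving the sampling distribution towards the region where h is nonzero helps here.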
The variance of importance sampling
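The variance equals $\mathrm{var}(\hat{\mu}) = \frac{1}{n}\cdot\mathrm{var}\left[h(U)\cdot w(U)\right]$
We can develop the variance of one term as
$$\begin{aligned}
\mathrm{var}\left[h(U)\cdot w(U)\right] &= E\left(\left[h(U)\cdot w(U)\right]^2\right) - \left(E\left[h(U)\cdot w(U)\right]\right)^2 \\
&= E\left(\left[h(U)\cdot w(U)\right]^2\right) - \mu^2 = E\left(\left[|h(U)|\cdot w(U)\right]^2\right) - \mu^2 \\
&\ge \left(E\left[|h(U)|\cdot w(U)\right]\right)^2 - \mu^2 = \left(E|h(X)|\right)^2 - \mu^2
\end{aligned}$$
The lower bound is independent from $f_U(u)$. The inequality becomes an equality if for $V = |h(U)|\cdot w(U)$ it holds that $E(V^2) = (EV)^2$, or, $\mathrm{var}(V) = E(V^2) - (EV)^2 = 0$, thus if $V$ is deterministic (with prob. 1).
So, variance is minimized if $|h(U)|\cdot w(U) = K$, for any random $U$, i.e., $\forall u \in \mathbb{R}$.
We have $|h(u)|\cdot w(u) = K \Leftrightarrow f_U(u) = \frac{|h(u)|\cdot f_X(u)}{K}$, where $K$ follows from imposing $\int_{-\infty}^{\infty} f_U(u)\,du = 1$,
Conclusion: minimum variance for
$$f_U(u) = \frac{|h(u)|\cdot f_X(u)}{\int_{-\infty}^{\infty} |h(u)|\cdot f_X(u)\,du}$$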
Interpretation of this result
• The result is of little immediate use. Indeed, full application requires knowledge of $\int_{-\infty}^{\infty} |h(u)|\cdot f_X(u)\,du$
If $h(u) \ge 0, \forall u \in \mathbb{R}$, this is exactly the integral we are after. In the other case, computation of this integral is probably as difficult as the original question.
• var(µ) can be much lower than when estimating µ with samples from fX(x).
• The basic idea is that fU(u) should behave not just as fX(x), but it should
also “follow” |h(u)|. Regions where h(u) is large in magnitude should be
sampled more.
• Pay special attention to tails of |h(u)| · fX(u)
Example with mixture of uniform sampling
• Mixture of L uniform random variables
• Uniform on (non-convex) subdomains $I_\ell$ defined by $I_\ell = \left\{x \,\middle|\, |y(x)| \ge (\ell/L)\cdot\max|y(x)|\right\}$
• mixture probability mass function $p_\ell = |I_\ell| \big/ \sum_{\ell=1}^{L} |I_\ell|$
Example from Bayesian statistics
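Suppose we observe $X_i | M \sim N(M, \sigma^2)$, $i = 1, \ldots, n$, with $\sigma^2$ known. We want to estimate the mean $M$, for which we impose a Cauchy prior model
$$f_M(m) = \frac{1}{\pi(1 + (m-\mu)^2)},$$
where hyperparameter $\mu$ may express prior knowledge of expected values (could be zero, e.g.)
The conditional sample density is (with $\bar{x}$ the sample mean of the observations $x_i$)
$$\begin{aligned}
f_{X|M}(x|m) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma}\cdot e^{-(x_i-m)^2/2\sigma^2} \\
&= \frac{1}{(2\pi)^{n/2}\sigma^n}\cdot e^{-\sum_{i=1}^{n}(x_i-m)^2/2\sigma^2} \\
&= \frac{1}{(2\pi)^{n/2}\sigma^n}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\cdot e^{-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2}
\end{aligned}$$
Then the joint distribution is
$$\begin{aligned}
f_{M,X}(m,x) &= f_M(m)\cdot f_{X|M}(x|m) \\
&= \frac{1}{\pi(1 + (m-\mu)^2)}\cdot\frac{1}{(2\pi)^{n/2}\sigma^n}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\cdot e^{-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2}
\end{aligned}$$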
Marginal distribution in Bayesian model
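So the marginal distribution of $X$ becomes
$$\begin{aligned}
f_X(x) &= \int_{-\infty}^{\infty} f_{M,X}(m,x)\,dm = \int_{-\infty}^{\infty} f_M(m)\cdot f_{X|M}(x|m)\,dm \\
&= C(x)\cdot\int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm
\end{aligned}$$
with $C(x) = \frac{1}{\pi}\cdot\frac{1}{(2\pi)^{n/2}\sigma^n}\cdot e^{-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2}$
(Note that the integral exists thanks to the rapid decay of the normal bell curve)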
Bayes’ rule: posterior = joint/marginal
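The posterior distribution of $M$, given the observation $X$ becomes
$$\begin{aligned}
f_{M|X}(m|x) &= \frac{f_M(m)\cdot f_{X|M}(x|m)}{f_X(x)} = \frac{f_M(m)\cdot f_{X|M}(x|m)}{\int_{-\infty}^{\infty} f_M(m)\cdot f_{X|M}(x|m)\,dm} \\
&= \frac{C(x)\cdot\frac{1}{1 + (m-\mu)^2}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}}{C(x)\cdot\int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm} \\
&= \frac{\frac{1}{1 + (m-\mu)^2}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}}{\int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}
\end{aligned}$$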
Bayesian estimation
Possible values of interest are
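the posterior mean
$$E(M|X = x) = \frac{\int_{-\infty}^{\infty} \frac{m}{1 + (m-\mu)^2}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}{\int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$$
and the posterior variance, which is $\mathrm{var}(M|X = x) = E(M^2|X = x) - \left[E(M|X = x)\right]^2$ with
$$E(M^2|X = x) = \frac{\int_{-\infty}^{\infty} \frac{m^2}{1 + (m-\mu)^2}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}{\int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2}\cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$$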
Monte-Carlo procedure for integrals
At least two possibilities:
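• Sample from normal density with expected value $\bar{x}$ and variance $\sigma^2/n$: $X_n \sim N(\bar{x}, \sigma^2/n)$. Then
$$E(M^k|X = x) = \frac{E\left(\frac{X_n^k}{1+(X_n-\mu)^2}\right)}{E\left(\frac{1}{1+(X_n-\mu)^2}\right)}$$
• Sample from Cauchy density with center (median) $\mu$: $f_U(u) = 1/[\pi(1+(u-\mu)^2)]$. Then
$$E(M^k|X = x) = \frac{E\left(U^k\cdot e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)}{E\left(e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)}$$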
• Sample from another distribution “as close as possible” to the integrand.
Arguments for using the normal sampler
In this case, the normal density is a much better choice than the Cauchy
• The tails of the integrand are lighter than the normal tail, so the heavy
tails of the Cauchy produce a lot of large samples whose values are not
representative for the integral
• The normal density has sample size (n) dependent variance, so that sam-
ples get more concentrated for large n, which corresponds to the true shape
of the integrand
Experiment: normal vs. Cauchy samplers
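As an illustration, we plot the estimates of the standard errors in estimating the following parameter
$$\begin{aligned}
I &= \int_{-\infty}^{\infty} \frac{1}{\pi\cdot(1 + (u-\mu)^2)}\cdot\frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}}\cdot e^{-(u-\bar{x})^2/(2\sigma^2/n)}\,du \\
&= E\left(\frac{1}{\pi\cdot(1 + (X_n-\mu)^2)}\right) = E\left(\frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}}\cdot e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)
\end{aligned}$$
with $X_n \sim N(\bar{x}, \sigma^2/n)$ and $U \sim \mathrm{Cauchy}(\mu, 1)$
We simulate $X_{n,i}$ and $U_i$ for $i = 1, \ldots, n_{MC}$ and define
$$\hat{I}_1 = \frac{1}{n_{MC}}\sum_{i=1}^{n_{MC}} \frac{1}{\pi\cdot(1 + (X_{n,i}-\mu)^2)}$$
$$\hat{I}_2 = \frac{1}{n_{MC}}\sum_{i=1}^{n_{MC}} \frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}}\cdot e^{-(U_i-\bar{x})^2/(2\sigma^2/n)}$$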
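The two estimators can be compared numerically with a sketch like the following. The parameter values (µ = 0, sample mean 1, σ = 1, n = 25, 20 000 Monte Carlo draws) are hypothetical choices for illustration and are not the settings used for the figures in the slides; the Cauchy draws use the inversion formula µ + tan(π(U − 1/2)).

```python
import math
import random
import statistics


def compare_samplers(mu, xbar, sigma, n, n_mc, rng):
    """Return (mean, st.dev. of one observation) for the normal and the Cauchy sampler."""
    s = sigma / math.sqrt(n)  # standard deviation of the normal factor

    def cauchy_density(u):
        return 1.0 / (math.pi * (1.0 + (u - mu) ** 2))

    def normal_density(u):
        return math.exp(-((u - xbar) ** 2) / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s)

    # I_1: draw from the normal factor, average the Cauchy factor
    terms_1 = [cauchy_density(rng.gauss(xbar, s)) for _ in range(n_mc)]
    # I_2: draw from the Cauchy factor (by inversion), average the normal factor
    terms_2 = [normal_density(mu + math.tan(math.pi * (rng.random() - 0.5)))
               for _ in range(n_mc)]
    return ((statistics.fmean(terms_1), statistics.stdev(terms_1)),
            (statistics.fmean(terms_2), statistics.stdev(terms_2)))


rng = random.Random(3)
(i1, sd1), (i2, sd2) = compare_samplers(mu=0.0, xbar=1.0, sigma=1.0, n=25, n_mc=20_000, rng=rng)
print(i1, sd1)  # normal sampler
print(i2, sd2)  # Cauchy sampler
```

Both means estimate the same integral; the standard deviation of a single term is what distinguishes the two samplers.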
Precision of the estimators
We can easily estimate the variances
$$\mathrm{var}(\hat{I}_1) = \frac{1}{n_{MC}}\,\mathrm{var}\left(\frac{1}{\pi\cdot(1+(X_n-\mu)^2)}\right)$$
$$\mathrm{var}(\hat{I}_2) = \frac{1}{n_{MC}}\,\mathrm{var}\left(\frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}}\cdot e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)$$
The estimates of the standard errors of a single observation (to be divided by $\sqrt{n_{MC}}$) are depicted below (together with the log of the estimates, to better show the behavior)
[Figure: two panels with curves for Cauchy samples and normal samples, plotted against n from 0 to 200: "estimated st.dev. of one observation" and "log(estimated st.dev.) of one observation"]
Interpretation: For n growing (this is not $n_{MC}$), the tail of the integrand becomes lighter, making the Cauchy sampler less and less attractive. The normal sampler comes closer to the integrand.
Importance function must have sufficiently heavy tails
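Previous examples have illustrated that it is of little use that the sampling density function has heavier tails than the integrand.
The opposite is, however, much worse (so if no perfect match can be realized, a slightly too heavy tail is preferable)
We have $\mathrm{var}(\hat{\mu}) = \frac{1}{n}\cdot\mathrm{var}\left[h(U)\cdot w(U)\right] = \frac{1}{n}\cdot\left(E\left[(h(U)\cdot w(U))^2\right] - \mu^2\right)$
Herein
$$\begin{aligned}
E\left[(h(U)\cdot w(U))^2\right] &= \int_{-\infty}^{\infty} [h(u)]^2[w(u)]^2 f_U(u)\,du \\
&= \int_{-\infty}^{\infty} [h(u)]^2\,\frac{f_X(u)^2}{f_U(u)}\,du \\
&= \int_{-\infty}^{\infty} \frac{h(u)f_X(u)}{f_U(u)}\cdot h(u)f_X(u)\,du
\end{aligned}$$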
If h(u)fX(u) has a heavier tail than fU(u), then the first factor tends to infinity
for u → ∞. The integral may then be large or even infinity, depending on the
tail of h(u)fX(u)
Conclusions about importance sampling
Importance sampling allows us to
• estimate expected values (integrals) with a random variable X whose distribution does not allow easy simulations, by drawing from another random variable U which is easier, followed by proper re-weighting.
• optimize (to some extent) the choice of sample distribution to estimate integrals.
We will later discuss rejection sampling, which also samples from an auxiliary distri-
bution. The outcome is then rejected or accepted with an appropriate probability such
that, a posteriori, (given the event of rejection or acceptance) the variable takes the
aimed distribution. Unlike importance sampling, the correction of rejection sampling
thus proceeds at the level of the random number generator itself (and not at the level
of computing the integral). We therefore discuss random number generators.
1.2 Random number generators
Importance sampling assumes that we can generate numbers from a given
distribution. How can we do that?
1.2.1 Quantile or inversion method
Theorem If $U \sim \mathrm{uniform}[0, 1]$ and $Q_X(p)$ is the quantile function of $X$, then $Q_X(U)$ has the same distribution as $X$, i.e.: $U \sim \mathrm{uniform}[0, 1] \Leftrightarrow Q_X(U) \overset{d}{=} X$
Proof uses monotonicity of $Q_X(p)$ or its inverse $F_X(x)$
$$P(Q_X(U) \le x) = P(U \le Q_X^{-1}(x)) = F_U(Q_X^{-1}(x)) = Q_X^{-1}(x) = F_X(x)$$
Example 1: Let $X \sim \exp(\lambda)$, then $F_X(x) = 1 - e^{-\lambda x}$, so $Q_X(p) = -\log(1-p)/\lambda$, so if $U \sim \mathrm{uniform}[0, 1]$, then take
$X = -\log(1-U)/\lambda$,
or, because $1-U$ is also uniform (by symmetry), we can take
$Y = -\log(U)/\lambda$
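A minimal sketch of the inversion method for the exponential distribution follows. The function name and the rate λ = 2 are assumptions for the example. The sketch uses the 1 − U form, because Python's `random()` returns values in [0, 1) and the logarithm of an exact zero would fail.

```python
import math
import random
import statistics


def sample_exponential(lam, rng):
    """Draw from exp(lam) by applying the quantile function to a uniform draw."""
    u = rng.random()  # uniform on [0, 1)
    return -math.log(1.0 - u) / lam  # 1 - u lies in (0, 1], so the log is finite


rng = random.Random(4)
draws = [sample_exponential(2.0, rng) for _ in range(100_000)]
print(statistics.fmean(draws))  # should be near the mean 1 / lam = 0.5
```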
Quantile or inversion method(2)
Example 2: Let $X \sim$ Cauchy with median $\mu$, i.e., $f_X(x) = \frac{1}{\pi[1+(x-\mu)^2]}$
If $\mu \ne 0$, then $X$ can be generated by adding $\mu$ to a Cauchy random variable with median 0.
So, we assume that $\mu = 0$.
Then $F_X(x) = \frac{1}{2} + \frac{1}{\pi}\arctan(x)$, and $Q_X(U) = \tan[\pi(U - 1/2)]$
Note: if $X \sim \mathrm{Cauchy}(\mu = 0)$ then $-X \sim \mathrm{Cauchy}(\mu = 0)$ and $1/X \sim \mathrm{Cauchy}(\mu = 0)$.
Indeed, for $Y = 1/X$,
$$f_Y(y) = f_X(x(y))\left|\frac{dx(y)}{dy}\right| = \frac{1}{\pi}\cdot\frac{1}{1 + 1/y^2}\cdot\frac{1}{y^2} = \frac{1}{\pi}\cdot\frac{1}{1 + y^2}$$
So, if $X \sim \mathrm{Cauchy}(\mu = 0)$ then $Y = -1/X \sim \mathrm{Cauchy}(\mu = 0)$.
And if $X = \tan[\pi(U - 1/2)] \sim \mathrm{Cauchy}(\mu = 0)$, then $Y = -1/X = \tan(\pi U) \sim \mathrm{Cauchy}(\mu = 0)$
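The Cauchy quantile function translates directly into a sampler. The sketch below is illustrative only; the function name and the median µ = 2 are assumptions. Since a Cauchy variable has no mean, the check uses the sample median and the fraction of draws within distance 1 of µ, which should be about one half (the quartiles of the standard Cauchy are ±1).

```python
import math
import random
import statistics


def sample_cauchy(mu, rng):
    """Draw from a Cauchy distribution with median mu by inversion."""
    u = rng.random()
    return mu + math.tan(math.pi * (u - 0.5))  # Q_X(U), shifted by mu


rng = random.Random(5)
draws = [sample_cauchy(2.0, rng) for _ in range(100_000)]
median = statistics.median(draws)
inside = sum(1 for x in draws if abs(x - 2.0) <= 1.0) / len(draws)
print(median, inside)
```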
Example: Box-Muller transform for normal (1)
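Problem: normal CDF $F_Z(z) = \Phi(z)$ has no closed formula, working with quantile $Q_Z(U)$ is not possible, unless software provides detailed tables of $Q_Z(p)$
Solution: we go for two independent normal RV: $(Z_1, Z_2) \sim N_2(0, I_2)$, then we know:
• $Z_1^2 + Z_2^2 \sim \chi^2(2) = \exp(1/2)$
Indeed
1. $P(Z^2 < x) = P(-\sqrt{x} < Z < \sqrt{x}) = \Phi(\sqrt{x}) - \Phi(-\sqrt{x})$
Hence $f_{Z^2}(x) = \left[\phi(\sqrt{x}) + \phi(-\sqrt{x})\right]/(2\sqrt{x}) = e^{-x/2}/\sqrt{2\pi x}$
which is: $Z^2 \sim \chi^2(1) = \Gamma(1/2, 1/2)$
2. Let $Y = Z_1^2 + Z_2^2$, then
$$\begin{aligned}
f_Y(y) &= \int_0^y f_{Z_1^2}(z)\,f_{Z_2^2}(y-z)\,dz = \int_0^y \frac{e^{-z/2}}{\sqrt{2\pi z}}\cdot\frac{e^{-(y-z)/2}}{\sqrt{2\pi(y-z)}}\,dz \\
&= \frac{e^{-y/2}}{2\pi}\int_0^y \frac{dz}{\sqrt{z(y-z)}} = \frac{e^{-y/2}}{2\pi}\int_0^1 \frac{dt}{\sqrt{t(1-t)}} \\
&= \frac{e^{-y/2}}{2\pi}\,B(1/2, 1/2) = \frac{e^{-y/2}}{2\pi}\cdot\frac{\Gamma(1/2)\Gamma(1/2)}{\Gamma(1/2 + 1/2)} = e^{-y/2}/2
\end{aligned}$$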
Example: Box-Muller transform for normal (2)
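• $\sqrt{Z_1^2 + Z_2^2} \sim$ Rayleigh
Indeed, let $Y = \sqrt{Z_1^2 + Z_2^2}$, then
$$F_Y(y) = P(Y \le y) = P(Y^2 \le y^2) = 1 - e^{-y^2/2}$$
because $F(x) = 1 - e^{-\lambda x}$ is the CDF of the exponential distribution
• $Z_1/Z_2 \sim \mathrm{Cauchy}(\mu = 0)$
Indeed, let $X = Z_1/Z_2$, so, $Z_1 = XZ_2$, then
$$\begin{aligned}
F_X(x) = P(X \le x) &= \int_{-\infty}^{\infty} f_{Z_2}(z)\,P(X \le x|Z_2 = z)\,dz \\
&= \int_{-\infty}^{0} f_{Z_2}(z)\,P(Z_1 \ge zx)\,dz + \int_0^{\infty} f_{Z_2}(z)\,P(Z_1 \le zx)\,dz \\
&= 2\int_0^{\infty} f_{Z_2}(z)\,P(Z_1 \le zx)\,dz
\end{aligned}$$
and so,
$$f_X(x) = 2\int_0^{\infty} f_{Z_2}(z)\,f_{Z_1}(zx)\,z\,dz = \frac{2}{2\pi}\int_0^{\infty} z\,e^{-(1+x^2)z^2/2}\,dz = \frac{1}{\pi}\int_0^{\infty} \frac{e^{-u}\,du}{1 + x^2} = \frac{1}{\pi(1 + x^2)}$$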
Example: Box-Muller transform for normal (3)
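We propose to generate a Cauchy RV $X_1$ and an exponential RV $X_2 \sim \exp(1/2)$, using $X_1 = \tan(2\pi U_1)$ and $X_2 = -2\log(U_2)$
Note that $X_1 = \tan(2\pi U_1)$ is Cauchy with the same parameters as $X_1' = \tan(\pi U_1)$, since $\tan(\pi u)$ has period 1. We take $X_1 = \tan(2\pi U_1)$ instead of $X_1 = \tan(\pi U_1)$ for reasons explained below.
Then solve the system
$$\begin{cases} Z_2/Z_1 = X_1 = \tan(2\pi U_1) \\ Z_1^2 + Z_2^2 = X_2 = -2\log(U_2) = \log(1/U_2^2) \end{cases}$$
(the ratio $Z_2/Z_1$ is Cauchy as well, because the inverse of a $\mathrm{Cauchy}(\mu = 0)$ variable is again $\mathrm{Cauchy}(\mu = 0)$)
So suppose $U_1$ and $U_2$ are 2 independent, uniform r.v. on $[0, 1]$ and let
$Z_1 = \sqrt{\log(1/U_2^2)}\,\cos(2\pi U_1)$
$Z_2 = \sqrt{\log(1/U_2^2)}\,\sin(2\pi U_1)$.
Here $\sin(2\pi U_1)$ and $\cos(2\pi U_1)$ have the same distribution on $[-1, 1]$. This would not be the case with the angle $\pi U_1$, since $\sin(\pi U_1) \in [0, 1]$. This is why we take $X_1 = \tan(2\pi U_1)$.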
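The transform can be sketched as follows, with the angle taken from the first uniform draw and the radius from the second. The function name and the seed are assumptions for this example; the radius uses 1 − U rather than U only to keep the logarithm finite, which leaves the distribution unchanged.

```python
import math
import random
import statistics


def box_muller(rng):
    """Return a pair of independent N(0, 1) draws built from two uniform draws."""
    u1 = rng.random()            # drives the angle 2 * pi * U1
    u2 = 1.0 - rng.random()      # in (0, 1], avoids log(0)
    radius = math.sqrt(-2.0 * math.log(u2))  # sqrt(log(1 / U2^2))
    angle = 2.0 * math.pi * u1
    return radius * math.cos(angle), radius * math.sin(angle)


rng = random.Random(6)
pairs = [box_muller(rng) for _ in range(50_000)]
z1 = [p[0] for p in pairs]
z2 = [p[1] for p in pairs]
print(statistics.fmean(z1), statistics.variance(z1))
print(statistics.fmean(z2), statistics.variance(z2))
```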
Example: Box-Muller transform for normal (4)
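Doublecheck: we are given the $\mathbb{R}^2 \to \mathbb{R}^2$ transformation
$$Z = g(U) = \sqrt{\log(1/U_2^2)}\cdot\begin{bmatrix}\cos(2\pi U_1) \\ \sin(2\pi U_1)\end{bmatrix}.$$
The inverse $g^{-1}$:
$$\begin{cases} U_2 = e^{-\frac{1}{2}(Z_1^2 + Z_2^2)} \\ U_1 = \frac{1}{2\pi}\arctan\left(\frac{Z_2}{Z_1}\right). \end{cases}$$
Using $\frac{d}{dx}\arctan(x) = \frac{1}{1+x^2}$, we find $(Z_1, Z_2) \sim \mathrm{NID}(0, 1)$:
$$\begin{aligned}
f_{Z_1,Z_2}(z_1, z_2) &= f_{U_1,U_2}\left(\tfrac{1}{2\pi}\arctan\left(\tfrac{z_2}{z_1}\right),\, e^{-\frac{1}{2}(z_1^2+z_2^2)}\right)|J| \\
&= 1\cdot\left|\det\begin{bmatrix}\frac{\partial u_1}{\partial z_1} & \frac{\partial u_1}{\partial z_2} \\ \frac{\partial u_2}{\partial z_1} & \frac{\partial u_2}{\partial z_2}\end{bmatrix}\right| \\
&= \left|\det\begin{bmatrix}\frac{1}{2\pi}\cdot\frac{1}{1+(z_2/z_1)^2}\cdot\frac{-z_2}{z_1^2} & \frac{1}{2\pi}\cdot\frac{1}{1+(z_2/z_1)^2}\cdot\frac{1}{z_1} \\ e^{-\frac{1}{2}(z_1^2+z_2^2)}\cdot(-z_1) & e^{-\frac{1}{2}(z_1^2+z_2^2)}\cdot(-z_2)\end{bmatrix}\right| \\
&= \frac{1}{2\pi}\cdot e^{-\frac{1}{2}(z_1^2+z_2^2)}\cdot\frac{1}{1+(z_2/z_1)^2}\cdot\left(\frac{z_2^2}{z_1^2} + 1\right) \\
&= \frac{1}{\sqrt{2\pi}}e^{-z_1^2/2}\cdot\frac{1}{\sqrt{2\pi}}e^{-z_2^2/2}.
\end{aligned}$$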
1.2.2 Rejection sampling
Suppose we want to generate random numbers with density $f(x)$ and cumulative distribution $F(x)$.
Theorem
Let $X \sim g_X$ and $\forall x \in \mathbb{R}: f(x) \le M\cdot g_X(x)$ and let $U \sim \mathrm{uniform}[0, 1]$, independent from $X$, then
$$F(x) = P\left(X \le x \,\middle|\, U \le \frac{f(X)}{M\cdot g_X(X)}\right)$$
Rejection sampling: proof of the theorem
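$$F(x) = P\left(X \le x \,\middle|\, U \le \frac{f(X)}{M\cdot g_X(X)}\right)$$
Proof
$$\begin{aligned}
P\left(X \in A \;\&\; U \le \frac{f(X)}{M\cdot g_X(X)}\right) &= \int_A g_X(x)\cdot P\left(U \le \frac{f(X)}{M\cdot g_X(X)} \,\middle|\, X = x\right)dx \\
&= \int_A g_X(x)\cdot P\left(U \le \frac{f(x)}{M\cdot g_X(x)}\right)dx \\
&= \int_A g_X(x)\cdot\frac{f(x)}{M\cdot g_X(x)}\,dx \\
&= \int_A \frac{f(x)}{M}\,dx = \frac{1}{M}\int_A f(x)\,dx
\end{aligned}$$
Hence, if $A = \,]-\infty, x]$, then
$$\begin{aligned}
P\left(X \in A \,\middle|\, U \le \frac{f(X)}{M\cdot g_X(X)}\right) &= \frac{P\left(X \in A \;\&\; U \le \frac{f(X)}{M\cdot g_X(X)}\right)}{P\left(U \le \frac{f(X)}{M\cdot g_X(X)}\right)} = \frac{P\left(X \in A \;\&\; U \le \frac{f(X)}{M\cdot g_X(X)}\right)}{P\left(X \in \mathbb{R} \;\&\; U \le \frac{f(X)}{M\cdot g_X(X)}\right)} \\
&= \frac{\frac{1}{M}\int_A f(x)\,dx}{1/M} = \int_A f(x)\,dx = F(x)
\end{aligned}$$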
Algorithm
Situation and aim
We want a random number X with density function $f(x)$. We have no expression for $F(x)$ that we can invert. We can generate numbers according to a different law $g_X(x)$ and we know that $f(x) \le M\cdot g_X(x)$, for all values of x.
Pseudo-code
continue-search = TRUE
While continue-search
• Generate X ∼ gX
• Generate U ∼ uniform[0, 1]
• If U ≤ f(X)/[M · gX(X)]
then continue-search = FALSE
The output is X
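The pseudo-code above can be sketched in Python as a generic routine. The target used to exercise it (a Beta(2, 2) density with a uniform proposal and M = 1.5) is a hypothetical choice for illustration and does not appear in the slides.

```python
import random
import statistics


def rejection_sample(f, g, sample_g, M, rng):
    """Draw one value with density f, using proposals from g with f(x) <= M * g(x)."""
    while True:
        x = sample_g()      # X ~ g_X
        u = rng.random()    # U ~ uniform[0, 1]
        if u <= f(x) / (M * g(x)):
            return x        # accepted: stop the search


rng = random.Random(7)


def f(x):
    return 6.0 * x * (1.0 - x)  # hypothetical target: Beta(2, 2) density on [0, 1]


def g(x):
    return 1.0  # density of the uniform[0, 1] proposal


M = 1.5  # maximum of f on [0, 1], so f(x) <= M * g(x) everywhere
samples = [rejection_sample(f, g, rng.random, M, rng) for _ in range(20_000)]
print(statistics.fmean(samples), statistics.variance(samples))
```

On average one proposal in M is accepted, so the closer M is to 1, the fewer draws are wasted.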
How to choose gX(x)?
• $X \sim g_X$ should be easy to generate
• gX(x) should be as close as possible to f(x), such that M can be close
to 1, and rejection probabilities are low. Otherwise, computational efforts
increase.
• Some combinations don’t work: for instance, one can never generate Cauchy
variables by rejection sampling applied to normal variables, simply because
there is no M satisfying the condition.
Example 1: generating Gamma-distributed r.v.
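Let $X \sim \mathrm{Gamma}(\lambda, \alpha)$, i.e., $f_X(x) = \frac{x^{\alpha-1}\lambda^{\alpha}e^{-\lambda x}}{\Gamma(\alpha)}$
If $\alpha$ is integer, then we can write $X = \sum_{i=1}^{\alpha} X_i$ with independent $X_i \sim \exp(\lambda)$
If $\alpha$ is not integer, denote $\delta = \alpha - \lfloor\alpha\rfloor$ and $r = \lfloor\alpha\rfloor$
Then we can decompose or generate X as
$X = (X_r + X_\delta)/\lambda$ with $X_r \sim \mathrm{Gamma}(1, r)$ and $X_\delta \sim \mathrm{Gamma}(1, \delta)$ and both independent.
We can generate $X_r$ as sum of exponentials, but for $X_\delta$, the quantile method does not work, so we need another direction.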
Generating Gamma values with small α
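The density function of $X \sim \mathrm{Gamma}(1, \delta)$ is $f_X(x) = \frac{x^{\delta-1}e^{-x}}{\Gamma(\delta)}$
It is depicted below for $\delta = 0.23$
[Figure: the density $f_X(x)$ for $\delta = 0.23$, with x from 0 to 6 and vertical axis from 0 to 15]
Not straightforward to bound by some $M\cdot g_X(x)$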
A mixture distribution as upper bound
We will use a mixture distribution. Suppose
V ∼ uniform[0, 1]
S = χ[0,p](V ) for some value p
X1 ∼ g1 with g1(x) = δ · xδ−1 on [0, 1]
X2 − 1 ∼ exp(1), hence g2(x) = e−(x−1) = e · e−x on [1,∞[
X = S ·X1 + (1− S) ·X2
In other words, X = X1 with probability p and X = X2 with probability 1− p.
Therefore, generate two uniform RV: V and W. If $V < p$, then $X = Q_{X_1}(W)$, otherwise $X = Q_{X_2}(W)$.
In one formula: $X = I(V < p)\,Q_{X_1}(W) + I(V \ge p)\,Q_{X_2}(W)$
where $Q_{X_1}(W) = W^{1/\delta}$ and $Q_{X_2}(W) = 1 - \log(1 - W)$, which follows from the exponential quantile function of Example 1 in section 1.2.1 (slide 30)
Remark We can generate W and $S = I(V < p)$ both from V in a mutually independent way. In particular, let $W = SV/p + (1-S)(1-V)/(1-p)$, then W is independent from S.
The mixture distribution and density
The cumulative distribution of X, denoted as $G_X(x)$, is then
$$\begin{aligned}
G_X(x) &= P(S = 0)\cdot P(X \le x|S = 0) + P(S = 1)\cdot P(X \le x|S = 1) \\
&= (1-p)\cdot P(X_2 \le x|S = 0) + p\cdot P(X_1 \le x|S = 1) \\
&= (1-p)\cdot G_2(x) + p\cdot G_1(x)
\end{aligned}$$
and from there $g_X(x) = p\cdot g_1(x) + (1-p)\cdot g_2(x)$
In our case $g_X(x) = p\cdot\delta\cdot x^{\delta-1}\cdot\chi_{[0,1]}(x) + (1-p)\cdot e\cdot e^{-x}\cdot\chi_{[1,\infty[}(x)$
Optimizing the parameters in the function gX(x)
The value of p can be chosen to minimize the number of rejections, i.e., to minimize M .
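• We need that $M\cdot g_X(x) \ge f_X(x)$
• For $x \in [0, 1]$, this means that
$$M\cdot p\cdot\delta\cdot x^{\delta-1} \ge \frac{e^{-x}\cdot x^{\delta-1}}{\Gamma(\delta)} \Leftrightarrow M \ge \frac{e^{-x}}{p\,\delta\,\Gamma(\delta)}$$
The maximum in the right hand side is reached if $x = 0$, hence $M \ge \frac{1}{p\,\delta\,\Gamma(\delta)}$
• For $x \ge 1$, this becomes
$$M\cdot(1-p)\cdot e\cdot e^{-x} \ge \frac{e^{-x}\cdot x^{\delta-1}}{\Gamma(\delta)} \Leftrightarrow M \ge \frac{x^{\delta-1}}{(1-p)\,e\,\Gamma(\delta)}$$
The maximum in the right hand side is reached if $x = 1$, hence $M \ge \frac{1}{(1-p)\,e\,\Gamma(\delta)}$
• The minimum M can be obtained if both lower bounds for M are equal, i.e., if
$$p\,\delta\,\Gamma(\delta) = (1-p)\,e\,\Gamma(\delta) \Leftrightarrow p = \frac{e}{e + \delta}$$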
The resulting algorithm
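We have $M = \frac{1}{p\,\delta\,\Gamma(\delta)} = \frac{e + \delta}{e\,\delta\,\Gamma(\delta)}$
So, for $x \in [0, 1]$, we find
$$\frac{f_X(x)}{M\cdot g_X(x)} = \frac{e^{-x}\cdot x^{\delta-1}/\Gamma(\delta)}{M\cdot p\cdot\delta\cdot x^{\delta-1}} = e^{-x}$$
which is smaller than 1, and for $x > 1$, we find
$$\frac{f_X(x)}{M\cdot g_X(x)} = \frac{e^{-x}\cdot x^{\delta-1}/\Gamma(\delta)}{M\cdot(1-p)\cdot e^{1-x}} = x^{\delta-1}$$
which is smaller than 1 because $\delta - 1$ is negative.
While search == true,
• Generate independent U, V, W ∼ unif([0, 1])
• if $V < p = e/(e + \delta)$,
then
– $X = W^{1/\delta}$
– If $U < e^{-X}$, then search ← false
else
– $X = 1 - \log(W)$ (so that $X - 1 \sim \exp(1)$, as required by $g_2$ on $[1, \infty[$)
– If $U < X^{\delta-1}$, then search ← false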
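The algorithm can be sketched in Python as below. The function name, the seed and the use of δ = 0.23 are choices for this example. The exponential branch is written with the shift by 1, so that the proposal lives on [1, ∞[ as the mixture density requires. A Gamma(1, δ) variable has mean δ and variance δ, which gives a simple sanity check.

```python
import math
import random
import statistics


def sample_gamma_small_shape(delta, rng):
    """Draw from Gamma(1, delta), 0 < delta < 1, by rejection from the mixture proposal."""
    p = math.e / (math.e + delta)  # optimal mixture weight
    while True:
        u = rng.random()
        v = rng.random()
        w = 1.0 - rng.random()  # in (0, 1], keeps log(w) finite
        if v < p:
            x = w ** (1.0 / delta)   # proposal on [0, 1] with density delta * x^(delta - 1)
            if u < math.exp(-x):     # acceptance ratio e^(-x)
                return x
        else:
            x = 1.0 - math.log(w)    # proposal on [1, inf): X - 1 ~ exp(1)
            if u < x ** (delta - 1.0):  # acceptance ratio x^(delta - 1)
                return x


rng = random.Random(8)
delta = 0.23
draws = [sample_gamma_small_shape(delta, rng) for _ in range(50_000)]
print(statistics.fmean(draws), statistics.variance(draws))  # both should be near delta
```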
Example 2: computing the integral on page 25
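In Bayesian inference with a Cauchy prior and normal errors, we have to compute a ratio of the form
$$r = \frac{\int_{-\infty}^{\infty} x^m\cdot\frac{1}{b\pi\cdot\left[1 + \left(\frac{x-a}{b}\right)^2\right]}\cdot\frac{1}{\sqrt{2\pi}\sigma}\cdot e^{-(x-\mu)^2/2\sigma^2}\,dx}{\int_{-\infty}^{\infty} \frac{1}{b\pi\cdot\left[1 + \left(\frac{x-a}{b}\right)^2\right]}\cdot\frac{1}{\sqrt{2\pi}\sigma}\cdot e^{-(x-\mu)^2/2\sigma^2}\,dx}$$
Using rejection sampling, we will generate data X from a distribution proportional to
$$f_X(x) = K\cdot\frac{1}{1 + \left(\frac{x-a}{b}\right)^2}\cdot g_X(x),$$
where $g_X(x)$ is the normal density function.
It then holds that $r = E(X^m)$
We can draw observations from X even if we know $f_X(x)$ only up to a constant. (see next slide)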
Example 2: drawing observations without knowing the normalisation constant
We can draw observations from X even if we know $f_X(x)$ only up to a constant.
Indeed, let $f_X(x) = K\cdot f(x)$ with K unknown
$$\frac{f_X(x)}{M\cdot g_X(x)} = \frac{K\cdot f(x)}{M\cdot g_X(x)}$$
Herein $f(x)$ and $g_X(x)$ are known. If $f(x)/g_X(x)$ is known to be bounded by C, then take $M \ge KC$.
In the example above
$$\frac{f_X(x)}{M\cdot g_X(x)} = \frac{K\cdot g_X(x)\cdot\frac{1}{1+\left(\frac{x-a}{b}\right)^2}}{M\cdot g_X(x)} = \frac{K}{M\left[1 + \left(\frac{x-a}{b}\right)^2\right]}$$
Take $M = K$, then the result is bounded by 1. (Note that $M = K$ remains unknown)
So generate $X \sim g_X$ and $U \sim \mathrm{uniform}[0, 1]$, then accept X if $U \le \frac{1}{1+\left(\frac{X-a}{b}\right)^2}$
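A sketch of this sampler follows. A draw from the normal density is accepted when the uniform draw falls below the Cauchy-shaped factor, so the unknown constant K never has to be computed. The parameter values (µ = 1, σ = 0.5, a = 0, b = 1, m = 1) and the function name are hypothetical choices for illustration.

```python
import random


def sample_target(mu, sigma, a, b, rng):
    """Draw from the density proportional to g_X(x) / (1 + ((x - a) / b)^2), g_X = N(mu, sigma^2)."""
    while True:
        x = rng.gauss(mu, sigma)  # proposal X ~ g_X
        u = rng.random()          # U ~ uniform[0, 1]
        if u <= 1.0 / (1.0 + ((x - a) / b) ** 2):
            return x              # accepted


rng = random.Random(9)
m = 1  # power in the ratio r = E(X^m)
draws = [sample_target(1.0, 0.5, 0.0, 1.0, rng) for _ in range(20_000)]
r_hat = sum(x ** m for x in draws) / len(draws)
print(r_hat)  # Monte Carlo estimate of the ratio r
```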
2. Markov Chain Monte Carlo Methods
• Monte Carlo Methods are based on independent sampling, law of large
numbers, central limit theorem
• Independent sampling may be difficult to realize, especially when we sam-
ple from a large dimensional vector X
• Markov Chain Monte Carlo (MCMC) Methods simulate a sequence of
dependent observations
2.1 Markov Chains
Discrete time Markov Chain←→ continuous time Markov Chain
We consider discrete time MC
Discrete state space MC←→ general state space MC
A discrete state space MC is a sequence of RV’s $(X_n; n \in \mathbb{N})$ for which $X_n \in E$. The state space E is countable and thus homomorphic with $\mathbb{Z}$. (We can take $E = \mathbb{Z}$.) The sequence satisfies the Markov condition, i.e.,
$$P(X_{n+1} = j|X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = P(X_{n+1} = j|X_n = i_n)$$
Define $P^{(n)}_{ij} = P(X_{n+1} = j|X_n = i)$
The Markov Chain is stationary or homogeneous if $P^{(n)}_{ij}$ does not depend on n. We can write $P_{ij} = P(X_{n+1} = j|X_n = i)$
The matrix with elements $P_{ij}$ is called the transition matrix
Irreducibility
(We further assume stationary MC, unless otherwise stated)
n-step Transitions
If P is the transition matrix of a discrete space Markov process, then
$$P(X_{m+n} = j|X_m = i) = (P^n)_{ij}$$
Accessibility
A state j is accessible from a state i if $\exists n \in \mathbb{N}$, such that $(P^n)_{ij} > 0$. We denote $i \to j$
Two states are communicating if they are mutually accessible from each
other. We denote i↔ j
If all states communicate, the MC is said to be irreducible
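For a finite state space (an assumption of this sketch, since the slides allow countable state spaces), n-step transition probabilities and irreducibility can be checked with a few lines of plain Python. The three-state matrix and the helper names are hypothetical; accessibility is computed as the transitive closure of the relation $P_{ij} > 0$.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def n_step(P, n):
    """Return P^n, whose (i, j) entry is P(X_{m+n} = j | X_m = i)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def is_irreducible(P):
    """True if every state is accessible from every state (finite chain)."""
    size = len(P)
    reach = [[P[i][j] > 0 for j in range(size)] for i in range(size)]
    for k in range(size):  # Warshall's transitive closure
        for i in range(size):
            for j in range(size):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return all(reach[i][j] for i in range(size) for j in range(size))


P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
P2 = n_step(P, 2)
print(P2[0][2])            # two-step probability of going from state 0 to state 2
print(is_irreducible(P))   # all states communicate in this example
```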
Period
The smallest $d_i$ for which $(P^{d_i})_{ii} > 0$ is called the period of state i. It follows that $(P^n)_{ii} > 0 \Leftrightarrow n = k\cdot d_i$ with $k \in \mathbb{N}$
If $i \leftrightarrow j$, then $d_i = d_j$
Corollary An irreducible MC has the same period for all its states.
Proof
$\exists n \in \mathbb{N}$ for which $(P^n)_{ij} > 0$ and $\exists m \in \mathbb{N}$ for which $(P^m)_{ji} > 0$
Now suppose that $(P^r)_{ii} > 0$, then $(P^{m+r+n})_{jj} \ge (P^m)_{ji}\cdot(P^r)_{ii}\cdot(P^n)_{ij} > 0$, hence we know that $r = k_i\cdot d_i$ and $n + r + m = k_j'\cdot d_j$.
We also have $(P^{2r})_{ii} \ge (P^r)_{ii}\cdot(P^r)_{ii} > 0$, hence $n + 2r + m = k_j''\cdot d_j$, and so $r = (n + 2r + m) - (n + r + m) = (k_j'' - k_j')\cdot d_j = k_j\cdot d_j$.
So, any $r = k_i\cdot d_i$ can be written as $r = k_j\cdot d_j$
A similar argument leads to the conclusion that any $r = k_j\cdot d_j$ can be written as $r = k_i\cdot d_i$.
This is only possible if $d_j = d_i$
Transient states
Denote by $T_{ii}$ the (random) time of the first revisit: the first $n > 0$ so that $X_n = i$ given that $X_0 = i$
We know that $T_{ii} = k\cdot d_i$ and $P(T_{ii} = k\cdot d_i) > 0$.
Denote $V_{ii} = \sum_{k=1}^{\infty} I(X_{k\cdot d_i} = i)|\{X_0 = i\} = \sum_{n=1}^{\infty} I(X_n = i)|\{X_0 = i\}$
(with $I(A)$ the indicator function of event A)
A state i is transient if $E(V_{ii}) < \infty$, that is, if an infinite number of steps in the Markov Chain leads at most to a finite number of visits to state i.
$$E(V_{ii}) = \sum_{n=1}^{\infty} E(I(X_n = i)|X_0 = i) = \sum_{n=1}^{\infty} P(X_n = i|X_0 = i)$$
So
$$E(V_{ii}) < \infty \Leftrightarrow \sum_{n=1}^{\infty} P(X_n = i|X_0 = i) < \infty$$
This is equivalent to $P(T_{ii} < \infty) < 1$
Transient states have a limited expected number of visits
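$E(V_{ii}) < \infty \Rightarrow P(T_{ii} < \infty) < 1$
Proof
Suppose that $P(T_{ii} < \infty) = 1$, and denote by $T^{(r)}_{ii}$ the number of steps until the rth occurrence of state i. Then, because of the Markov condition, $T^{(r)}_{ii} = \sum_{\ell=1}^{r} T_{ii,\ell}$ with $T_{ii,\ell}$ IID observations from $T_{ii}$.
$$P(T^{(r)}_{ii} < \infty) = P\left(\bigcap_{\ell=1}^{r} (T_{ii,\ell} < \infty)\right) = \prod_{\ell=1}^{r} P(T_{ii,\ell} < \infty) = 1 \text{ for any finite } r.$$
So $V_{ii} \ge r$, a.s. for any $r \in \mathbb{N}$, hence $E(V_{ii}) = \infty$. □
For a transient state, this implies that $\mu_{ii} = E(T_{ii}) = \infty$ but the opposite does not hold (see next slide).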
Recurrent states
A state is called recurrent if it is not transient, i.e., if it is visited an infinite
number of times.
If the expected time until the first visit is infinite, i.e., if $\mu_{ii} = E(T_{ii}) = \infty$, then the state is called null-recurrent, otherwise it is called positive or ergodic.
A null-recurrent state is visited an infinite number of times, but the relative number of visits tends to zero: $E(V_{ii}) = \sum_{n=1}^{\infty} (P^n)_{ii} = \infty$ and $\frac{1}{N}\sum_{n=1}^{N} (P^n)_{ii} \to 0$
A positive state has $\frac{1}{N}\sum_{n=1}^{N} (P^n)_{ii} \to \frac{1}{\mu_{ii}}$
Proof (as of yet incomplete)
We prove that in a positive state, it holds that
$$\frac{1}{N}\sum_{n=1}^{N} P(X_n = i|X_0 = i) \to \frac{1}{E(T_{ii})}$$
• Law of total probability + Markov condition for $n > 0$:
$$P(X_n = i|X_0 = i) = \sum_{k=1}^{n} P(X_n = i|X_k = i)\cdot P(T_{ii} = k) = \sum_{k=1}^{n} P(X_{n-k} = i|X_0 = i)\cdot P(T_{ii} = k)$$
Also, $P(X_n = i|X_0 = i) = 1$ for $n = 0$.
• If we define $t_k = P(T_{ii} = k)$, with $t_0 = 0$, and $p_n = P(X_n = i|X_0 = i)$, then we have
$p_n = \sum_{k=1}^{n} p_{n-k}\cdot t_k = \sum_{k=0}^{n} p_{n-k}\cdot t_k$ for $n > 0$ and $p_0 = 1 \ne p_0\cdot t_0 = 0$.
• Denoting $t = (t_k, k \in \mathbb{N})$ and $p = (p_n, n \in \mathbb{N})$, then the sum above is the convolution of the sequences t and p: $t * p$. Since the expression does not hold for $n = 0$, we have to correct with a Kronecker sequence $\delta_0 = (1, 0, 0, 0, \ldots)$. We get: $p = t * p + \delta_0$
• Denote $a(s) = \sum_{k=0}^{\infty} a_k s^k$, then the equation above becomes $p(s) = p(s)\cdot t(s) + 1 \Leftrightarrow p(s) = \frac{1}{1 - t(s)}$
• Since $t(1) = \sum_{k=1}^{\infty} P(T_{ii} = k) = P(T_{ii} < \infty)$, a recurrent Markov process has a singularity for $p(s)$ in $s = 1$. Further, $\lim_{s\to 1} (1-s)\cdot p(s) = \lim_{s\to 1} \frac{1-s}{1-t(s)} = \frac{1}{t'(1)}$ and $t'(1) = \sum_{k=1}^{\infty} k\cdot P(T_{ii} = k) = E(T_{ii}) = \mu_{ii}$
• On the other hand,
$$\begin{aligned}
\lim_{s\to 1} (1-s)\cdot p(s) &= \lim_{u\to\infty} \frac{1}{u}\cdot p(1 - 1/u) = \lim_{n\to\infty} \frac{1}{n}\cdot\sum_{k=0}^{\infty} p_k (1 - 1/n)^k \\
&= \lim_{n\to\infty} \frac{1}{n}\cdot\sum_{k=0}^{n} p_k + \lim_{n\to\infty} \frac{1}{n}\cdot\sum_{k=0}^{n} p_k\left[(1 - 1/n)^k - 1\right] + \lim_{n\to\infty} \frac{1}{n}\cdot\sum_{k=n+1}^{\infty} p_k (1 - 1/n)^k
\end{aligned}$$
• ...
Equilibrium distribution
Theorem In an irreducible discrete time, discrete state space MC the states
are either all transient, all null-recurrent, or all positive (ergodic).
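All irreducible finite-state MCs are positive.
Denote $p_{n,i} = P(X_n = i)$, and the row vector $p_n = (\ldots, p_{n,i}, \ldots)$, then
$$p_{n+1} = p_n \cdot P, \quad\text{i.e.,}\quad p_{n+1,i} = \sum_{\ell \in \mathbb{Z}} P_{\ell i}\, p_{n,\ell}$$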
P · 1 = 1 (because transition probabilities sum to one)
λ = 1 is an eigenvalue
The left eigenvector is an invariant or stationary or equilibrium distribution
p · P = p
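The eigenvector characterisation above can be tried out numerically. The following is a minimal sketch, assuming a small hypothetical 3-state transition matrix (not taken from the slides): it computes the left eigenvector for eigenvalue 1 with NumPy and also averages the return probabilities $(P^n)_{ii}$, which for a positive state should approach $1/\mu_{ii}$, i.e., the stationary probability.

```python
import numpy as np

def stationary_distribution(P):
    """Return the row vector p with p @ P = p and sum(p) = 1."""
    # Left eigenvectors of P are right eigenvectors of P.T.
    eigvals, eigvecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(eigvals - 1.0))   # eigenvalue closest to 1
    p = np.real(eigvecs[:, k])
    return p / p.sum()

# Hypothetical 3-state chain (each row sums to one).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
p = stationary_distribution(P)

# Average of the return probabilities (P^n)_ii over n = 1..N.
N = 2000
Pn = np.eye(3)
avg_return = np.zeros(3)
for _ in range(N):
    Pn = Pn @ P
    avg_return += np.diag(Pn)
avg_return /= N

print("stationary p:", p)
print("average return probabilities:", avg_return)
```

For an irreducible finite chain, both vectors should agree up to a term of order $1/N$.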
Reversed Markov Processes
If $(X_n; n = 0, \ldots)$ is a Markov Chain with transition matrix $P$ and equilibrium
distribution $p$, then
$$P(X_n = j | X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m) = P(X_n = j | X_{n+1} = i) = \frac{p_j}{p_i} \cdot P_{ji}$$
Proof (Bayes, Chain rule for conditional probabilities and Markov Condition)
$$\begin{aligned}
& P(X_n = j | X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m) \\
&= \frac{P(X_n = j) \cdot P(X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m | X_n = j)}{P(X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m)} \\
&= \frac{P(X_n = j) \cdot P(X_{n+1} = i | X_n = j) \cdot P(X_{n+2} = i_2, \ldots, X_{n+m} = i_m | X_{n+1} = i)}{P(X_{n+1} = i) \cdot P(X_{n+2} = i_2, \ldots, X_{n+m} = i_m | X_{n+1} = i)} \\
&= \frac{P(X_n = j) \cdot P(X_{n+1} = i | X_n = j)}{P(X_{n+1} = i)} = P(X_n = j | X_{n+1} = i) = \frac{p_j \cdot P_{ji}}{p_i}
\end{aligned}$$
Reversible Markov Processes
If there exists a distribution $p_i = P(X = i)$ that satisfies
$$\frac{p_j \cdot P_{ji}}{p_i} = P_{ij}$$
then the Markov chain is called reversible
Remark Reversibility thus means not that the reversed Markov process exists (it al-
ways exists), but that its transition probabilities for i → j are the same as the forward
probabilities for the same transitions i→ j (so NOT for j → i)
The distribution is then the equilibrium distribution. Indeed, from summation
of $p_j \cdot P_{ji} = p_i P_{ij}$ over $j$ we obtain
$$\sum_j p_j \cdot P_{ji} = p_i \sum_j P_{ij} = p_i,$$
which is, in matrix form, $p \cdot P = p$, the invariant distribution equation.
The reversed process then has the same transition probabilities as the forward process.
The condition $\frac{p_j \cdot P_{ji}}{p_i} = P_{ij}$ is called the detailed balance equation (since it
implies the “global” balance equation)
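As an illustration of the difference between "reversed" and "reversible", the sketch below builds the reversed transition matrix $Q_{ij} = p_j P_{ji} / p_i$ for two hypothetical chains (both matrices are made up for this example): a birth-death chain, for which $Q = P$, and a chain with a preferred direction of rotation, for which the reversed chain exists but differs from the forward one.

```python
import numpy as np

def stationary(P):
    """Left eigenvector of P for eigenvalue 1, normalised to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return p / p.sum()

def reversed_chain(P, p):
    """Q[i, j] = p[j] * P[j, i] / p[i]: transition i -> j of the reversed chain."""
    return (P.T * p[np.newaxis, :]) / p[:, np.newaxis]

# Birth-death chain: only moves between adjacent states (reversible).
P_bd = np.array([[0.50, 0.50, 0.00],
                 [0.25, 0.50, 0.25],
                 [0.00, 0.50, 0.50]])
# Chain that mostly rotates 0 -> 1 -> 2 -> 0 (not reversible).
P_cyc = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.8, 0.1, 0.1]])

Q_bd = reversed_chain(P_bd, stationary(P_bd))
Q_cyc = reversed_chain(P_cyc, stationary(P_cyc))

print("birth-death reversible:", np.allclose(Q_bd, P_bd))
print("rotating chain reversible:", np.allclose(Q_cyc, P_cyc))
```

In both cases the rows of $Q$ sum to one, so the reversed process is a proper Markov chain; only in the first case does it satisfy detailed balance.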
2.2 Models for multivariate random variables
1. In the next slides we consider vectors X with multivariate distributions.
We discuss two ways to define/fix any multivariate distribution
• Markov Random Field (MRF), which is a special case of a graphical model
• Gibbs Random Field (GRF)
2. The Markov property (dependence through adjacency) plays a role both
on the level of the sampling process and on the level of the sampled multivariate
random variable: Markov Chains for the sampling, Markov Random
Fields for the sampled variable
2.2.1 Markov Random Field (MRF)
Given a multivariate random variable X, a graphical model can be used to
represent the intra-dependencies.
An undirected graph is an ordered pair of sets $G = (V, E)$, where $V = \{1, \ldots, p\}$
is the set of vertices, sites or nodes, which are here indices into X.
The set $E$ contains the (undirected) edges in the graph, where an undirected
edge is an unordered pair of vertices.
In a Markov Random Field, two vertices $i$ and $j$ are connected by an edge
if and only if the corresponding components of X are conditionally dependent,
i.e., dependent given all the other components’ values:
$$P\left(X_i = x_i \,\big|\, \{X_1, \ldots, X_p\} \setminus \{X_i\}\right) \neq P\left(X_i = x_i \,\big|\, \{X_1, \ldots, X_p\} \setminus \{X_i, X_j\}\right)$$
The two sites are then called neighbours.
A neighbourhood of $i$ is defined as $\partial i = \{j \,|\, \{i, j\} \in E\}$
Formally, denoting by $2^V$ all subsets of $V$, we have
$$\partial : V \to 2^V : i \mapsto \partial i = \{j \,|\, \{i, j\} \in E\}$$
Markov property: it holds that
$$P\left(X_i = x_i \,\big|\, X_{V \setminus \{i\}}\right) = P\left(X_i = x_i \,\big|\, X_{\partial i}\right)$$
Examples of MRFs
• In principle any multidimensional probability distribution can be seen as a
MRF. In general, all components are conditionally dependent, so
$\partial i = \{1, \ldots, p\} \setminus \{i\}$
• A (finite sample from a) Markov Chain is also a MRF. Indeed, (thanks to the
notion of reversed Markov Processes) $\partial i = \{i - 1, i + 1\}$
  – Forward Markov Chain: • → • → • → •
  – Reversed MC: • ← • ← • ← •
  – MRF representation: • − • − • − •
• A two-dimensional MRF:
    • − • − • − •
    |   |   |   |
    • − • − • − •
    |   |   |   |
    • − • − • − •
    |   |   |   |
    • − • − • − •
  – Dimension of random vector X is p = 16
  – X has a 2D-geometric background
  – Components of X can be represented with a 2D index: $X_s = X_{(i,j)}$
A short note on graphical models
Markov Random Fields are an example of graphical models
Graphical models are used to define or represent multivariate random vari-
ables X
MRF are undirected graphs, edges define neighbourhoods ∂i
MC (Markov Chains) are an example of Bayesian networks: directed, acyclic
graphs: Edges define parents of nodes par(i)
The construction of the joint probability in a directed graph is immediate
$$f_X(x) = \prod_{i=1}^{p} f_{X_i | X_{par(i)}}(x_i | x_{par(i)})$$
• when par(i) = Ø, then the conditional distribution should be interpreted as
the marginal distribution
• The construction is always possible because the graph is acyclic
• For MRF’s/undirected graphs, the joint pdf/pmf is not so straightforward, we
need the concept of Gibbs Random Fields (see slide 65)
Example of modelling by Bayesian networks
Let X = (X1, X2, X3), then
• the graph • ← • → • represents the situation where X1 and X3 are dependent,
but, given the value of X2, they are (conditionally) independent.
The dependence occurs through X2
• the graph • → • ← • represents the situation where X1 and X3 are independent,
but X2 depends on both. If X2 is observed, this gives information
on both X1 and X3, so X1 and X3 are conditionally dependent. (By observation
of X2, we learn about both X1 and X3)
These models are used, for instance, in studies of causality, and are popular
in several (other) domains of statistical learning
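The second graph (• → • ← •) can be checked on a toy example. The sketch below assumes three binary variables with hypothetical probability tables (all numbers are invented for illustration), builds the joint pmf from the directed factorisation $f_{X_1} \cdot f_{X_3} \cdot f_{X_2|X_1,X_3}$, and tests independence of $X_1$ and $X_3$ with and without conditioning on $X_2$.

```python
import itertools
import numpy as np

# Hypothetical tables for the graph X1 -> X2 <- X3 (binary variables).
p1 = np.array([0.7, 0.3])                 # P(X1 = 0), P(X1 = 1)
p3 = np.array([0.6, 0.4])                 # P(X3 = 0), P(X3 = 1)
p2_is_one = np.array([[0.05, 0.80],       # P(X2 = 1 | X1 = x1, X3 = x3),
                      [0.80, 0.95]])      # indexed as [x1, x3]

# Joint pmf from the product of parent-conditional distributions.
joint = np.zeros((2, 2, 2))               # indexed as [x1, x2, x3]
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    q = p2_is_one[x1, x3]
    joint[x1, x2, x3] = p1[x1] * p3[x3] * (q if x2 == 1 else 1.0 - q)

def is_product(table):
    """True if a 2D pmf equals the outer product of its marginals."""
    return np.allclose(table, np.outer(table.sum(axis=1), table.sum(axis=0)))

marginal_13 = joint.sum(axis=1)                       # pmf of (X1, X3)
given_x2 = joint[:, 1, :] / joint[:, 1, :].sum()      # pmf of (X1, X3) given X2 = 1

print("X1, X3 independent:", is_product(marginal_13))
print("X1, X3 independent given X2 = 1:", is_product(given_x2))
```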
2.2.2 Gibbs Random Field (GRF)
Let X be a multivariate random variable of dimension p, and let E be a set of
edges defined on $V = \{1, \ldots, p\}$.
Unlike in MRF, the edges in a GRF are not defined on the basis of a conditional
probability. They are used to define the global probability, as follows:
A clique (or complete subset) is defined as
$$C \subset V \text{ is a clique} \Leftrightarrow \forall i \in C : C \subset \{i\} \cup \partial i$$
The set of cliques is denoted as $\mathcal{C} = \{C \subset V \,|\, \forall i \in C : C \subset \{i\} \cup \partial i\}$
A probability distribution that can be decomposed into factors associated with
the cliques is called a Gibbs Random Field (GRF)
$$f_X(x) \text{ is a GRF} \Leftrightarrow f_X(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} f_C(x_C) = \frac{1}{Z} \exp\left(-\sum_{C \in \mathcal{C}} H_C(x_C)\right)$$
The functions $H_C(x_C)$ are (up to a constant) the negative logarithms of $f_C(x_C)$. They are
called clique potentials. The normalizing constant Z is called a partition
function.
Gibbs Random Field - further discussion
Use of GRF’s
• GRF’s can be used to define a joint probability on an undirected graph
• MRF’s represent local, conditional probabilities
• The Hammersley-Clifford theorem (slide 69) establishes the connection between GRF’s and MRF’s
Examples of GRF’s
• In principle any multidimensional probability distribution can be seen as
a GRF. In general, all components are conditionally dependent, and the
cliques are all subsets of V . All clique potentials are zero, except for C = V ,
whose potential is HV (x) = − log(fX(x)).
• Ising model (see slide 67)
Example of GRF: Ising model
A two dimensional lattice $\{(i, j) \,|\, 0 \leq i \leq m, 0 \leq j \leq n\}$ (see slide 62) can be
equipped with a neighbourhood system by defining for each internal site
$$\partial(i, j) = \{(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)\}$$
The cliques are then singletons and (horizontal and vertical) pairs of sites
$$\mathcal{C} = \big\{\{(i, j)\}\big\} \cup \big\{\{(i, j), (i + 1, j)\}\big\} \cup \big\{\{(i, j), (i, j + 1)\}\big\}$$
In the case where the observations are binary, say $X_{(i,j)} \in \{-1, 1\}$, a popular
GRF model is the Ising model
$H_C(x_C) = \tau \cdot x_{C,1} \cdot x_{C,2}$ for the pairs and $H_s(x_s) = \gamma \cdot x_s$ for the singletons.
The pair’s potentials express the interaction between adjacent sites, while the
singleton potentials express a drift towards one of the two states.
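To make the role of the clique potentials and the partition function concrete, here is a minimal sketch that evaluates the total Ising energy on a small lattice and normalises $\exp(-H(x))$ by brute-force enumeration. The lattice size and the values of $\tau$ and $\gamma$ are arbitrary choices for the example, and the function names are hypothetical. With the sign convention $\exp(-H)$ used on these slides, a negative $\tau$ favours equal neighbours.

```python
import itertools
import numpy as np

def ising_energy(x, tau, gamma):
    """Sum of all clique potentials for a 2D array x with entries in {-1, +1}."""
    vertical = np.sum(x[:-1, :] * x[1:, :])      # pairs {(i, j), (i + 1, j)}
    horizontal = np.sum(x[:, :-1] * x[:, 1:])    # pairs {(i, j), (i, j + 1)}
    return tau * (vertical + horizontal) + gamma * np.sum(x)

def ising_pmf_bruteforce(shape, tau, gamma):
    """Enumerate all configurations; only feasible for very small lattices."""
    m, n = shape
    configs = [np.array(c).reshape(m, n)
               for c in itertools.product([-1, 1], repeat=m * n)]
    weights = np.array([np.exp(-ising_energy(x, tau, gamma)) for x in configs])
    Z = weights.sum()                            # partition function
    return configs, weights / Z, Z

configs, probs, Z = ising_pmf_bruteforce((2, 3), tau=-0.5, gamma=0.0)
most_likely = configs[int(np.argmax(probs))]
print("partition function Z =", Z)
print("a most likely configuration:\n", most_likely)
```

The enumeration has $2^{mn}$ terms, which is why $Z$ is in general out of reach and samplers that avoid it are of interest.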
2.2.3 The Hammersley-Clifford Theorem: conditions
MRFs are defined by conditional probabilities, based on a neighbourhood sys-
tem.
GRFs are defined by a joint probability, decomposed into clique potentials.
The Hammersley-Clifford Theorem states that under mild conditions, both def-
initions are equivalent, i.e., a MRF is also a GRF and vice versa.
Two important conditions: existence of joint pdf + positivity
Existence of fX(x): See slide 77
Positivity condition
A probability distribution is said to satisfy the positivity condition if
$f_{X_i}(x_i) > 0$ for all $i = 1, \ldots, p$ implies that for $x = (x_1, \ldots, x_i, \ldots, x_p)$
we have $f_X(x) > 0$
A distribution that violates this condition is the uniform distribution on the unit
disk: $f_X(0.9, 0.8) = 0$ (since $0.9^2 + 0.8^2 > 1$) although $f_{X_1}(0.9) > 0$ and $f_{X_2}(0.8) > 0$
The Hammersley-Clifford Theorem
Theorem
If fX(x) exists and satisfies the positivity condition, then X is a MRF
with neighbourhood system ∂ if and only if it is a GRF whose cliques
C follow from the neighbourhood system ∂.
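⇐: GRF → MRF
Suppose X is a GRF with cliques $\mathcal{C}$ based on neighbourhood system $\partial$. Further denote
$I = \{1, \ldots, p\}$, and $i\partial i = \{i\} \cup \partial i$. Let $\mathcal{C}_i = \{C \in \mathcal{C} \,|\, i \in C\}$ be the cliques that contain site $i$.
Then
$$\begin{aligned}
P(X_i = x_i | X_{I \setminus \{i\}} = x_{I \setminus \{i\}})
&= \frac{P(X_i = x_i, X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}{P(X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}
 = \frac{P(X_i = x_i, X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}{\sum_{y_i} P(X_i = y_i, X_{I \setminus \{i\}} = x_{I \setminus \{i\}})} \\
&= \frac{\prod_{C \in \mathcal{C}_i} f_C(x_i, x_{C \cap \partial i}) \cdot \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_C)}{\sum_{y_i} \prod_{C \in \mathcal{C}_i} f_C(y_i, x_{C \cap \partial i}) \cdot \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_C)} \\
&= \frac{\prod_{C \in \mathcal{C}_i} f_C(x_i, x_{C \cap \partial i})}{\sum_{y_i} \prod_{C \in \mathcal{C}_i} f_C(y_i, x_{C \cap \partial i})} \\
&= \frac{\prod_{C \in \mathcal{C}_i} f_C(x_i, x_{C \cap \partial i})}{\sum_{y_i} \prod_{C \in \mathcal{C}_i} f_C(y_i, x_{C \cap \partial i})} \cdot \frac{\sum_{y_{I \setminus i\partial i}} \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_{C \cap \partial i}, y_{C \setminus \partial i})}{\sum_{y_{I \setminus i\partial i}} \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_{C \cap \partial i}, y_{C \setminus \partial i})} \\
&= \frac{\sum_{y_{I \setminus i\partial i}} \prod_{C \in \mathcal{C}} f_C(x_{C \cap i\partial i}, y_{C \setminus i\partial i})}{\sum_{y_{I \setminus \partial i}} \prod_{C \in \mathcal{C}} f_C(x_{C \cap \partial i}, y_{C \setminus \partial i})} \\
&= \frac{P(X_{i\partial i} = x_{i\partial i})}{P(X_{\partial i} = x_{\partial i})} = P(X_i = x_i | X_{\partial i} = x_{\partial i})
\end{aligned}$$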
The construction of a GRF out of a MRF
For the other direction (from MRF to GRF) we need a few auxiliary definitions and
results.
Let $g : \mathbb{R}^p \to \mathbb{R} : x \mapsto g(x)$ be a given function and let $o \in \mathbb{R}^p$ be a reference state for which
$g(o) > 0$. Then define for each $A \subset I = \{1, \ldots, p\}$ the function
$$G_A(x) = g(u(x)) \text{ where } u : \mathbb{R}^p \to \mathbb{R}^p \text{ and } u_i = x_i \text{ if } i \in A \text{ and } u_i = o_i \text{ if } i \in I \setminus A$$
Further define
$$H_A(x) = \sum_{B \subseteq A} (-1)^{\#(A \setminus B)} G_B(x)$$
Then we have the following results
• $H_\emptyset(x)$ is a constant: $H_\emptyset(x) = g(o), \forall x$
• $H_A(x)$ does not depend on the components of x with index outside A:
if $x_A = y_A$, then $H_A(x) = H_A(y)$
• If one of the components of x with index in A takes the corresponding reference
value, then $H_A(x) = 0$: for $A \neq \emptyset$, if $x_i = o_i$ for at least one $i \in A$, then $H_A(x) = 0$
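Proof. Define
$$\mathcal{B}_i = \{B \subset A \,|\, i \notin B\}, \quad \bar{B} = B \cup \{i\}, \quad \bar{\mathcal{B}}_i = \{\bar{B} = B \cup \{i\} \,|\, B \in \mathcal{B}_i\},$$
then $\mathcal{B}_i$ and $\bar{\mathcal{B}}_i$ constitute an equal partition of $2^A = \{B \subset A\}$.
For a pair $\{B, \bar{B} = B \cup \{i\}\}$, and for any x with $x_i = o_i$, we have that $G_B(x) = G_{\bar{B}}(x)$.
Since $\#(A \setminus \bar{B}) = \#(A \setminus B) - 1$, the two terms of each pair have opposite signs, and so
$$H_A(x) = \sum_{B \in \mathcal{B}_i} \left[(-1)^{\#(A \setminus B)} G_B(x) + (-1)^{\#(A \setminus \bar{B})} G_{\bar{B}}(x)\right] = 0$$
• (Möbius Inversion) $g(x) = G_I(x) = \sum_{A \subseteq I} H_A(x)$
Proof
$$\sum_{A \subseteq I} H_A(x) = \sum_{A \subseteq I} \sum_{B \subseteq A} (-1)^{\#(A \setminus B)} G_B(x) = \sum_{B \subseteq I} G_B(x) \sum_{A : B \subseteq A \subseteq I} (-1)^{\#(A \setminus B)}$$
(We have switched the order of summations and moved $G_B(x)$ forward)
Denote $D = A \setminus B$, then $B \subseteq A \subseteq I \Leftrightarrow \emptyset \subseteq D \subseteq I \setminus B$, and so we get
$$\sum_{A \subseteq I} H_A(x) = \sum_{B \subseteq I} G_B(x) \sum_{D \subseteq I \setminus B} (-1)^{\#D}$$
Unless $B = I$, the number of subsets $D \subseteq I \setminus B$ is even, and exactly half of those
subsets have an even $\#D$, and the other half have an odd $\#D$, hence all but one of the
terms in the outer sum are zero, leading to
$$\sum_{A \subseteq I} H_A(x) = G_I(x) = g(x)$$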
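The three properties of $H_A$ listed above lend themselves to a quick numerical sanity check. The sketch below is an illustration only: it assumes a randomly generated function $g$ on $\{0,1\}^3$ (stored as a table), takes the reference state $o = (0,0,0)$, and evaluates $G_A$ and $H_A$ directly from their definitions.

```python
import itertools
import numpy as np

p = 3
rng = np.random.default_rng(0)
table = rng.uniform(0.5, 2.0, size=(2,) * p)   # hypothetical g on {0, 1}^p
o = (0,) * p                                   # reference state

def g(x):
    return table[tuple(x)]

def subsets(A):
    """All subsets of A, as frozensets."""
    A = tuple(A)
    return [frozenset(c) for r in range(len(A) + 1)
            for c in itertools.combinations(A, r)]

def G(A, x):
    """g evaluated at u, with u_i = x_i for i in A and u_i = o_i otherwise."""
    u = tuple(x[i] if i in A else o[i] for i in range(p))
    return g(u)

def H(A, x):
    """H_A(x) = sum over B subset of A of (-1)^#(A minus B) * G_B(x)."""
    return sum((-1) ** (len(A) - len(B)) * G(B, x) for B in subsets(A))

I = frozenset(range(p))
x = (1, 0, 1)
total = sum(H(A, x) for A in subsets(I))       # Moebius inversion: equals g(x)

print("g(x) =", g(x), " sum of H_A(x) =", total)
print("H_{0,1}(x) with x_1 = o_1:", H(frozenset({0, 1}), x))
```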
Proof of Hammersley-Clifford ⇒ MRF → GRF
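Theorem
If $g(x) = -\log f_X(x)$ where $f_X(x)$ is the joint probability distribution of a
MRF on x with cliques $\mathcal{C}$, then in the construction above $H_A(x) = 0$ if
$A \notin \mathcal{C}$.
Proof
Suppose that $A \notin \mathcal{C}$, then there must be two elements, say $i, j \in A$, so that $i \notin \partial j$ and
vice versa.
For the given $i$, define as before
$$\mathcal{B}_i = \{B \subset A \,|\, i \notin B\}, \quad \bar{B} = B \cup \{i\}, \quad \bar{\mathcal{B}}_i = \{\bar{B} = B \cup \{i\} \,|\, B \in \mathcal{B}_i\}$$
Then
$$H_A(x) = \sum_{B \in \mathcal{B}_i} \left[(-1)^{\#(A \setminus B)} G_B(x) + (-1)^{\#(A \setminus \bar{B})} G_{\bar{B}}(x)\right] = \sum_{B \in \mathcal{B}_i} (-1)^{\#(A \setminus B)} \left[G_B(x) - G_{\bar{B}}(x)\right]$$
Denoting $\bar{u} = (x_{\bar{B}}\, o_{I \setminus \bar{B}})$, we have
$$\begin{aligned}
G_{\bar{B}}(x) &= -\log f_X(\bar{u}) \\
&= -\log\left[f_{X_{I \setminus \{i\}}}(\bar{u}_{I \setminus \{i\}}) \cdot f_{X_i | X_{I \setminus \{i\}}}(x_i | \bar{u}_{I \setminus \{i\}})\right] \\
&= -\log f_{X_{I \setminus \{i\}}}(\bar{u}_{I \setminus \{i\}}) - \log f_{X_i | X_{\partial i}}(x_i | \bar{u}_{\partial i})
\end{aligned}$$
Denoting $u = (x_B\, o_{I \setminus B})$, we see that $u$ and $\bar{u}$ differ only in $i$, so $u_{I \setminus \{i\}} = \bar{u}_{I \setminus \{i\}}$, and so we
can write
$$G_B(x) = -\log f_{X_{I \setminus \{i\}}}(\bar{u}_{I \setminus \{i\}}) - \log f_{X_i | X_{\partial i}}(o_i | \bar{u}_{\partial i})$$
The difference between both is then
$$G_{\bar{B}}(x) - G_B(x) = -\log f_{X_i | X_{\partial i}}(x_i | \bar{u}_{\partial i}) + \log f_{X_i | X_{\partial i}}(o_i | \bar{u}_{\partial i})$$
The common term that was annihilated depended on index $j$, but what remains does
not, as $j \notin \partial i$, hence none of the terms in $H_A(x)$ depends on the value of $x_j$. Hence,
$H_A(x) = H_A(y)$, where $y_\ell = x_\ell$ for $\ell \neq j$ and $y_j = o_j$. We have seen that for such an
argument $H_A(y) = 0$, from which the proof follows.
The proof assumes positivity because the annihilations that take place are implicitly
based on ratios of probabilities (differences of log-probabilities), which are all assumed
to be nonzero.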
Importance of Hammersley-Clifford in MCMC
The constructive proof of Hammersley-Clifford shows that the conditional
probabilities in a Markov model allow us to construct the joint distribution
as
$$f_X(x) = \frac{1}{Z} \cdot \exp\left(-\sum_{C \in \mathcal{C}} H_C(x_C)\right)$$
where for a chosen $i \in C$, and a reference state $o$
$$H_C(x_C) = \sum_{B \subseteq C | i \in B} (-1)^{\#(C \setminus B)} \log\left(\frac{f_{X_i | X_{\partial i}}(o_i | u_{\partial i})}{f_{X_i | X_{\partial i}}(x_i | u_{\partial i})}\right)$$
where, for each $B$ in the sum, $u_j = x_j$ if $j \in B$ and $u_j = o_j$ if $j \notin B$ (the sign factor follows
from the definition of $H_A$ on slide 71). The partition function Z follows
from the choices of $o$ and $i \in C$.
Hammersley-Clifford states that conditional probabilities in a Markov model are sufficient to
define the joint probability of a random vector.
This is unlike marginal probabilities: they do not uniquely fix the joint probability
(as they contain no information about the dependence structure)
A construction without cliques
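In some applications (such as the one we will need), the clique potentials are just an
intermediate result. It is possible to construct the joint distribution directly from the
conditional distributions, however, without proving that it factorizes into clique potential
functions.
Simplified theorem (no cliques)
$$\frac{f_X(x)}{f_X(o)} = \prod_{i=1}^{p} \frac{f_{X_i | X_{I \setminus \{i\}}}(x_i | o_{\{1,\ldots,i-1\}} x_{\{i+1,\ldots,p\}})}{f_{X_i | X_{I \setminus \{i\}}}(o_i | o_{\{1,\ldots,i-1\}} x_{\{i+1,\ldots,p\}})}$$
Or, otherwise stated
$$f_X(x) \propto \prod_{i=1}^{p} \frac{f_{X_i | X_{I \setminus \{i\}}}(x_i | o_{\{1,\ldots,i-1\}} x_{\{i+1,\ldots,p\}})}{f_{X_i | X_{I \setminus \{i\}}}(o_i | o_{\{1,\ldots,i-1\}} x_{\{i+1,\ldots,p\}})}$$
Proof
We start from the right-hand side
$$\prod_{i=1}^{p} \frac{f_{X_i | X_{I \setminus \{i\}}}(x_i | o_{\{1,\ldots,i-1\}} x_{\{i+1,\ldots,p\}})}{f_{X_i | X_{I \setminus \{i\}}}(o_i | o_{\{1,\ldots,i-1\}} x_{\{i+1,\ldots,p\}})} = \prod_{i=1}^{p} \frac{f_X(o_{\{1,\ldots,i-1\}} x_{\{i,\ldots,p\}}) / f_{X_{I \setminus \{i\}}}(o_{\{1,\ldots,i-1\}} x_{\{i+1,\ldots,p\}})}{f_X(o_{\{1,\ldots,i\}} x_{\{i+1,\ldots,p\}}) / f_{X_{I \setminus \{i\}}}(o_{\{1,\ldots,i-1\}} x_{\{i+1,\ldots,p\}})}$$
The marginals $f_{X_{I \setminus \{i\}}}$ cancel within each factor. Each numerator $f_X(o_{\{1,\ldots,i-1\}} x_{\{i,\ldots,p\}})$ then
cancels against the denominator of the previous factor, leaving us with the first numerator, $f_X(x)$,
and the last denominator, $f_X(o)$, which is exactly the expression of the left hand side.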
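The telescoping product can be verified numerically. The following sketch assumes a randomly generated, strictly positive joint pmf on $\{0,1\}^3$ (so that the positivity condition holds), computes the full conditionals from that table, and compares the product of conditional ratios with $f_X(x)/f_X(o)$. The helper names are hypothetical.

```python
import itertools
import numpy as np

p = 3
rng = np.random.default_rng(1)
f = rng.uniform(0.1, 1.0, size=(2,) * p)
f /= f.sum()                                   # strictly positive joint pmf

def conditional(i, value, rest):
    """P(X_i = value | all other components as in rest); rest[i] is ignored."""
    num = list(rest)
    num[i] = value
    denom = sum(f[tuple(num[:i] + [v] + num[i + 1:])] for v in (0, 1))
    return f[tuple(num)] / denom

def ratio_from_conditionals(x, o):
    """Right-hand side of the simplified theorem."""
    prod = 1.0
    for i in range(p):
        # condition on o for indices before i and on x for indices after i
        rest = list(o[:i]) + [None] + list(x[i + 1:])
        prod *= conditional(i, x[i], rest) / conditional(i, o[i], rest)
    return prod

o = (0,) * p
for x in itertools.product([0, 1], repeat=p):
    print(x, ratio_from_conditionals(x, o), f[x] / f[o])
```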
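Note on the existence of a joint distribution
Note Hammersley-Clifford does not guarantee existence of the joint distribution,
but if it exists, it is well defined by the conditional probabilities.
Example Consider $X_1 | X_2 \sim \exp(\lambda X_2)$ and $X_2 | X_1 \sim \exp(\lambda X_1)$, then according
to the construction above, we find that
$$f_{(X_1, X_2)}(x_1, x_2) \propto \frac{f_{X_1 | X_2}(x_1 | x_2)}{f_{X_1 | X_2}(o_1 | x_2)} \cdot \frac{f_{X_2 | X_1}(x_2 | o_1)}{f_{X_2 | X_1}(o_2 | o_1)} = \frac{\lambda x_2 e^{-\lambda x_2 \cdot x_1}}{\lambda x_2 e^{-\lambda x_2 \cdot o_1}} \cdot \frac{\lambda o_1 e^{-\lambda o_1 \cdot x_2}}{\lambda o_1 e^{-\lambda o_1 \cdot o_2}} \propto e^{-\lambda x_2 \cdot x_1}$$
The function $\exp(-\lambda x_2 x_1)$ has no finite integral on $[0, \infty[ \times [0, \infty[$, and therefore
it cannot be normalized to be a (2D) density function.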
From HC to Markov Chain Monte Carlo
• Sample from conditional distributions in MRF X (= any multivariate random
variable)
• Creates sequence of samples X1,X2,X3, . . . that are a Markov chain of
Markov Random Fields
2.3 MCMC samplers for integration
2.3.1 The Gibbs sampler
Suppose X is a p-dimensional random vector, and we can sample from the conditional
densities $f_{X_i | X_{I \setminus \{i\}}}(x_i | x_{I \setminus \{i\}}) = f_{X_i | X_{\partial i}}(x_i | x_{\partial i})$
Then we construct the following sampler
Set initial values $x_0 = (x_{0,1}, \ldots, x_{0,p})$
for n = 1, 2, . . .
    for i = 1, . . . , p
        Draw $X_{n,i} \sim f_{X_i | X_{I \setminus \{i\}}}(x \,|\, x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p})$
The Gibbs-sampler consists of loops defined by conditional distributions.
Therefore, the sampler is based on the description of fX(x) as a Markov
random field. Moreover, the sequence can be seen as a Markov Chain.
So, the Gibbs sampler does NOT rely on the description of fX(x) as a Gibbs
random field. GRF will be at the basis of the Metropolis-Hastings sampler on
slide 90
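The two nested loops of the sampler can be sketched for a case where the full conditionals are known in closed form. The example below is an assumption chosen for illustration, not an example from the slides: a standard bivariate normal with correlation $\rho$, for which $X_1 | X_2 = x_2 \sim N(\rho x_2, 1 - \rho^2)$ and symmetrically for $X_2$.

```python
import numpy as np

def gibbs_bivariate_normal(n_iter, rho, x0=(0.0, 0.0), seed=0):
    """Systematic-scan Gibbs sampler for a standard bivariate normal."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)           # conditional standard deviation
    x = np.array(x0, dtype=float)
    chain = np.empty((n_iter, 2))
    for n in range(n_iter):                # outer loop over iterations
        # inner loop over the components, written out for p = 2
        x[0] = rng.normal(rho * x[1], sd)  # uses the previous value of x[1]
        x[1] = rng.normal(rho * x[0], sd)  # uses the freshly drawn x[0]
        chain[n] = x
    return chain

chain = gibbs_bivariate_normal(20000, rho=0.8)
est_mean = chain.mean(axis=0)
est_corr = np.corrcoef(chain.T)[0, 1]
print("estimated mean:", est_mean)
print("estimated correlation:", est_corr)
```

Each component is drawn given the most recent values of all the others, which is exactly the index pattern $x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p}$ in the sampler above.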
Invariant distribution
On slide 81, we prove:
The joint distribution fX(x) is invariant under the loops of a Gibbs-sampler
We consider the sequence of states after each outer loop (i.e., iterations over
n), not the inner loops (over the vector components).
We consider the case of a discrete state space.
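Lemma The transition probabilities over the outer loops satisfy
$$f_{X_{n+1} | X_n}(x | v) = \prod_{i=1}^{p} f_{X_i | X_{I \setminus \{i\}}}(x_i | x_{1,\ldots,i-1} v_{i+1,\ldots,p})$$
Proof (discrete case)
This follows from the chain rule
$$P\left(\bigcap_{i=1}^{p} A_i \,\Bigg|\, B\right) = \prod_{i=1}^{p} P\left(A_i \,\Bigg|\, \bigcap_{j=1}^{i-1} A_j \cap B\right)$$
where in our case $A_i = \{X_{n+1;i} = x_i\}$ and $B = \{X_n = v\}$ □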
Invariant distribution: proof
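We now consider the case of a discrete state space, and suppose that
$f_{X_n}(x) = f_X(x)$, then
$$\begin{aligned}
f_{X_{n+1}}(x) &= \sum_{v} f_{X_{n+1} | X_n}(x | v) \cdot f_{X_n}(v) \\
&= \sum_{v} \prod_{i=1}^{p} f_{X_i | X_{I \setminus \{i\}}}(x_i | x_{1,\ldots,i-1} v_{i+1,\ldots,p}) \cdot f_X(v) \\
&= \sum_{v_p} \cdots \sum_{v_1} f_X(v) \cdot f_{X_1 | X_{I \setminus \{1\}}}(x_1 | v_{2,\ldots,p}) \cdot \prod_{i=2}^{p} f_{X_i | X_{I \setminus \{i\}}}(x_i | x_{1,\ldots,i-1} v_{i+1,\ldots,p}) \\
&= \sum_{v_p} \cdots \sum_{v_2} f_{X_{2,\ldots,p}}(v_{2,\ldots,p}) \cdot f_{X_1 | X_{I \setminus \{1\}}}(x_1 | v_{2,\ldots,p}) \cdot \prod_{i=2}^{p} f_{X_i | X_{I \setminus \{i\}}}(x_i | x_{1,\ldots,i-1} v_{i+1,\ldots,p}) \\
&= \sum_{v_p} \cdots \sum_{v_2} f_X(x_1 v_{2,\ldots,p}) \cdot \prod_{i=2}^{p} f_{X_i | X_{I \setminus \{i\}}}(x_i | x_{1,\ldots,i-1} v_{i+1,\ldots,p}) \\
&= \cdots = f_X(x)
\end{aligned}$$
In the expressions above, we used that $\sum_{v_1} f_X(v) = f_{X_{2,\ldots,p}}(v_{2,\ldots,p})$
NB: The notation $X_{2,\ldots,p}$ refers to the components of X, not to successive
Markov Chain realisations like in $X_n$.
We then used $f_{X_{2,\ldots,p}}(v_{2,\ldots,p}) \cdot f_{X_1 | X_{I \setminus \{1\}}}(x_1 | v_{2,\ldots,p}) = f_X(x_1 v_{2,\ldots,p})$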
Reversibility
The proof of the invariance property of fX(x) w.r.t. the Gibbs sampler estab-
lished a global balance equation, not a detailed balance equation. A detailed
balance equation is necessary for reversibility.
The Gibbs-sampler as a whole is not reversible, meaning
$$f_{X_{n-1} | X_n}(x_{n-1} | x_n) \neq f_{X_{n+1} | X_n}(x_{n-1} | x_n)$$
The probability that we arrive in $x_{n-1}$ given $x_n$
≠ the probability that we come from $x_{n-1}$ given that we are in $x_n$
Each substep (inner loop) on its own is reversible. That is, if we have generated
a new ith component $x_i$, we could “undo” that step (“undo” in probabilistic
sense, that is). In order to undo the complete Gibbs iteration step, the substeps
have to be followed in reverse order.
One can prove that a reversible Gibbs sampler can be constructed by randomizing
the order of substeps.
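On a small discrete state space these statements can be checked by writing out the transition matrices. The sketch below assumes a random positive pmf on $\{0,1\}^2$ (an arbitrary choice for illustration) and builds the kernel of each substep and of the full sweep. It then tests global balance, detailed balance, and whether the time-reversed sweep equals the substeps in reverse order.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
states = list(itertools.product([0, 1], repeat=2))
index = {s: k for k, s in enumerate(states)}
f = rng.uniform(0.1, 1.0, size=len(states))
f /= f.sum()                                   # hypothetical target pmf

def substep_kernel(i):
    """Transition matrix of the Gibbs substep that redraws component i."""
    K = np.zeros((len(states), len(states)))
    for s in states:
        # candidates share all components with s except component i
        cands = [s[:i] + (v,) + s[i + 1:] for v in (0, 1)]
        norm = sum(f[index[c]] for c in cands)
        for c in cands:
            K[index[s], index[c]] = f[index[c]] / norm
    return K

def detailed_balance(P):
    flow = f[:, None] * P                      # flow[x, y] = f(x) * P(x -> y)
    return np.allclose(flow, flow.T)

K1, K2 = substep_kernel(0), substep_kernel(1)
sweep = K1 @ K2                                # component 1 first, then component 2
time_reversed = (sweep.T * f[None, :]) / f[:, None]

print("f invariant under the sweep:", np.allclose(f @ sweep, f))
print("substeps reversible:", detailed_balance(K1), detailed_balance(K2))
print("full sweep reversible:", detailed_balance(sweep))
print("reversed sweep = substeps in reverse order:", np.allclose(time_reversed, K2 @ K1))
```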
Convergence
Under a mild assumption (positivity of $f_X(x)$), the Gibbs sampler creates a
Markov chain for which $X_n \xrightarrow{\text{dist}} X \sim f_X$
If the Gibbs sampler Markov chain is irreducible and recurrent, then for any
integrable function $h(x)$ we have
$$\frac{1}{M} \sum_{n=1}^{M} h(X_n) \xrightarrow{P} E[h(X)]$$
Foundations for MCMC
MCMC is used for sampling from multidimensional random variables. It has
two aspects
• Sampling proceeds through conditional probabilities/densities
• The subsequent samples are dependent → Markov Chain
We have to make sure that
• Conditionals define the correct joint distribution in a unique way: Hammersley-Clifford
• Convergence of the Markov chain takes over the role of the law of large numbers
– The target joint distribution is invariant under the Gibbs sampler Markov
Chain
– The chain converges to the invariant distribution
– Although convergence is a limit property, all generated samples of a
Gibbs sampler can be used in estimating the expected value of h(X).
An example from Bayesian statistics
A Hidden Markov Random Field (HMM - Hidden Markov Model)
Suppose that we have the following graphical model for observations Y
Y   •   •   •   •   •   •
    |   |   |   |   |   |
M   •   •   •   •   •   •
    |   |   |   |   |   |
X   • − • − • − • − • − •
• We observe $Y$, where $Y_i$ and $Y_j$ are dependent, but conditioned on the
hidden or latent states $X_i$ and $X_j$ they are independent.
• The observation consists of two parts: the real signal (expression) $M$ and
the noise $Y - M$. Goal: inference on $f_{M|Y}(m|y)$
• The latent state is a binary label: $X_i = +1$ means that $M_i$ is probably large,
$X_i = -1$ means that $M_i$ is probably small.
• Large values of $M_i$ are clustered
A formalisation of the graphical model
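Suppose $X \in \{-1, 1\}^p \sim \text{Ising}(\tau, \gamma)$, that is
$$P(X = x) = \frac{1}{T} \cdot \exp\left[-\tau \sum_{i=2}^{p} x_i x_{i-1}\right] \cdot \exp\left[-\gamma \sum_{i=1}^{p} x_i\right]$$
with partition function
$$T = \sum_{x \in \{-1,1\}^p} \exp\left[-\tau \sum_{i=2}^{p} x_i x_{i-1}\right] \cdot \exp\left[-\gamma \sum_{i=1}^{p} x_i\right]$$
We observe $Y_i = M_i + V_i$, with $V_i$ independent normal observational errors
with zero mean and common variance $\sigma^2$ and $M_i$ a mixture:
$$M_i = \frac{1 - X_i}{2} \cdot R_i + \frac{1 + X_i}{2} \cdot S_i \quad\text{with } R_i \sim N(0, \kappa^2) \text{ and } S_i \sim N(0, K^2)$$
and all these are independent.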
The hyperparameters γ, τ,K2, κ2, σ2 are assumed to be known.
Bayesian inference 1: posterior law of total probability
We want to know E(M |Y )
The posterior total probability is
$$f_{M_i | Y}(m | y) = f_{M_i | X_i = -1; Y}(m | y) \cdot P(X_i = -1 | Y = y) + f_{M_i | X_i = 1; Y}(m | y) \cdot P(X_i = 1 | Y = y)$$
The only dependence between components of Y lies in the Hidden Gibbs/Ising random
field, so
$$f_{M_i | X_i = \pm 1; Y}(m | y) = f_{M_i | X_i = \pm 1; Y_i}(m | y_i)$$
Filling in leads to
$$f_{M_i | Y}(m | y) = f_{M_i | X_i = -1; Y_i}(m | y_i) \cdot P(X_i = -1 | Y = y) + f_{M_i | X_i = 1; Y_i}(m | y_i) \cdot P(X_i = 1 | Y = y)$$
For $X_i = -1$, $Y_i = R_i + V_i \sim N(0, \sigma^2 + \kappa^2)$.
Hence $\text{cov}(M_i, Y_i) = \text{cov}(R_i, R_i + V_i) = \text{var}(R_i) + 0 = \kappa^2$
And from properties of the multivariate normal distribution (See slides Chapter 1, page
29) we know
$$(M_i | Y_i = y, X_i = -1) = (R_i | Y_i = y) \sim N\left(\frac{\kappa^2}{\kappa^2 + \sigma^2} \cdot y, \frac{\kappa^2 \sigma^2}{\kappa^2 + \sigma^2}\right)$$
The same holds for $X_i = 1$, replacing $\kappa^2$ by $K^2$.
This leads to
$$E(M_i | Y = y) = y_i \cdot \left[\frac{\kappa^2}{\kappa^2 + \sigma^2} \cdot P(X_i = -1 | Y = y) + \frac{K^2}{K^2 + \sigma^2} \cdot P(X_i = 1 | Y = y)\right]$$
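The final expression is a weighted combination of two linear shrinkage rules. As a small sketch (the function name and the numerical values are hypothetical), it can be coded as follows, with `prob_plus` standing for $P(X_i = 1 | Y = y)$, which still has to be obtained from the sampler discussed next.

```python
def posterior_mean(y_i, prob_plus, kappa2, K2, sigma2):
    """E(M_i | Y = y) for the two-component normal mixture prior.

    prob_plus is P(X_i = 1 | Y = y); kappa2, K2 and sigma2 are variances.
    """
    shrink_minus = kappa2 / (kappa2 + sigma2)   # shrinkage factor if X_i = -1
    shrink_plus = K2 / (K2 + sigma2)            # shrinkage factor if X_i = +1
    return y_i * (shrink_minus * (1.0 - prob_plus) + shrink_plus * prob_plus)

# Hypothetical hyperparameters: small signal variance 1, large signal variance 9.
for prob in (0.0, 0.5, 1.0):
    print(prob, posterior_mean(2.0, prob, kappa2=1.0, K2=9.0, sigma2=1.0))
```

When the label is almost surely $-1$ the observation is shrunk strongly towards zero, and when it is almost surely $+1$ it is left nearly untouched.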
Bayesian inference 2: posterior label probabilities
We still need P (Xi = −1|Y = y). We compute the marginal posterior proba-
bilities of Xi from the joint posterior: P (X = x|Y = y)
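A Gibbs sampler for this posterior probability would draw from
$$\begin{aligned}
& P(X_i = x_{n,i} | X_{I \setminus \{i\}} = x_{n;1,\ldots,i-1} x_{n-1;i+1,\ldots,p}; Y = y) \\
&= P(X_i = x_{n,i} | X_{I \setminus \{i\}} = x_{n;1,\ldots,i-1} x_{n-1;i+1,\ldots,p}; Y_i = y_i) \\
&= \frac{P(X_i = x_{n,i} | X_{I \setminus \{i\}} = x_{n;1,\ldots,i-1} x_{n-1;i+1,\ldots,p}) \cdot f_{Y_i | X}(y_i | x_{n;1,\ldots,i-1} x_{n,i} x_{n-1;i+1,\ldots,p})}{f_{Y_i | X_{I \setminus \{i\}}}(y_i | x_{n;1,\ldots,i-1} x_{n-1;i+1,\ldots,p})} \\
&= P(X_i = x_{n,i} | X_{I \setminus \{i\}} = x_{n;1,\ldots,i-1} x_{n-1;i+1,\ldots,p}) \cdot \frac{f_{Y_i | X_i}(y_i | x_{n,i})}{f_{Y_i | X_i}(y_i | 1) \cdot P(X_i = 1) + f_{Y_i | X_i}(y_i | -1) \cdot P(X_i = -1)}
\end{aligned}$$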
This expression has three components
1. Conditional probabilities
Yi|Xi = −1 ∼ N(0, κ2 + σ2) and Yi|Xi = 1 ∼ N(0,K2 + σ2)
2. Marginal probabilities
The prior marginal probabilities P (Xi = 1) and P (Xi = −1) have to be
computed from (Markov Chain) Monte Carlo sampling of the prior model.
3. The transition probabilities (see next slide)
Transition probabilities
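We know (from the proof of Hammersley-Clifford)
$$P(X_i = x_i | X_{I \setminus \{i\}} = x_{I \setminus \{i\}}) = P(X_i = x_i | X_{\partial i} = x_{\partial i}) = \frac{\exp\left(-\sum_{C | i \in C} H_C(x_C)\right)}{\sum_{y_i} \exp\left(-\sum_{C | i \in C} H_C(y_i x_{C \setminus \{i\}})\right)}$$
which is in our case
$$\begin{aligned}
P(X_i = x_i | X_{i-1} = x_{i-1}, X_{i+1} = x_{i+1})
&= \frac{\exp\left(-\tau (x_i x_{i-1} + x_i x_{i+1})\right) \cdot \exp(-\gamma x_i)}{\sum_{y_i \in \{-1,1\}} \exp\left(-\tau (y_i x_{i-1} + y_i x_{i+1})\right) \exp(-\gamma y_i)} \\
&= \frac{\exp\left(-\tau (x_i x_{i-1} + x_i x_{i+1})\right) \cdot \exp(-\gamma x_i)}{\exp\left(-\tau (x_{i-1} + x_{i+1})\right) \exp(-\gamma) + \exp\left(+\tau (x_{i-1} + x_{i+1})\right) \exp(\gamma)}
\end{aligned}$$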
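The conditional probability above is all that a Gibbs sweep over the prior label field needs. The sketch below is an illustration under stated assumptions: the values of $\tau$ and $\gamma$ and the chain length $p = 6$ are arbitrary, the function names are hypothetical, and a missing neighbour at either end of the chain is encoded as 0 so that it drops out of the exponent. It also estimates the prior marginals $P(X_i = 1)$ by Monte Carlo, as suggested on the previous slide.

```python
import numpy as np

def cond_prob(x_i, left, right, tau, gamma):
    """P(X_i = x_i | X_{i-1} = left, X_{i+1} = right) in the 1D Ising prior."""
    s = left + right

    def weight(v):
        return np.exp(-tau * v * s - gamma * v)

    return weight(x_i) / (weight(-1) + weight(+1))

def joint_weight(x, tau, gamma):
    """Unnormalised prior probability of a full label vector x."""
    x = np.asarray(x)
    return np.exp(-tau * np.sum(x[1:] * x[:-1]) - gamma * np.sum(x))

def gibbs_sweep(x, tau, gamma, rng):
    """One pass over all sites, updating x in place."""
    p = len(x)
    for i in range(p):
        left = x[i - 1] if i > 0 else 0        # no neighbour: contributes nothing
        right = x[i + 1] if i < p - 1 else 0
        x[i] = 1 if rng.random() < cond_prob(1, left, right, tau, gamma) else -1
    return x

tau, gamma = -0.7, 0.2                         # hypothetical hyperparameters
rng = np.random.default_rng(3)
x = rng.choice([-1, 1], size=6)
visits_plus = np.zeros(6)
n_sweeps = 5000
for _ in range(n_sweeps):
    x = gibbs_sweep(x, tau, gamma, rng)
    visits_plus += (x == 1)
prior_marginals = visits_plus / n_sweeps
print("estimated P(X_i = 1):", prior_marginals)
```

Only the local sum of clique potentials enters `cond_prob`, so the partition function $T$ is never computed.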
2.3.2 Metropolis-Hastings sampler
Gibbs-sampler
1. Based on conditional probabilities in a multidimensional random vector ⇒ Markov random field
2. If vector components are highly correlated, conditional sampling leads to
values that are close to old ones: slow move through range of possible
values, hence slow convergence
Metropolis-Hastings sampler (MH)
1. Local update of previous sample: Markov Chain of samples (like Gibbs
sampler)
2. Based on joint probabilities⇒ Gibbs random field
A GRF is defined through clique potentials (Slide 65), which requires the computa-
tion of the normalising constant (partition function). MH will be able to sample from
fX(x) if it is known up to such a constant; cfr.: rejection sampling, slide 47
3. One or more dimensions (↔ Gibbs sampler is always for random vectors)
4. Uses an accept/reject step, as in rejection sampling: each proposed sample has to be accepted
A proposal/transition distribution
Given a state $X_n$, a possible new state $X'_n$ is generated from a distribution
$q(x | X_n = x_n)$
In principle, this proposal distribution can be any good choice. This distribution
should be easy to work with. It typically describes local updates.
The new state is accepted, i.e., $X_{n+1} = X'_n$, if and only if
$$U \leq \frac{f_X(X'_n) \cdot q(X_n | X'_n)}{f_X(X_n) \cdot q(X'_n | X_n)}$$
where $U \sim \text{uniform}[0, 1]$. Otherwise the chain stays where it is, $X_{n+1} = X_n$.
The acceptance probability
Given a proposal $X'_n = x'_n$ the probability that it is accepted equals
$$\alpha(x'_n; x_n) = \min\left(1, \frac{f_X(x'_n) \cdot q(x_n | x'_n)}{f_X(x_n) \cdot q(x'_n | x_n)}\right)$$
Remark If the distribution has the form $f_X(x) = \frac{1}{Z} \exp[-H(x)]$ then the acceptance
probability does not depend on Z. Often Z is very hard to find
(integration/summation over all possible configurations).
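A minimal sketch of a Metropolis-Hastings chain with a non-symmetric proposal illustrates both the ratio $q(x_n | x'_n) / q(x'_n | x_n)$ and the remark about $Z$. The target and proposal are assumptions chosen for this example: the unnormalised density $h(x) = x^2 e^{-x}$ on $(0, \infty)$ (a Gamma(3, 1) density without its constant, so the mean should be 3) and a multiplicative proposal $x' = x \cdot e^{s\varepsilon}$ with $\varepsilon \sim N(0, 1)$, i.e., a log-normal $q$.

```python
import numpy as np

def log_h(x):
    """Log of the unnormalised target h(x) = x**2 * exp(-x) on (0, inf)."""
    return 2.0 * np.log(x) - x

def log_q(y, x, s):
    """Log proposal density q(y | x), up to a constant, for y = x * exp(s * eps)."""
    return -np.log(y) - (np.log(y) - np.log(x)) ** 2 / (2.0 * s ** 2)

def mh_chain(n_iter, x0=1.0, s=0.8, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    chain = np.empty(n_iter)
    for n in range(n_iter):
        y = x * np.exp(s * rng.normal())               # proposed state
        # log acceptance probability; the normalising constant Z never appears
        log_alpha = min(0.0, log_h(y) + log_q(x, y, s) - log_h(x) - log_q(y, x, s))
        if np.log(rng.random()) <= log_alpha:
            x = y                                      # accept
        chain[n] = x                                   # on rejection, repeat old state
    return chain

chain = mh_chain(50000)
print("sample mean:", chain.mean(), "(Gamma(3, 1) has mean 3)")
```

Working with logarithms avoids overflow in the ratio; the comparison $\log U \leq \log \alpha$ is equivalent to $U \leq \alpha$.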
Transition probabilities from one state to the next
The transition probability becomes (case of discrete states)
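• For $x_{n+1} \neq x_n$:
$$\begin{aligned}
P(X_{n+1} = x_{n+1} | X_n = x_n) &= P(X'_n = x_{n+1} | X_n = x_n) \cdot P(x_{n+1} \text{ accepted} \,|\, X_n = x_n, X'_n = x_{n+1}) \\
&= q(x_{n+1} | x_n) \cdot \alpha(x_{n+1}; x_n)
\end{aligned}$$
• The probability that the proposed state (whatever the proposal is) will be
rejected, given that the current state is $X_n = x_n$ equals
$$\begin{aligned}
r(x_n) := P(\text{rejected} \,|\, X_n = x_n) &= \sum_{x} P(X'_n = x | X_n = x_n) \cdot P(x \text{ rejected} \,|\, X_n = x_n, X'_n = x) \\
&= \sum_{x} q(x | x_n) \cdot (1 - \alpha(x; x_n)) \\
&= 1 - \sum_{x} q(x | x_n) \cdot \alpha(x; x_n) =: 1 - a(x_n)
\end{aligned}$$
• For $x_{n+1} = x_n$ we obtain
$$P(X_{n+1} = x_n | X_n = x_n) = q(x_n | x_n) \cdot \alpha(x_n; x_n) + (1 - a(x_n))$$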
Equilibrium distribution
The objective distribution fX(x) is an invariant distribution of a Metropolis-
Hastings sampler
Proof
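Denote the transition probabilities $P_{xy} = P(X_{n+1} = y | X_n = x)$
It holds that $P_{xy} = q(y | x) \cdot \alpha(y; x) + \delta(x, y) \cdot (1 - a(x))$
where $\delta(x, y)$ is the Kronecker-delta.
We consider $P_{xy} \cdot f_X(x) = q(y | x) \cdot \alpha(y; x) \cdot f_X(x) + \delta(x, y) \cdot (1 - a(x)) \cdot f_X(x)$
We have
$$\begin{aligned}
\alpha(y; x) \cdot q(y | x) \cdot f_X(x) &= \min\left(1, \frac{f_X(y) \cdot q(x | y)}{f_X(x) \cdot q(y | x)}\right) \cdot q(y | x) \cdot f_X(x) \\
&= \min\left(q(y | x) \cdot f_X(x), f_X(y) \cdot q(x | y)\right) \\
&= \min\left(\frac{f_X(x) \cdot q(y | x)}{f_X(y) \cdot q(x | y)}, 1\right) \cdot q(x | y) \cdot f_X(y) \\
&= \alpha(x; y) \cdot q(x | y) \cdot f_X(y)
\end{aligned}$$
The Kronecker-delta term is only active if $x = y$, so formally, one can always
write
$$\delta(x, y) \cdot (1 - a(x)) \cdot f_X(x) = \delta(y, x) \cdot (1 - a(y)) \cdot f_X(y)$$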
Conclusion of the proof
We may conclude that Pxy · fX(x) = Pyx · fX(y)
This is a detailed balance equation. Not only is the objective distribution in-
variant under the Metropolis-Hastings sampler, but also
The Metropolis-Hastings sampler is reversible
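For a finite state space, the detailed balance equation can be checked entry by entry. The sketch below assumes five states, a random unnormalised target $h$ and a random proposal matrix $q$ (all hypothetical), assembles the Metropolis-Hastings transition matrix from $q$ and $\alpha$ as on the previous slides, and verifies $P_{xy} f_X(x) = P_{yx} f_X(y)$.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 5
h = rng.uniform(0.1, 1.0, size=K)              # unnormalised target, f = h / Z
q = rng.uniform(0.1, 1.0, size=(K, K))
q /= q.sum(axis=1, keepdims=True)              # q[x, y] = q(y | x)

# alpha[x, y] = min(1, h(y) q(x | y) / (h(x) q(y | x))); Z cancels in the ratio
ratio = (h[None, :] * q.T) / (h[:, None] * q)
alpha = np.minimum(1.0, ratio)

# Off-diagonal entries: propose y and accept. Diagonal: everything else,
# i.e. q(x | x) * alpha(x; x) + (1 - a(x)).
P = q * alpha
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))

f = h / h.sum()
flow = f[:, None] * P                          # flow[x, y] = f(x) * P_xy
print("detailed balance:", np.allclose(flow, flow.T))
print("f invariant:", np.allclose(f @ P, f))
```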
A special case: the original Metropolis sampler
Suppose that the proposal distribution is symmetric in the sense that
$$q(x | y) = q(y | x)$$
This is realized by choosing $Y = X + \eta$ where $\eta$ has a zero mean symmetric
distribution $g(\eta)$, hence $q(y | x) = g(y - x)$.
Then the acceptance probability for a proposal y given a current state x becomes
$$\alpha(y; x) = \min\left(1, \frac{f_X(y) \cdot q(x | y)}{f_X(x) \cdot q(y | x)}\right) = \min\left(1, \frac{f_X(y)}{f_X(x)}\right)$$
This was the original procedure, proposed by Metropolis. It was later refined
by Hastings for arbitrary proposal distributions.
An example of a local update
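Consider $q(y | x) = f_{X_i | X_{I \setminus \{i\}}}(y_i | x_{I \setminus \{i\}})$ if $y_{I \setminus \{i\}} = x_{I \setminus \{i\}}$ (and otherwise $q(y | x) = 0$)
Configurations with $y_{I \setminus \{i\}} \neq x_{I \setminus \{i\}}$ have probability (density) zero.
It then holds
$$f_X(x) \cdot q(y | x) = f_X(x) \cdot f_{X_i | X_{I \setminus \{i\}}}(y_i | x_{I \setminus \{i\}}) = f_X(x) \cdot \frac{f_X(x_{I \setminus \{i\}} y_i)}{f_{X_{I \setminus \{i\}}}(x_{I \setminus \{i\}})}$$
and, keeping in mind that $y_{I \setminus \{i\}} = x_{I \setminus \{i\}}$,
$$\begin{aligned}
f_X(y) \cdot q(x | y) &= f_X(y) \cdot f_{X_i | X_{I \setminus \{i\}}}(x_i | y_{I \setminus \{i\}}) \\
&= f_X(y) \cdot \frac{f_X(y_{I \setminus \{i\}} x_i)}{f_{X_{I \setminus \{i\}}}(y_{I \setminus \{i\}})} \\
&= f_X(y_i x_{I \setminus \{i\}}) \cdot \frac{f_X(x)}{f_{X_{I \setminus \{i\}}}(x_{I \setminus \{i\}})} = f_X(x) \cdot q(y | x)
\end{aligned}$$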
An example of a local update: Gibbs sampler
From the previous slide, it follows
1. The acceptance probability $\alpha(y; x) = \min\left(1, \frac{f_X(y) \cdot q(x | y)}{f_X(x) \cdot q(y | x)}\right) = 1$
2. The process is reversible
In fact, this is one step of a Gibbs sampler.
1. In a general Metropolis-Hastings sampler, there is a free proposal, which is
evaluated: the evaluation uses the joint distribution
2. In the specific case of the Gibbs sampler, the proposal uses the conditional
distribution and there is no evaluation afterwards (so no joint distribution)
Convergence
We have seen that the objective distribution is invariant under a Metropolis-
Hastings sampler.
This is not enough for good convergence.
Indeed, the case of one step of a Gibbs sampler illustrates that updates may
be too local: in this case, only one component of the vector is subject to pos-
sible change. That implies that many states x are unreachable. The Markov
Chain is then reducible.
Irreducibility is obtained if q(y|x) > 0 for all pairs (x,y).
As this condition is sometimes too restrictive for every sampler separately, one
may consider a combination of different proposal distributions, e.g., a sequence
of one-at-a-time component Metropolis-Hastings samplers (such as the Gibbs
sampler).
Choice of the proposal distribution
The speed of convergence in a Metropolis-Hastings sampler depends on the
correlation between subsequent samples.
High correlation→ slow convergence
Subsequent samples should be as independent as possible. (Fully indepen-
dent samples are optimal, in the sense that the limiting distribution is reached
instantaneously, but they are often difficult to realize or difficult to sample from)
Inter-sample dependence is governed by two conflicting objectives
• Acceptance probability: low acceptance probability means high probability
that two subsequent samples are identical, hence, high correlation.
Acceptance probability is enhanced by proposal distributions with small
variance, i.e., a distribution that favours very local updates.
• Correlation between current state and proposed state. This source of cor-
relation is reduced by proposals that favour large updates.
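The trade-off between the two objectives can be observed in a small experiment. The sketch below is an illustration under assumptions: the target is the unnormalised standard normal density, the proposal is the symmetric random walk $Y = X + \text{step} \cdot \eta$ with $\eta \sim N(0, 1)$, and the three step sizes are arbitrary. For each step size it reports the acceptance rate and the lag-1 autocorrelation of the chain.

```python
import numpy as np

def random_walk_metropolis(step, n_iter=20000, seed=0):
    """Return (acceptance rate, lag-1 autocorrelation) for h(x) = exp(-x**2 / 2)."""
    rng = np.random.default_rng(seed)

    def log_h(x):
        return -0.5 * x ** 2

    x, accepted = 0.0, 0
    chain = np.empty(n_iter)
    for n in range(n_iter):
        y = x + step * rng.normal()            # symmetric proposal: q(y|x) = q(x|y)
        if np.log(rng.random()) <= log_h(y) - log_h(x):
            x, accepted = y, accepted + 1
        chain[n] = x
    lag1 = np.corrcoef(chain[:-1], chain[1:])[0, 1]
    return accepted / n_iter, lag1

results = {step: random_walk_metropolis(step) for step in (0.1, 2.4, 50.0)}
for step, (rate, lag1) in results.items():
    print(f"step {step:5.1f}: acceptance rate {rate:.3f}, lag-1 autocorrelation {lag1:.3f}")
```

Very small steps are nearly always accepted but barely move, very large steps are nearly always rejected, and in both cases subsequent samples are strongly correlated; an intermediate step size gives the lowest correlation of the three.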