Chapter 3: Monte Carlo methods

Maarten Jansen

Overview

1. Aspects of Monte Carlo Methods

1.1 Monte Carlo integration and importance sampling

1.2 Random number generators (slide 29)

1.2.1 Quantile method (slide 30)

1.2.2 Rejection sampling (slide 37)

2. Markov Chain Monte Carlo Methods

2.1 Markov Chains

2.2 Models for multivariate RV (slide 60)

2.2.1 Markov Random Fields (MRF) (slide 61)

2.2.2 Gibbs Random Fields (GRF) (slide 65)

2.2.3 The Hammersley-Clifford Theorem (slide 68)

2.3 MCMC samplers for integration

2.3.1 Gibbs sampler (slide 79)

2.3.2 Metropolis-Hastings sampler (slide 90)

2.4 Simulated annealing - MCMC optimization

© Maarten Jansen, STAT-F-408 Comp. Stat., Chap. 3: Monte Carlo methods

1. Aspects of Monte Carlo Methods

Monte Carlo simulation or stochastic simulation

• tries to re-formulate a problem such that its solution is the unknown parameter of an artificial random variable

• generates instances (an artificial sample) from that random variable

• applies statistical techniques to

– find (estimate) the parameter from the artificial sample

– evaluate the quality of the numerical outcome

• but it is essentially a method from numerical analysis

• Many of the applications of this numerical method come from statistical problems:

statistical problem → numerical solution → statistical technique


Two main categories of problems

• Integration

• Optimization


1.1 Monte Carlo integration and importance sampling

Suppose we want to evaluate $I = \int_a^b y(x)\,dx$

• Suppose $X \sim \mathrm{uniform}[a, b]$, then $I = (b-a) \cdot E(y(X))$

• Generate $X_i$, with $i = 1, \ldots, n$

• Estimate $\hat{I} = \frac{b-a}{n} \sum_{i=1}^{n} y(X_i)$
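The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `mc_integrate`, the test integrand sin(x) on [0, π] (exact integral 2) and the seed are all assumptions chosen for the example.

```python
import math
import random

def mc_integrate(y, a, b, n, rng):
    """Estimate the integral of y over [a, b] from n uniform draws."""
    total = 0.0
    for _ in range(n):
        x = rng.uniform(a, b)      # X_i ~ uniform[a, b]
        total += y(x)
    return (b - a) * total / n     # (b - a) times the sample mean of y(X_i)

rng = random.Random(0)             # fixed seed, only for reproducibility
estimate = mc_integrate(math.sin, 0.0, math.pi, 100_000, rng)
print(estimate)                    # the exact value of the integral is 2
```

The estimate fluctuates around the exact value from one seed to another, which is the stochastic error studied next.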


Accuracy of the stochastic approximation

We use statistical measures to evaluate the approximation

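1. Bias

$E(\hat{I}) = \frac{b-a}{n} \sum_{i=1}^{n} E(y(X_i)) = (b-a) \cdot E(y(X)) = (b-a) \int_a^b y(x) \cdot \frac{1}{b-a}\,dx = I$

The estimator is unbiased

2. Variance

$\mathrm{var}(\hat{I}) = \frac{(b-a)^2}{n} \mathrm{var}(y(X)) = \frac{(b-a)^2}{n} \int_a^b \left(y(x) - \frac{I}{b-a}\right)^2 \cdot \frac{1}{b-a}\,dx$

(the centring constant is $E(y(X)) = I/(b-a)$)

Variance has two components:

– Order of magnitude:

∗ $\sigma_{\hat{I}} = O(n^{-1/2})$

∗ typical result for variance of sample mean

∗ Independent from dimension

– Variance of one observation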

Two questions

• How does this compare to competitors?

• How can we improve? → not on order of magnitude


Competitors: numerical integration (= quadrature)

Numerical integration is based on the principle: approximate the integrand by

a function that is easy to integrate.

The approximation is based on a limited number of observations of the integrand only, and it is constructed using interpolation or smoothing.

The error of numerical integration methods depends on several factors

• The smoothness of the integrand, in particular: is the integrand easy to approximate (see figures below)

• The number of function evaluations or observations n

• The locations $x_i$ at which the integrand is observed or evaluated

• The dimension (curse of dimensionality)


Functions that are difficult to approximate

Functions with

1. infinite slope

2. singularities

3. heavy oscillations

These features require locally dense observations/function evaluations


A very brief overview of quadrature methods

• For given $x_i$, $y(x_i)$, quadrature formulas are based on

– Approximation of the integrand by polynomials:

∗ Rectangular rule or Midpoint rule

∗ Trapezoid rule

∗ Simpson’s rule

– Breaking up the interval [a, b] into subintervals → composite rules

• When the $x_i$ are free to choose, the order of approximation can be optimised by choosing the $x_i$ to be the zeros of orthogonal polynomials → Gauss Quadrature


Accuracy of quadrature methods

Assuming that the integrand is “sufficiently smooth”, in one dimension the approximation $I_q$ of $I$ has the following accuracy

$|I - I_q| \le C \cdot n^{-1}$, and for many methods

$|I - I_q| \le C \cdot n^{-\alpha}$, with $\alpha > 1$

Compare with the precision of the random sampler, $\left[E(\hat{I} - I)^2\right]^{1/2} \sim n^{-1/2}$

Random sampling has two drawbacks

• Slower decay of error

• No hard upper bound for error
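The difference in decay rates can be observed numerically. The sketch below compares a composite trapezoid rule with the uniform Monte Carlo estimator on the smooth integrand exp(x) over [0, 1]; the integrand, the values of n and the helper names are assumptions made for this illustration only.

```python
import math
import random

def trapezoid(y, a, b, n):
    """Composite trapezoid rule with n subintervals."""
    h = (b - a) / n
    s = 0.5 * (y(a) + y(b))
    for i in range(1, n):
        s += y(a + i * h)
    return h * s

def mc_integrate(y, a, b, n, rng):
    """Uniform Monte Carlo estimate with n random evaluations."""
    return (b - a) * sum(y(rng.uniform(a, b)) for _ in range(n)) / n

rng = random.Random(1)
exact = math.e - 1.0               # integral of exp(x) over [0, 1]
errors = {}
for n in (10, 100, 1000):
    err_q = abs(trapezoid(math.exp, 0.0, 1.0, n) - exact)
    err_mc = abs(mc_integrate(math.exp, 0.0, 1.0, n, rng) - exact)
    errors[n] = (err_q, err_mc)
    print(n, err_q, err_mc)
```

For this smooth one-dimensional integrand the quadrature error shrinks by roughly a factor 100 for every tenfold increase of n, whereas the Monte Carlo error shrinks only by about a factor 3 on average and is itself random.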


Curse of dimensionality

Observation

If $n_1$ observations (function evaluations) are needed for a given accuracy of a numerical integration technique in one dimension, then the same technique extended into $d$ dimensions requires $n_1^d$ observations; the error is of the order of magnitude $O(1/n_1^{1/d})$

Reason

• Accuracy of numerical integration is a deterministic thing: we must cover every area in the region of integration to be sure that accuracy is met.

– Accuracy is thus directly linked to the interpoint distance

– High dimensions means many dimensions in which two points can be distant from each other.

– Many more observations are needed for the same interpoint distance

• Quadrature is based on clever approximations of functions. It’s hard to be clever

in high dimensions: hard to find equally good approximations.

No curse of dimensionality for stochastic simulation


Applications in statistics

• Computation of expected values $E(h(X)) = \int_{-\infty}^{\infty} f_X(x) h(x)\,dx$

• Computation of probabilities $P(X \in A) = E(\chi_A(X)) = \int_A f_X(x)\,dx$

($\chi_A(X)$ is the characteristic or indicator function of $A$)

• Computation of quantiles $Q_X(p) = F_X^{-1}(p)$ with $F_X(u) = \int_{-\infty}^{u} f_X(x)\,dx$

These problems appear in

• Bootstrapping and simulation

• Bayesian analysis: computation of posterior means, medians

• . . .


Non-uniform sampling

We have above the general expression $\mu = E(h(X)) = \int_{-\infty}^{\infty} f_X(x) h(x)\,dx$

which we can estimate by $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$

So, if we have an integral $I = \int_a^b y(x)\,dx$

then we can define $h(x)$ as $h(x) = \frac{y(x) \cdot \chi_{[a,b]}(x)}{f_X(x)}$ (if this ratio is bounded near zeros of $f_X(x)$)

and estimate $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(X_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{y(X_i) \cdot \chi_{[a,b]}(X_i)}{f_X(X_i)}$

where all $X_i$ are IID and have density $f_X(x)$.


Examples

• $f_X(x) = y(x)/x$, i.e., $h(x) = x$

(only possible if $y(x)/x$ is positive with integral equal to 1)

Then $I = \mu_X = E(X) = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$ and $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} X_i$

• $h(x) = \chi_A(x)$

Then $I = p = P(X \in A) = \int_A f_X(x)\,dx$ and $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} \chi_A(X_i) = \frac{\#\{i \mid X_i \in A\}}{n}$

• $f_X(x) = \frac{1}{b-a} \cdot \chi_{[a,b]}(x)$ and take $h(x)$ such that $h(x) \cdot f_X(x) = y(x)$ (where we assume that $y(x)$ is zero outside $[a, b]$; note that $h(x)$ outside $[a, b]$ is free to choose)

Then $I = \int_a^b y(x)\,dx$ and $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(X_i) = \frac{b-a}{n} \sum_{i=1}^{n} y(X_i)$

From these examples, it is clear that there are many ways to estimate an integral. We

formalise this problem.


The importance function

If X has density function fX(x) and we want to estimate

$\mu = E(h(X)) = \int_{-\infty}^{\infty} h(x) \cdot f_X(x)\,dx$,

then we may estimate this from a sample $X_i$ as

$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$

If it is easier to sample from $f_U(u)$ (for instance, uniform random variables are easy to generate), then we can write

$E(h(X)) = \int_{-\infty}^{\infty} h(u) \cdot f_X(u)\,du = \int_{-\infty}^{\infty} h(u) \cdot \frac{f_X(u)}{f_U(u)} \cdot f_U(u)\,du$

We call the new sampling distribution fU(u) the importance function


Importance sampling

With $f_U(u)$ an importance function, denote $w(u) = \frac{f_X(u)}{f_U(u)}$

As a result $\mu = E(h(X)) = \int_{-\infty}^{\infty} h(u) \cdot w(u) \cdot f_U(u)\,du$

We can now estimate $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} h(U_i) \cdot w(U_i)$

The question is now how to choose fU(u)

• It must be easy to generate samples from it

• The variance of the estimator must be as low as possible
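As a hedged illustration of the weighted estimator, the sketch below estimates the small tail probability P(X > 4) for X ∼ N(0, 1), so h is an indicator function. The importance function is taken to be the N(4, 1) density, which puts its mass where h is non-zero. This choice, the helper names and the sample size are assumptions for the example, not prescriptions from the slides.

```python
import math
import random

def normal_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def importance_estimate(h, f_x, f_u, draw_u, n):
    """Average of h(U_i) * w(U_i) with weights w = f_X / f_U."""
    total = 0.0
    for _ in range(n):
        u = draw_u()                     # U_i drawn from the importance function
        total += h(u) * f_x(u) / f_u(u)  # re-weighting by w(U_i)
    return total / n

rng = random.Random(2)
mu_hat = importance_estimate(
    h=lambda u: 1.0 if u > 4.0 else 0.0,      # indicator of the tail event
    f_x=normal_pdf,                           # target density N(0, 1)
    f_u=lambda u: normal_pdf(u, mean=4.0),    # importance function N(4, 1)
    draw_u=lambda: rng.gauss(4.0, 1.0),
    n=50_000,
)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # 1 - Phi(4)
print(mu_hat, exact)
```

Sampling directly from N(0, 1) would produce only a handful of draws beyond 4 in 50 000 trials, so the plain estimator would be far less precise for the same effort.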


The variance of importance sampling

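The variance equals $\mathrm{var}(\hat{\mu}) = \frac{1}{n} \cdot \mathrm{var}\left[h(U) \cdot w(U)\right]$

We can develop the variance of a single term as

$\mathrm{var}\left[h(U) \cdot w(U)\right] = E\left(\left[h(U) \cdot w(U)\right]^2\right) - \left(E\left[h(U) \cdot w(U)\right]\right)^2$

$= E\left(\left[h(U) \cdot w(U)\right]^2\right) - \mu^2 = E\left(\left[|h(U)| \cdot w(U)\right]^2\right) - \mu^2$

$\ge \left(E\left[|h(U)| \cdot w(U)\right]\right)^2 - \mu^2 = \left(E|h(X)|\right)^2 - \mu^2$

The lower bound is independent from $f_U(u)$. The inequality becomes an equality if for $V = |h(U)| \cdot w(U)$ it holds that $E(V^2) = (EV)^2$, or, $\mathrm{var}(V) = E(V^2) - (EV)^2 = 0$, thus if $V$ is deterministic (with prob. 1).

So, variance is minimized if $|h(U)| \cdot w(U) = K$ for any random $U$, i.e., $|h(u)| \cdot w(u) = K$, $\forall u \in \mathbb{R}$.

We have $|h(u)| \cdot w(u) = K \Leftrightarrow f_U(u) = \frac{|h(u)| \cdot f_X(u)}{K}$, where $K$ follows from imposing $\int_{-\infty}^{\infty} f_U(u)\,du = 1$.

Conclusion: minimum variance for $f_U(u) = \frac{|h(u)| \cdot f_X(u)}{\int_{-\infty}^{\infty} |h(u)| \cdot f_X(u)\,du}$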


Interpretation of this result

• The result is of little immediate use. Indeed, full application requires knowledge of $\int_{-\infty}^{\infty} |h(u)| \cdot f_X(u)\,du$

If $h(u) \ge 0$, $\forall u \in \mathbb{R}$, this is exactly the integral we are after. In the other case, computation of this integral is probably equally difficult as the original question.

• var(µ) can be much lower than when estimating µ with samples from fX(x).

• The basic idea is that fU(u) should behave not just as fX(x), but it should

also “follow” |h(u)|. Regions where h(u) is large in magnitude should be

sampled more.

• Pay special attention to tails of |h(u)| · fX(u)


Example with mixture of uniform sampling

• Mixture of L uniform random variables

• Uniform on (non-convex) subdomains $I_\ell$ defined by $I_\ell = \left\{x \,\middle|\, |y(x)| \ge \frac{\ell}{L} \max |y(x)|\right\}$

• mixture probability mass functions $p_\ell = |I_\ell| / \sum_{\ell=1}^{L} |I_\ell|$


Example from Bayesian statistics

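Suppose we observe $X_i | M \sim N(M, \sigma^2)$, $i = 1, \ldots, n$, with $\sigma^2$ known. We want to estimate the mean $M$, for which we impose a Cauchy prior model

$f_M(m) = \frac{1}{\pi (1 + (m-\mu)^2)}$,

where hyperparameter $\mu$ may express prior knowledge of expected values (could be zero, e.g.)

The conditional sample density is

$f_{X|M}(x|m) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-(x_i-m)^2/2\sigma^2}$

$= \frac{1}{(2\pi)^{n/2}\sigma^n} \cdot e^{-\sum_{i=1}^{n}(x_i-m)^2/2\sigma^2}$

$= \frac{1}{(2\pi)^{n/2}\sigma^n} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)} \cdot e^{-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2}$

with $\bar{x}$ the sample mean of the observations. Then the joint distribution is

$f_{M,X}(m,x) = f_M(m) \cdot f_{X|M}(x|m)$

$= \frac{1}{\pi (1 + (m-\mu)^2)} \cdot \frac{1}{(2\pi)^{n/2}\sigma^n} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)} \cdot e^{-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2}$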


Marginal distribution in Bayesian model

So the marginal distribution of X becomes

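$f_X(x) = \int_{-\infty}^{\infty} f_{M,X}(m,x)\,dm = \int_{-\infty}^{\infty} f_M(m) \cdot f_{X|M}(x|m)\,dm$

$= C(x) \cdot \int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm$

with $C(x) = \frac{1}{\pi} \cdot \frac{1}{(2\pi)^{n/2}\sigma^n} \cdot e^{-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2}$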

(Note that the integral exists thanks to the rapid decay of the normal bell curve)


Bayes’ rule: posterior = joint/marginal

The posterior distribution of M , given the observation X becomes

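$f_{M|X}(m|x) = \frac{f_M(m) \cdot f_{X|M}(x|m)}{f_X(x)} = \frac{f_M(m) \cdot f_{X|M}(x|m)}{\int_{-\infty}^{\infty} f_M(m) \cdot f_{X|M}(x|m)\,dm}$

$= \frac{C(x) \cdot \frac{1}{1 + (m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}}{C(x) \cdot \int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$

$= \frac{\frac{1}{1 + (m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}}{\int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$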


Bayesian estimation

Possible values of interest are

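the posterior mean

$E(M|X = x) = \frac{\int_{-\infty}^{\infty} \frac{m}{1 + (m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}{\int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$

and the posterior variance, which is $\mathrm{var}(M|X = x) = E(M^2|X = x) - \left[E(M|X = x)\right]^2$ with

$E(M^2|X = x) = \frac{\int_{-\infty}^{\infty} \frac{m^2}{1 + (m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}{\int_{-\infty}^{\infty} \frac{1}{1 + (m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$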


Monte-Carlo procedure for integrals

At least two possibilities:

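• Sample from normal density with expected value $\bar{x}$ and variance $\sigma^2/n$: $X_n \sim N(\bar{x}, \sigma^2/n)$. Then

$E(M^k|X = x) = \frac{E\left(\frac{X_n^k}{1 + (X_n-\mu)^2}\right)}{E\left(\frac{1}{1 + (X_n-\mu)^2}\right)}$

• Sample from Cauchy density with center (median) $\mu$: $f_U(u) = 1/[\pi(1 + (u-\mu)^2)]$. Then

$E(M^k|X = x) = \frac{E\left(U^k \cdot e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)}{E\left(e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)}$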

• Sample from another distribution “as close as possible” to the integrand.
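The first two options can be sketched as ratio estimators in Python. The numerical values below (sample mean 1, σ = 1, n = 20 observations, prior center µ = 0, 20 000 Monte Carlo draws) and the function names are hypothetical choices made for the illustration.

```python
import math
import random

def posterior_moment_normal(k, xbar, sigma, n, mu, n_mc, rng):
    """Ratio estimate of E(M^k | X = x) from draws X_n ~ N(xbar, sigma^2 / n)."""
    num = den = 0.0
    sd = sigma / math.sqrt(n)
    for _ in range(n_mc):
        x = rng.gauss(xbar, sd)
        weight = 1.0 / (1.0 + (x - mu) ** 2)   # Cauchy factor of the integrand
        num += x ** k * weight
        den += weight
    return num / den

def posterior_moment_cauchy(k, xbar, sigma, n, mu, n_mc, rng):
    """Ratio estimate of E(M^k | X = x) from draws U ~ Cauchy(mu, 1)."""
    num = den = 0.0
    s2 = sigma ** 2 / n
    for _ in range(n_mc):
        u = mu + math.tan(math.pi * (rng.random() - 0.5))  # Cauchy draw by inversion
        weight = math.exp(-(u - xbar) ** 2 / (2.0 * s2))   # normal factor of the integrand
        num += u ** k * weight
        den += weight
    return num / den

rng = random.Random(3)
mean_normal = posterior_moment_normal(1, 1.0, 1.0, 20, 0.0, 20_000, rng)
mean_cauchy = posterior_moment_cauchy(1, 1.0, 1.0, 20, 0.0, 20_000, rng)
print(mean_normal, mean_cauchy)
```

Both estimators target the same posterior mean, which for these values lies slightly below the sample mean because the prior pulls towards µ = 0. In the Cauchy version most draws receive a weight close to zero, so fewer draws contribute effectively.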


Arguments for using the normal sampler

In this case, the normal density is a much better choice than the Cauchy

• The tails of the integrand are lighter than the normal tail, so the heavy

tails of the Cauchy produce a lot of large samples whose values are not

representative for the integral

• The normal density has sample size (n) dependent variance, so that samples get more concentrated for large n, which corresponds to the true shape of the integrand


Experiment: normal vs. Cauchy samplers

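As an illustration, we plot the estimates of the standard errors in estimating the following parameter

$I = \int_{-\infty}^{\infty} \frac{1}{\pi \cdot (1 + (u-\mu)^2)} \cdot \frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}} \cdot e^{-(u-\bar{x})^2/(2\sigma^2/n)}\,du$

$= E\left(\frac{1}{\pi \cdot (1 + (X_n-\mu)^2)}\right) = E\left(\frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}} \cdot e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)$

with $X_n \sim N(\bar{x}, \sigma^2/n)$ and $U \sim \mathrm{Cauchy}(\mu, 1)$

We simulate $X_{n,i}$ and $U_i$ for $i = 1, \ldots, n_{MC}$ and define

$\hat{I}_1 = \frac{1}{n_{MC}} \sum_{i=1}^{n_{MC}} \frac{1}{\pi \cdot (1 + (X_{n,i}-\mu)^2)}$

$\hat{I}_2 = \frac{1}{n_{MC}} \sum_{i=1}^{n_{MC}} \frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}} \cdot e^{-(U_i-\bar{x})^2/(2\sigma^2/n)}$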


Precision of the estimators

We can easily estimate the variances

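$\mathrm{var}(\hat{I}_1) = \frac{1}{n_{MC}} \mathrm{var}\left(\frac{1}{\pi \cdot (1 + (X_n-\mu)^2)}\right)$

$\mathrm{var}(\hat{I}_2) = \frac{1}{n_{MC}} \mathrm{var}\left(\frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}} \cdot e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)$

The estimates of the standard errors of a single observation (to be divided by $\sqrt{n_{MC}}$) are depicted below (together with the log of the estimates, to better show the behavior)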

[Figure, two panels: estimated st.dev. of one observation, and log(estimated st.dev.) of one observation, both as a function of n (0 to 200), for Cauchy samples and Normal samples]

Interpretation: for n growing (this is not $n_{MC}$), the tail of the integrand becomes lighter, making the Cauchy sampler less and less attractive. The normal sampler comes closer to the integrand.


Importance function must have sufficiently heavy tails

Previous examples have illustrated that it is of little use for the sampling density function to have heavier tails than the integrand.

The opposite is, however, much worse (so if no perfect match can be realized, a slightly too heavy tail is preferable)

We have $\mathrm{var}(\hat{\mu}) = \frac{1}{n} \cdot \mathrm{var}\left[h(U) \cdot w(U)\right] = \frac{1}{n} \cdot \left(E\left[(h(U) \cdot w(U))^2\right] - \mu^2\right)$

Herein

$E\left[(h(U) \cdot w(U))^2\right] = \int_{-\infty}^{\infty} [h(u)]^2 [w(u)]^2 f_U(u)\,du$

$= \int_{-\infty}^{\infty} [h(u)]^2 \frac{f_X(u)^2}{f_U(u)}\,du$

$= \int_{-\infty}^{\infty} \frac{h(u) f_X(u)}{f_U(u)} \cdot h(u) f_X(u)\,du$

If h(u)fX(u) has a heavier tail than fU(u), then the first factor tends to infinity

for u → ∞. The integral may then be large or even infinity, depending on the

tail of h(u)fX(u)


Conclusions about importance sampling

Importance sampling allows us to

• estimate expected values (integrals) with a random variable X whose distribution does not allow easy simulations, by drawing from another random variable U which is easier, followed by proper re-weighting.

• optimize (to some extent) the choice of sample distribution to estimate integrals.

We will later discuss rejection sampling, which also samples from an auxiliary distribution. The outcome is then rejected or accepted with an appropriate probability such that, a posteriori (given the event of acceptance), the variable takes the aimed distribution. Unlike importance sampling, the correction of rejection sampling thus proceeds at the level of the random number generator itself (and not at the level of computing the integral). We therefore discuss random number generators.


1.2 Random number generators

Importance sampling assumes that we can generate numbers from a given

distribution. How can we do that?


1.2.1 Quantile or inversion method

Theorem If $U \sim \mathrm{uniform}[0, 1]$ and $Q_X(p)$ is the quantile function of $X$, then $Q_X(U)$ has the same distribution as $X$, i.e.: $U \sim \mathrm{uniform}[0, 1] \Rightarrow Q_X(U) \stackrel{d}{=} X$

Proof uses monotonicity of $Q_X(p)$ or its inverse $F_X(x)$

$P(Q_X(U) \le x) = P(U \le Q_X^{-1}(x)) = F_U(Q_X^{-1}(x)) = Q_X^{-1}(x) = F_X(x)$

Example 1: Let $X \sim \exp(\lambda)$, then $F_X(x) = 1 - e^{-\lambda x}$, so $Q_X(p) = -\log(1-p)/\lambda$, so if $U \sim \mathrm{uniform}[0, 1]$, then take

$X = -\log(1-U)/\lambda$,

or, because $1-U$ is also uniform (by symmetry), we can take

$Y = -\log(U)/\lambda$
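A minimal Python sketch of the quantile method for the exponential example follows. The rate λ = 2, the sample size and the function name are assumptions for the illustration. The version with 1 − U is used because Python's `random()` returns values in [0, 1), so the argument of the logarithm is never zero.

```python
import math
import random

def draw_exponential(lam, rng):
    """Inversion method: X = Q_X(U) = -log(1 - U) / lam."""
    u = rng.random()                   # U in [0, 1)
    return -math.log(1.0 - u) / lam    # 1 - U lies in (0, 1], so the log is finite

rng = random.Random(4)
lam = 2.0
sample = [draw_exponential(lam, rng) for _ in range(100_000)]
mean = sum(sample) / len(sample)
print(mean)                            # should be close to 1 / lam = 0.5
```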


Quantile or inversion method (2)

Example 2: Let $X \sim$ Cauchy with median $\mu$, i.e., $f_X(x) = \frac{1}{\pi[1 + (x-\mu)^2]}$

If $\mu \ne 0$, then $X$ can be generated by adding $\mu$ to a Cauchy random variable with median 0.

So, we assume that $\mu = 0$.

Then $F_X(x) = \frac{1}{2} + \frac{1}{\pi} \arctan(x)$, and $Q_X(U) = \tan[\pi(U - 1/2)]$

Note: if $X \sim \mathrm{Cauchy}(\mu = 0)$ then $-X \sim \mathrm{Cauchy}(\mu = 0)$ and $1/X \sim \mathrm{Cauchy}(\mu = 0)$.

Indeed, for $Y = 1/X$, $f_Y(y) = f_X(x(y)) \left|\frac{dx(y)}{dy}\right| = \frac{1}{\pi} \cdot \frac{1}{1 + 1/y^2} \cdot \frac{1}{y^2} = \frac{1}{\pi} \cdot \frac{1}{1 + y^2}$

So, if $X \sim \mathrm{Cauchy}(\mu = 0)$ then $Y = -1/X \sim \mathrm{Cauchy}(\mu = 0)$.

And if $X = \tan[\pi(U - 1/2)] \sim \mathrm{Cauchy}(\mu = 0)$, then $Y = -1/X = \tan(\pi U) \sim \mathrm{Cauchy}(\mu = 0)$


Example: Box-Muller transform for normal (1)

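Problem: the normal CDF $F_Z(z) = \Phi(z)$ has no closed formula, so working with the quantile $Q_Z(U)$ is not possible, unless software provides detailed tables of $Q_Z(p)$

Solution: we go for two independent normal RV: $(Z_1, Z_2) \sim N_2(0, I_2)$, then we know:

• $Z_1^2 + Z_2^2 \sim \chi^2(2) = \exp(1/2)$

Indeed

1. $P(Z^2 < x) = P(-\sqrt{x} < Z < \sqrt{x}) = \Phi(\sqrt{x}) - \Phi(-\sqrt{x})$

Hence $f_{Z^2}(x) = \left[\phi(\sqrt{x}) + \phi(-\sqrt{x})\right]/(2\sqrt{x}) = e^{-x/2}/\sqrt{2\pi x}$

which is: $Z^2 \sim \chi^2(1) = \Gamma(1/2, 1/2)$

2. Let $Y = Z_1^2 + Z_2^2$, then

$f_Y(y) = \int_0^y f_{Z_1^2}(z) f_{Z_2^2}(y-z)\,dz = \int_0^y \frac{e^{-z/2}}{\sqrt{2\pi z}} \cdot \frac{e^{-(y-z)/2}}{\sqrt{2\pi(y-z)}}\,dz$

$= \frac{e^{-y/2}}{2\pi} \int_0^y \frac{dz}{\sqrt{z(y-z)}} = \frac{e^{-y/2}}{2\pi} \int_0^1 \frac{dt}{\sqrt{t(1-t)}}$

$= \frac{e^{-y/2}}{2\pi} B(1/2, 1/2) = \frac{e^{-y/2}}{2\pi} \cdot \frac{\Gamma(1/2)\Gamma(1/2)}{\Gamma(1/2 + 1/2)} = e^{-y/2}/2$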



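• $\sqrt{Z_1^2 + Z_2^2} \sim$ Rayleigh

Indeed, let $Y = \sqrt{Z_1^2 + Z_2^2}$, then

$F_Y(y) = P(Y \le y) = P(Y^2 \le y^2) = 1 - e^{-y^2/2}$

because $F(x) = 1 - e^{-\lambda x}$ is the CDF of the exponential distribution

• $\frac{Z_1}{Z_2} \sim \mathrm{Cauchy}(\mu = 0)$

Indeed, let $X = Z_1/Z_2$, so, $Z_1 = X Z_2$, then

$F_X(x) = P(X \le x) = \int_{-\infty}^{\infty} f_{Z_2}(z) P(X \le x | Z_2 = z)\,dz$

$= \int_{-\infty}^{0} f_{Z_2}(z) P(Z_1 \ge zx)\,dz + \int_0^{\infty} f_{Z_2}(z) P(Z_1 \le zx)\,dz$

$= 2 \int_0^{\infty} f_{Z_2}(z) P(Z_1 \le zx)\,dz$

and so, $f_X(x) = 2 \int_0^{\infty} f_{Z_2}(z) f_{Z_1}(zx)\, z\,dz = \frac{2}{2\pi} \int_0^{\infty} z e^{-(1+x^2)z^2/2}\,dz = \frac{1}{\pi} \int_0^{\infty} \frac{e^{-u}\,du}{1 + x^2} = \frac{1}{\pi (1 + x^2)}$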



We propose to generate a Cauchy RV $X_1$ and an exponential RV $X_2 \sim \exp(1/2)$, using $X_1 = \tan(2\pi U_1)$ and $X_2 = -2\log(U_2)$

Note that $X_1 = \tan(2\pi U_1)$ is Cauchy with the same parameters as $X_1' = \tan(\pi U_1)$, since $\tan(\pi u)$ has period 1. We take $X_1 = \tan(2\pi U_1)$ instead of $X_1' = \tan(\pi U_1)$ for reasons explained below.

Then solve the system

$Z_2/Z_1 = X_1 = \tan(2\pi U_1)$

$Z_1^2 + Z_2^2 = X_2 = -2\log(U_2) = \log(1/U_2^2)$

(by symmetry, $Z_2/Z_1$ is Cauchy just like $Z_1/Z_2$)

So suppose $U_1$ and $U_2$ are 2 independent, uniform r.v. on [0, 1] and let

$Z_1 = \sqrt{\log(1/U_2^2)} \cos(2\pi U_1)$

$Z_2 = \sqrt{\log(1/U_2^2)} \sin(2\pi U_1)$.

Here $\sin(2\pi U_1)$ and $\cos(2\pi U_1)$ have the same distribution on [−1, 1]. This would not be the case with the angle $\pi U_1$, since $\sin(\pi U_1) \in [0, 1]$. This is why we take $X_1 = \tan(2\pi U_1)$.
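The transform can be sketched in Python as follows. The radius is computed as the square root of −2 log(U2), which equals the square root of log(1/U2²), and the angle is 2πU1. The function name, the sample size and the use of 1 − U to keep the logarithm finite are choices made for this illustration.

```python
import math
import random

def box_muller(rng):
    """Return a pair (Z1, Z2) from two independent uniforms."""
    u1 = rng.random()
    u2 = 1.0 - rng.random()                  # in (0, 1], avoids log(0)
    radius = math.sqrt(-2.0 * math.log(u2))  # sqrt(log(1 / U2^2)), Rayleigh distributed
    angle = 2.0 * math.pi * u1               # uniform angle on the full circle
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(5)
pairs = [box_muller(rng) for _ in range(50_000)]
z = [v for pair in pairs for v in pair]
mean = sum(z) / len(z)
var = sum((v - mean) ** 2 for v in z) / len(z)
cross = sum(z1 * z2 for z1, z2 in pairs) / len(pairs)
print(mean, var, cross)   # expected to be near 0, 1 and 0
```

The sample mean, variance and cross-moment only give a rough sanity check; the change of variables that follows verifies the joint density exactly.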



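Doublecheck: we are given the $\mathbb{R}^2 \to \mathbb{R}^2$ transformation

$Z = g(U) = \sqrt{\log(1/U_2^2)} \cdot \begin{bmatrix} \cos(2\pi U_1) \\ \sin(2\pi U_1) \end{bmatrix}$.

The inverse $g^{-1}$:

$U_2 = e^{-\frac{1}{2}(Z_1^2 + Z_2^2)}$

$U_1 = \frac{1}{2\pi} \arctan\left(\frac{Z_2}{Z_1}\right)$.

Using $\frac{d}{dx} \arctan(x) = \frac{1}{1+x^2}$, we find $(Z_1, Z_2) \sim NID(0, 1)$:

$f_{Z_1,Z_2}(z_1, z_2) = f_{U_1,U_2}\left(\frac{1}{2\pi} \arctan\left(\frac{z_2}{z_1}\right), e^{-\frac{1}{2}(z_1^2+z_2^2)}\right) |J|$

$= 1 \cdot \left|\det \begin{bmatrix} \frac{\partial u_1}{\partial z_1} & \frac{\partial u_1}{\partial z_2} \\ \frac{\partial u_2}{\partial z_1} & \frac{\partial u_2}{\partial z_2} \end{bmatrix}\right|$

$= \left|\det \begin{bmatrix} \frac{1}{2\pi} \cdot \frac{1}{1+(z_2/z_1)^2} \cdot \frac{-z_2}{z_1^2} & \frac{1}{2\pi} \cdot \frac{1}{1+(z_2/z_1)^2} \cdot \frac{1}{z_1} \\ e^{-\frac{1}{2}(z_1^2+z_2^2)} \cdot (-z_1) & e^{-\frac{1}{2}(z_1^2+z_2^2)} \cdot (-z_2) \end{bmatrix}\right|$

$= \frac{1}{2\pi} \cdot e^{-\frac{1}{2}(z_1^2+z_2^2)} \cdot \frac{1}{1+(z_2/z_1)^2} \cdot \left(\frac{z_2^2}{z_1^2} + 1\right)$

$= \frac{1}{\sqrt{2\pi}} e^{-z_1^2/2} \cdot \frac{1}{\sqrt{2\pi}} e^{-z_2^2/2}$.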


1.2.2 Rejection sampling

Suppose we want to generate random numbers with density $f(x)$ and cumulative distribution $F(x)$.

Theorem

Let $X \sim g_X$ and $\forall x \in \mathbb{R}: f(x) \le M \cdot g_X(x)$ and let $U \sim \mathrm{uniform}[0, 1]$, independent from $X$, then

$F(x) = P\left(X \le x \,\middle|\, U \le \frac{f(X)}{M \cdot g_X(X)}\right)$


Rejection sampling: proof of the theorem

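$F(x) = P\left(X \le x \,\middle|\, U \le \frac{f(X)}{M \cdot g_X(X)}\right)$

Proof

$P\left(X \in A \;\&\; U \le \frac{f(X)}{M \cdot g_X(X)}\right) = \int_A g_X(x) \cdot P\left(U \le \frac{f(X)}{M \cdot g_X(X)} \,\middle|\, X = x\right) dx$

$= \int_A g_X(x) \cdot P\left(U \le \frac{f(x)}{M \cdot g_X(x)}\right) dx$

$= \int_A g_X(x) \cdot \frac{f(x)}{M \cdot g_X(x)}\,dx$

$= \int_A \frac{f(x)}{M}\,dx = \frac{1}{M} \int_A f(x)\,dx$

Hence, if $A = ]-\infty, x]$, then

$P\left(X \in A \,\middle|\, U \le \frac{f(X)}{M \cdot g_X(X)}\right) = \frac{P\left(X \in A \;\&\; U \le \frac{f(X)}{M \cdot g_X(X)}\right)}{P\left(U \le \frac{f(X)}{M \cdot g_X(X)}\right)} = \frac{P\left(X \in A \;\&\; U \le \frac{f(X)}{M \cdot g_X(X)}\right)}{P\left(X \in \mathbb{R} \;\&\; U \le \frac{f(X)}{M \cdot g_X(X)}\right)}$

$= \frac{\frac{1}{M} \int_A f(x)\,dx}{1/M} = \int_A f(x)\,dx = F(x)$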


Algorithm

Situation and aim

We want a random number $X$ with density function $f(x)$. We have no expression for $F(x)$ that we can invert. We can generate numbers according to a different law $g_X(x)$ and we know that $f(x) \le M \cdot g_X(x)$, for all values of $x$.

Pseudo-code

continue-search = TRUE

While continue-search

• Generate X ∼ gX

• Generate U ∼ uniform[0, 1]

• If U ≤ f(X)/[M · gX(X)] then continue-search = FALSE

The output is X
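The pseudo-code translates almost line by line into Python. In the sketch below, the target density f(x) = 6x(1 − x) on [0, 1] (a Beta(2, 2) density), the uniform proposal and the bound M = 1.5 are assumptions chosen so that f(x) ≤ M · g(x) holds; they do not come from the slides.

```python
import random

def rejection_sample(f, g, draw_g, M, rng):
    """Draw one value with density f, given a proposal g with f <= M * g."""
    while True:
        x = draw_g()                     # generate X ~ g_X
        u = rng.random()                 # generate U ~ uniform[0, 1]
        if u <= f(x) / (M * g(x)):       # accept with probability f(X) / (M g(X))
            return x

rng = random.Random(6)
f = lambda x: 6.0 * x * (1.0 - x)        # hypothetical target density on [0, 1]
g = lambda x: 1.0                        # uniform proposal density on [0, 1]
sample = [rejection_sample(f, g, rng.random, 1.5, rng) for _ in range(50_000)]
mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / len(sample)
print(mean, var)                         # Beta(2, 2) has mean 0.5 and variance 0.05
```

On average M proposals are needed per accepted value, so here about two out of three proposals are accepted.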


How to choose gX(x)?

• X ∼ gX should be easy to generate

• gX(x) should be as close as possible to f(x), such that M can be close

to 1, and rejection probabilities are low. Otherwise, computational efforts

increase.

• Some combinations don’t work: for instance, one can never generate Cauchy

variables by rejection sampling applied to normal variables, simply because

there is no M satisfying the condition.


Example 1: generating Gamma-distributed r.v.

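Let $X \sim \mathrm{Gamma}(\lambda, \alpha)$, i.e., $f_X(x) = \frac{x^{\alpha-1} \lambda^{\alpha} e^{-\lambda x}}{\Gamma(\alpha)}$

If $\alpha$ is integer, then we can write $X = \sum_{i=1}^{\alpha} X_i$ with independent $X_i \sim \exp(\lambda)$

If $\alpha$ is not integer, denote $\delta = \alpha - \lfloor\alpha\rfloor$ and $r = \lfloor\alpha\rfloor$

Then we can decompose or generate $X$ as $X = (X_r + X_\delta)/\lambda$ with $X_r \sim \mathrm{Gamma}(1, r)$ and $X_\delta \sim \mathrm{Gamma}(1, \delta)$ and both independent.

We can generate $X_r$ as a sum of exponentials, but for $X_\delta$, the quantile method does not work, so we need another direction.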


Generating Gamma values with small α

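The density function of $X \sim \mathrm{Gamma}(1, \delta)$ is $f_X(x) = \frac{x^{\delta-1} e^{-x}}{\Gamma(\delta)}$

It is depicted below for $\delta = 0.23$

[Figure: density of Gamma(1, δ) for δ = 0.23, on the interval from 0 to 6]

Not straightforward to bound by some $M \cdot g_X(x)$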


A mixture distribution as upper bound

We will use a mixture distribution. Suppose

V ∼ uniform[0, 1]

S = χ[0,p](V ) for some value p

X1 ∼ g1 with g1(x) = δ · xδ−1 on [0, 1]

X2 − 1 ∼ exp(1), hence g2(x) = e−(x−1) = e · e−x on [1,∞[

X = S ·X1 + (1− S) ·X2

In other words, X = X1 with probability p and X = X2 with probability 1− p.

Therefore, generate two uniform RV: $V$ and $W$. If $V < p$, then $X = Q_{X_1}(W)$, otherwise $X = Q_{X_2}(W)$.

In one formula: $X = I(V < p)\, Q_{X_1}(W) + I(V \ge p)\, Q_{X_2}(W)$

where $Q_{X_1}(W) = W^{1/\delta}$ and $Q_{X_2}(W)$ as on slide 30

Remark We can generate $W$ and $S = I(V < p)$ both from $V$ in a mutually independent way. In particular, let $W = SV/p + (1-S)(1-V)/(1-p)$, then $W$ is independent from $S$.


The mixture distribution and density

The cumulative distribution of $X$, denoted as $G_X(x)$, is then

$G_X(x) = P(S = 0) \cdot P(X \le x | S = 0) + P(S = 1) \cdot P(X \le x | S = 1)$

$= (1-p) \cdot P(X_2 \le x | S = 0) + p \cdot P(X_1 \le x | S = 1)$

$= (1-p) \cdot G_2(x) + p \cdot G_1(x)$

and from there $g_X(x) = p \cdot g_1(x) + (1-p) \cdot g_2(x)$

In our case $g_X(x) = p \cdot \delta \cdot x^{\delta-1} \cdot \chi_{[0,1]}(x) + (1-p) \cdot e \cdot e^{-x} \cdot \chi_{[1,\infty[}(x)$


Optimizing the parameters in the function gX(x)

The value of p can be chosen to minimize the number of rejections, i.e., to minimize M .

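• We need that $M \cdot g_X(x) \ge f_X(x)$

• For $x \in [0, 1]$, this means that

$M \cdot p \cdot \delta \cdot x^{\delta-1} \ge \frac{e^{-x} \cdot x^{\delta-1}}{\Gamma(\delta)} \Leftrightarrow M \ge \frac{e^{-x}}{p\,\delta\,\Gamma(\delta)}$

The maximum in the right hand side is reached if $x = 0$, hence $M \ge \frac{1}{p\,\delta\,\Gamma(\delta)}$

• For $x \ge 1$, this becomes

$M \cdot (1-p) \cdot e \cdot e^{-x} \ge \frac{e^{-x} \cdot x^{\delta-1}}{\Gamma(\delta)} \Leftrightarrow M \ge \frac{x^{\delta-1}}{(1-p)\,e\,\Gamma(\delta)}$

The maximum in the right hand side is reached if $x = 1$, hence $M \ge \frac{1}{(1-p)\,e\,\Gamma(\delta)}$

• The minimum $M$ can be obtained if both lower bounds for $M$ are equal, i.e., if

$p\,\delta\,\Gamma(\delta) = (1-p)\,e\,\Gamma(\delta) \Leftrightarrow p = \frac{e}{e + \delta}$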


The resulting algorithm

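We have $M = \frac{1}{p\,\delta\,\Gamma(\delta)} = \frac{e + \delta}{e\,\delta\,\Gamma(\delta)}$

So, for $x \in [0, 1]$, we find $\frac{f_X(x)}{M \cdot g_X(x)} = \frac{e^{-x} \cdot x^{\delta-1}/\Gamma(\delta)}{M \cdot p \cdot \delta \cdot x^{\delta-1}} = e^{-x}$, which is smaller than 1

and for $x > 1$, we find $\frac{f_X(x)}{M \cdot g_X(x)} = \frac{e^{-x} \cdot x^{\delta-1}/\Gamma(\delta)}{M \cdot (1-p) \cdot e^{1-x}} = x^{\delta-1}$, which is smaller than 1 because $\delta - 1$ is negative.

While search == true,

• Generate independent U, V, W ∼ unif([0, 1])

• if V < p = e/(e + δ),

then

– X = W^(1/δ)

– If U < e^(−X), then search ← false

else

– X = 1 − log(W) (because X2 − 1 ∼ exp(1))

– If U < X^(δ−1), then search ← false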
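A hedged Python sketch of this sampler for Gamma(1, δ) with 0 < δ < 1 follows. Since the second mixture component lives on [1, ∞[ with X2 − 1 ∼ exp(1), the second branch draws X = 1 − log(W). The function name, the value δ = 0.23 and the sample size are assumptions for the illustration.

```python
import math
import random

def draw_gamma_small_shape(delta, rng):
    """Rejection sampler for Gamma(1, delta), 0 < delta < 1, with the mixture proposal."""
    p = math.e / (math.e + delta)          # optimal mixture weight p = e / (e + delta)
    while True:
        u = rng.random()
        v = rng.random()
        w = 1.0 - rng.random()             # in (0, 1], keeps log(w) finite
        if v < p:
            x = w ** (1.0 / delta)         # X1 = Q_X1(W), density delta * x^(delta - 1) on [0, 1]
            if u < math.exp(-x):           # acceptance ratio e^(-x)
                return x
        else:
            x = 1.0 - math.log(w)          # X2 = 1 + exponential(1), lives on [1, inf[
            if u < x ** (delta - 1.0):     # acceptance ratio x^(delta - 1)
                return x

rng = random.Random(7)
delta = 0.23
sample = [draw_gamma_small_shape(delta, rng) for _ in range(100_000)]
m1 = sum(sample) / len(sample)
m2 = sum(x * x for x in sample) / len(sample)
print(m1, m2)   # Gamma(1, delta): E(X) = delta, E(X^2) = delta * (delta + 1)
```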


Example 2: computing the integral on page 25

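In Bayesian inference with a Cauchy prior and normal errors, we have to compute a ratio of the form

$r = \frac{\int_{-\infty}^{\infty} x^m \cdot \frac{1}{b\pi \cdot \left[1 + \left(\frac{x-a}{b}\right)^2\right]} \cdot \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-(x-\mu)^2/2\sigma^2}\,dx}{\int_{-\infty}^{\infty} \frac{1}{b\pi \cdot \left[1 + \left(\frac{x-a}{b}\right)^2\right]} \cdot \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-(x-\mu)^2/2\sigma^2}\,dx}$

Using rejection sampling, we will generate data $X$ from a distribution proportional to

$f_X(x) = K \cdot \frac{1}{1 + \left(\frac{x-a}{b}\right)^2} \cdot g_X(x)$,

where $g_X(x)$ is the normal density function.

It then holds that $r = E(X^m)$

We can draw observations from $X$ even if we know $f_X(x)$ only up to a constant. (see next slide)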


Example 2: drawing observations without knowing the normalisation constant

We can draw observations from $X$ even if we know $f_X(x)$ only up to a constant.

Indeed, let $f_X(x) = K \cdot f(x)$ with $K$ unknown, then $\frac{f_X(x)}{M \cdot g_X(x)} = \frac{K \cdot f(x)}{M \cdot g_X(x)}$

Herein $f(x)$ and $g_X(x)$ are known. If $f(x)/g_X(x)$ is known to be bounded by $C$, then take $M \ge KC$.

In the example above

$\frac{f_X(x)}{M \cdot g_X(x)} = \frac{K \cdot g_X(x) \cdot \frac{1}{1 + \left(\frac{x-a}{b}\right)^2}}{M \cdot g_X(x)} = \frac{K}{M\left[1 + \left(\frac{x-a}{b}\right)^2\right]}$

Take $M = K$, then the result is bounded by 1. (Note that $M = K$ remains unknown)

So generate $X \sim g_X$ and $U \sim \mathrm{uniform}[0, 1]$, then accept $X$ if $U \le \frac{1}{1 + \left(\frac{X-a}{b}\right)^2}$
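This acceptance step can be sketched in Python without ever computing K. A proposal X from the normal density is accepted when U does not exceed 1/(1 + ((X − a)/b)²), and the ratio r for m = 1 is then estimated by the mean of the accepted values. The numerical values (a = 0, b = 1, normal mean 1, normal standard deviation 0.2236, i.e. variance 0.05) and the function name are hypothetical.

```python
import random

def draw_posterior(a, b, mu, sigma, rng):
    """Rejection sampling from K * g_X(x) / (1 + ((x - a) / b)^2), K unknown."""
    while True:
        x = rng.gauss(mu, sigma)                       # proposal X ~ g_X (normal)
        u = rng.random()
        if u <= 1.0 / (1.0 + ((x - a) / b) ** 2):      # ratio f_X / (M g_X) with M = K
            return x

rng = random.Random(8)
sample = [draw_posterior(0.0, 1.0, 1.0, 0.2236, rng) for _ in range(20_000)]
r1 = sum(sample) / len(sample)                         # estimate of r for m = 1
print(r1)
```

The accepted values are draws from the posterior itself, so any other posterior summary (median, variance, quantiles) can be estimated from the same sample.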


2. Markov Chain Monte Carlo Methods

• Monte Carlo Methods are based on independent sampling, law of large

numbers, central limit theorem

• Independent sampling may be difficult to realize, especially when we sample from a high-dimensional vector X

• Markov Chain Monte Carlo (MCMC) Methods simulate a sequence of

dependent observations


2.1 Markov Chains

Discrete time Markov Chain←→ continuous time Markov Chain

We consider discrete time MC

Discrete state space MC←→ general state space MC

A Discrete state space MC is a sequence of RV’s $(X_n; n \in \mathbb{N})$ for which $X_n \in E$. The state space $E$ is countable and thus homomorphic with $\mathbb{Z}$. (We can take $E = \mathbb{Z}$.) The sequence satisfies the Markov condition, i.e.,

$P(X_{n+1} = j | X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = P(X_{n+1} = j | X_n = i_n)$

Define $P^{(n)}_{ij} = P(X_{n+1} = j | X_n = i)$

The Markov Chain is stationary or homogeneous if $P^{(n)}_{ij}$ does not depend on $n$. We can write $P_{ij} = P(X_{n+1} = j | X_n = i)$

The matrix with elements Pij is called the transition matrix


Irreducibility

(We further assume stationary MC, unless otherwise stated)

n-step Transitions

If $P$ is the transition matrix of a discrete space Markov process, then

$P(X_{m+n} = j | X_m = i) = (P^n)_{ij}$

Accessibility

A state $j$ is accessible from a state $i$ if $\exists n \in \mathbb{N}$, such that $(P^n)_{ij} > 0$. We denote $i \to j$

Two states are communicating if they are mutually accessible from each

other. We denote i↔ j

If all states communicate, the MC is said to be irreducible
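For a finite state space these notions can be checked numerically. The sketch below computes n-step transition probabilities as matrix powers and tests irreducibility by accumulating P + P² + … up to the number of states, which is enough for a finite chain. The two small transition matrices and the helper names are made up for the example.

```python
def mat_mul(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(p, n):
    """n-step transition matrix P^n."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result

def is_irreducible(p):
    """True if every state j is accessible from every state i in 1..size steps."""
    size = len(p)
    reach = [[0.0] * size for _ in range(size)]
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(size):
        power = mat_mul(power, p)
        reach = [[reach[i][j] + power[i][j] for j in range(size)] for i in range(size)]
    return all(reach[i][j] > 0.0 for i in range(size) for j in range(size))

P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]          # hypothetical chain in which all states communicate
Q = [[1.0, 0.0],
     [0.5, 0.5]]               # hypothetical chain with an absorbing state 0
two_step = mat_pow(P, 2)
print(two_step[0][2], is_irreducible(P), is_irreducible(Q))
```

In the first chain state 2 is not reachable from state 0 in one step, but it is in two steps, so the two states still communicate.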


Period

The smallest $d_i$ for which $(P^{d_i})_{ii} > 0$ is called the period of state $i$. It follows that $(P^n)_{ii} > 0 \Leftrightarrow n = k \cdot d_i$ with $k \in \mathbb{N}$

If $i \leftrightarrow j$, then $d_i = d_j$

Corollary An irreducible MC has the same period for all its states.

Proof

$\exists n \in \mathbb{N}$ for which $(P^n)_{ij} > 0$ and $\exists m \in \mathbb{N}$ for which $(P^m)_{ji} > 0$

Now suppose that $(P^r)_{ii} > 0$, then $(P^{m+r+n})_{jj} \ge (P^m)_{ji} \cdot (P^r)_{ii} \cdot (P^n)_{ij} > 0$, hence we know that $r = k_i \cdot d_i$ and $n + r + m = k_j' \cdot d_j$.

We also have $(P^{2r})_{ii} \ge (P^r)_{ii} \cdot (P^r)_{ii} > 0$, hence $n + 2r + m = k_j'' \cdot d_j$, and so $r = (n + 2r + m) - (n + r + m) = (k_j'' - k_j') \cdot d_j = k_j \cdot d_j$.

So, any $r = k_i \cdot d_i$ can be written as $r = k_j \cdot d_j$

A similar argument leads to the conclusion that any $r = k_j \cdot d_j$ can be written as $r = k_i \cdot d_i$.

This is only possible if $d_j = d_i$


Transient states

Denote by $T_{ii}$ (a random variable) the first revisit, i.e., the first $n > 0$ so that $X_n = i$ given that $X_0 = i$

We know that $T_{ii} = k \cdot d_i$ and $P(T_{ii} = k \cdot d_i) > 0$.

Denote $V_{ii} = \sum_{k=1}^{\infty} I(X_{k \cdot d_i} = i) | \{X_0 = i\} = \sum_{n=1}^{\infty} I(X_n = i) | \{X_0 = i\}$

(with $I(A)$ the indicator function of event $A$)

A state $i$ is transient if $E(V_{ii}) < \infty$, that is, if an infinite number of steps in the Markov Chain leads at most to a finite number of visits to state $i$.

$E(V_{ii}) = \sum_{n=1}^{\infty} E(I(X_n = i) | X_0 = i) = \sum_{n=1}^{\infty} P(X_n = i | X_0 = i)$

So

$E(V_{ii}) < \infty \Leftrightarrow \sum_{n=1}^{\infty} P(X_n = i | X_0 = i) < \infty$

This is equivalent to $P(T_{ii} < \infty) < 1$


Transient states have a limited expected number of visits

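$E(V_{ii}) < \infty \Rightarrow P(T_{ii} < \infty) < 1$

Proof

Suppose that $P(T_{ii} < \infty) = 1$, and denote $T^{(r)}_{ii}$ the number of steps until the $r$th occurrence of state $i$. Then, because of the Markov condition, $T^{(r)}_{ii} = \sum_{\ell=1}^{r} T_{ii,\ell}$ with $T_{ii,\ell}$ IID observations from $T_{ii}$.

$P(T^{(r)}_{ii} < \infty) = P\left(\bigcap_{\ell=1}^{r} (T_{ii,\ell} < \infty)\right) = \prod_{\ell=1}^{r} P(T_{ii,\ell} < \infty) = 1$ for any finite $r$.

So $V_{ii} \ge r$, a.s. for any $r \in \mathbb{N}$, hence $E(V_{ii}) = \infty$. ∎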

This implies that µii = E(Tii) =∞ but the opposite does not hold (see next

slide).


Recurrent states

A state is called recurrent if it is not transient, i.e., if it is visited an infinite

number of times.

If the expected time until the first visit is infinite, i.e., if µii = E(Tii) = ∞, then

the state is called null-recurrent, otherwise it is called positive or ergodic.

A null-recurrent state is visited an infinite number of times, but the relative number of visits tends to zero: $E(V_{ii}) = \sum_{n=1}^{\infty} (P^n)_{ii} = \infty$ and $\frac{1}{N} \sum_{n=1}^{N} (P^n)_{ii} \to 0$

A positive state has $\frac{1}{N} \sum_{n=1}^{N} (P^n)_{ii} \to \frac{1}{\mu_{ii}}$


Proof (as of yet incomplete)

We prove that in a positive state, it holds that $\frac{1}{N} \sum_{n=1}^{N} P(X_n = i | X_0 = i) \to \frac{1}{E(T_{ii})}$

• Law of total probability + Markov condition for $n > 0$:

$P(X_n = i | X_0 = i) = \sum_{k=1}^{n} P(X_n = i | X_k = i) \cdot P(T_{ii} = k) = \sum_{k=1}^{n} P(X_{n-k} = i | X_0 = i) \cdot P(T_{ii} = k)$

Also, $P(X_n = i | X_0 = i) = 1$ for $n = 0$.

• If we define $t_k = P(T_{ii} = k)$, with $t_0 = 0$, and $p_n = P(X_n = i | X_0 = i)$, then we have $p_n = \sum_{k=1}^{n} p_{n-k} \cdot t_k = \sum_{k=0}^{n} p_{n-k} \cdot t_k$ for $n > 0$ and $p_0 = 1 \ne p_0 \cdot t_0 = 0$.

• Denoting $t = (t_k, k \in \mathbb{N})$ and $p = (p_n, n \in \mathbb{N})$, the sum above is the convolution of the sequences $t$ and $p$: $t * p$. Since the expression does not hold for $n = 0$, we have to correct with a Kronecker sequence $\delta_0 = (1, 0, 0, 0, \ldots)$. We get: $p = t * p + \delta_0$

• Denote $a(s) = \sum_{k=0}^{\infty} a_k s^k$, then the equation above becomes $p(s) = p(s) \cdot t(s) + 1 \Leftrightarrow p(s) = \frac{1}{1 - t(s)}$

• Since $t(1) = \sum_{k=1}^{\infty} P(T_{ii} = k) = P(T_{ii} < \infty)$, a recurrent Markov process has a singularity for $p(s)$ in $s = 1$. Further, $\lim_{s \to 1} (1-s) \cdot p(s) = \lim_{s \to 1} \frac{1-s}{1-t(s)} = \frac{1}{t'(1)}$ and $t'(1) = \sum_{k=1}^{\infty} k \cdot P(T_{ii} = k) = E(T_{ii}) = \mu_{ii}$

• On the other hand,

$\lim_{s \to 1} (1-s) \cdot p(s) = \lim_{u \to \infty} \frac{1}{u} \cdot p(1 - 1/u) = \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=0}^{\infty} p_k (1 - 1/n)^k$

$= \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=0}^{n} p_k + \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=0}^{n} p_k \left[(1 - 1/n)^k - 1\right] + \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=n+1}^{\infty} p_k (1 - 1/n)^k$

• ...


Equilibrium distribution

Theorem In an irreducible discrete time, discrete state space MC the states

are either all transient, all null-recurrent, or all positive (ergodic).

All finite state MC are positive

Denote $p_{n,i} = P(X_n = i)$, and row vector $p_n = (\ldots, p_{n,i}, \ldots)$, then $p_{n+1} = p_n \cdot P$, i.e.,

$p_{n+1,i} = \sum_{\ell \in \mathbb{Z}} P_{\ell i}\, p_{n,\ell}$

$P \cdot 1 = 1$ (because transition probabilities sum to one)

$\lambda = 1$ is an eigenvalue

The left eigenvector is an invariant or stationary or equilibrium distribution

$p \cdot P = p$
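For a small finite chain, an equilibrium distribution can be approximated by repeatedly applying the update p(n+1) = p(n) · P. The sketch below assumes a hypothetical three-state chain that is irreducible and has self-loops, so that the iteration converges; for a periodic chain this simple iteration would not settle and the eigenvector equation would have to be solved directly.

```python
def step(p, P):
    """One update p_{n+1,i} = sum over l of P[l][i] * p_{n,l} (row vector times matrix)."""
    size = len(P)
    return [sum(p[l] * P[l][i] for l in range(size)) for i in range(size)]

def equilibrium(P, n_steps=1000):
    """Iterate from the uniform distribution; assumes convergence of the chain."""
    size = len(P)
    p = [1.0 / size] * size
    for _ in range(n_steps):
        p = step(p, P)
    return p

P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]          # hypothetical transition matrix
p = equilibrium(P)
print(p)                       # for this chain the limit is (0.25, 0.5, 0.25)
print(step(p, P))              # invariant: one more step leaves p unchanged
```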


Reversed Markov Processes

If (Xn;n = 0, . . .) is a Markov Chain with transition matrix P and equilibrium

distribution p, then

$P(X_n = j | X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m) = P(X_n = j | X_{n+1} = i) = \frac{p_j}{p_i} \cdot P_{ji}$

Proof (Bayes, Chain rule for conditional probabilities and Markov Condition)

$P(X_n = j | X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m)$

$= \frac{P(X_n = j) \cdot P(X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m | X_n = j)}{P(X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m)}$

$= \frac{P(X_n = j) \cdot P(X_{n+1} = i | X_n = j) \cdot P(X_{n+2} = i_2, \ldots, X_{n+m} = i_m | X_{n+1} = i)}{P(X_{n+1} = i) \cdot P(X_{n+2} = i_2, \ldots, X_{n+m} = i_m | X_{n+1} = i)}$

$= \frac{P(X_n = j) \cdot P(X_{n+1} = i | X_n = j)}{P(X_{n+1} = i)} = P(X_n = j | X_{n+1} = i) = \frac{p_j \cdot P_{ji}}{p_i}$


Reversible Markov Processes

If there exists a distribution $p_i = P(X = i)$ that satisfies $\frac{p_j \cdot P_{ji}}{p_i} = P_{ij}$ then the Markov chain is called Reversible

Remark Reversibility thus means not that the reversed Markov process exists (it always exists), but that its transition probabilities for i → j are the same as the forward probabilities for the same transitions i → j (so NOT for j → i)

The distribution is then the equilibrium distribution. Indeed, from summation of $p_j \cdot P_{ji} = p_i P_{ij}$ we obtain: $\sum_j p_j \cdot P_{ji} = p_i \sum_j P_{ij} = p_i$, which is, in matrix form, $p \cdot P = p$, the invariant distribution equation.

The reversed process is of course the same.

The condition $\frac{p_j \cdot P_{ji}}{p_i} = P_{ij}$ is called the detailed balance equation (since it implies the “global” balance equation)
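Detailed balance is stronger than invariance, which a small numerical check can make visible. In the sketch below both example chains are hypothetical: the first satisfies detailed balance, the second (a mostly cyclic chain) has the uniform distribution as equilibrium but is not reversible.

```python
def is_invariant(p, P, tol=1e-12):
    """Global balance: p * P = p."""
    size = len(P)
    return all(abs(sum(p[j] * P[j][i] for j in range(size)) - p[i]) <= tol
               for i in range(size))

def satisfies_detailed_balance(p, P, tol=1e-12):
    """Detailed balance: p_j * P_ji = p_i * P_ij for all pairs (i, j)."""
    size = len(P)
    return all(abs(p[j] * P[j][i] - p[i] * P[i][j]) <= tol
               for i in range(size) for j in range(size))

P_rev = [[0.5, 0.5, 0.0],
         [0.25, 0.5, 0.25],
         [0.0, 0.5, 0.5]]
p_rev = [0.25, 0.5, 0.25]

P_cyc = [[0.1, 0.9, 0.0],
         [0.0, 0.1, 0.9],
         [0.9, 0.0, 0.1]]      # mostly moves 0 -> 1 -> 2 -> 0
p_cyc = [1.0 / 3.0] * 3

print(is_invariant(p_rev, P_rev), satisfies_detailed_balance(p_rev, P_rev))
print(is_invariant(p_cyc, P_cyc), satisfies_detailed_balance(p_cyc, P_cyc))
```

For the cyclic chain the reversed process runs mostly in the opposite direction 0 → 2 → 1 → 0, so it differs from the forward process even though the equilibrium distribution exists.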


2.2 Models for multivariate random variables

1. In the next slides we consider vectors X with multivariate distributions. We discuss two ways to define/fix any multivariate distribution

• Markov Random Field (MRF), which is a special case of a graphical model

• Gibbs Random Field (GRF)

2. The Markov property (dependence through adjacency) plays a role both on the level of the sampling process and on the level of the sampled multivariate random variable: Markov Chains for the sampling, Markov Random Fields for the sampled variable


2.2.1 Markov Random Field (MRF)

Given a multivariate random variable X, a graphical model can be used to

represent the intra-dependencies.

An undirected graph is a ordered pair of sets G = (V, E), where V =

{1, . . . , p} is the set of vertices, sites or nodes, which are here indices into

X. The set E contains the (undirected) edges in the graph, where an undi-

rected edge is an unordered pair of vertices.

In a Markov Random Field, two vertices i and j are connected by an edge if and only if the corresponding components of X are conditionally dependent, i.e., dependent given the values of all the other components:

$$P\left(X_i = x_i \mid \{X_1, \ldots, X_p\} \setminus \{X_i\}\right) \neq P\left(X_i = x_i \mid \{X_1, \ldots, X_p\} \setminus \{X_i, X_j\}\right)$$

The two sites are then called neighbours.

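A neighbourhood of i is defined as $\partial i = \{j \mid \{i, j\} \in E\}$. Formally, denoting by $2^V$ the set of all subsets of V, we have

$$\partial : V \to 2^V : i \mapsto \partial i = \{j \mid \{i, j\} \in E\}$$

Markov property: it holds that

$$P\left(X_i = x_i \mid X_{V \setminus \{i\}}\right) = P\left(X_i = x_i \mid X_{\partial i}\right)$$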


Examples of MRFs

• In principle any multidimensional probability distribution can be seen as a MRF. In general, all components are conditionally dependent, so $\partial i = \{1, \ldots, p\} \setminus \{i\}$

• A (finite sample from a) Markov Chain is also a MRF. Indeed, (thanks to the notion of reversed Markov Processes) $\partial i = \{i - 1, i + 1\}$

– Forward Markov Chain: • → • → • → •

– Reversed MC: • ← • ← • ← •

– MRF representation: • − • − • − •

• A two-dimensional MRF:

    • − • − • − •
    |   |   |   |
    • − • − • − •
    |   |   |   |
    • − • − • − •
    |   |   |   |
    • − • − • − •

– Dimension of random vector X is p = 16

– X has a 2D-geometric background

– Components of X can be represented with a 2D index: $X_s = X_{(i,j)}$


A short note on graphical models

Markov Random Fields are an example of graphical models

Graphical models are used to define or represent multivariate random vari-

ables X

MRF are undirected graphs, edges define neighbourhoods ∂i

MC (Markov Chains) are an example of Bayesian networks: directed, acyclic

graphs: Edges define parents of nodes par(i)

The construction of the joint probability in a directed graph is immediate:

$$f_X(x) = \prod_{i=1}^{p} f_{X_i \mid X_{\mathrm{par}(i)}}\left(x_i \mid x_{\mathrm{par}(i)}\right)$$

• when par(i) = Ø, then the conditional distribution should be interpreted as

the marginal distribution

• The construction is always possible because the graph is acyclic

• For MRF’s/undirected graphs, the joint pdf/pmf is not so straightforward, we

need the concept of Gibbs Random Fields (see slide 65)


Example of modelling by Bayesian networks

Let X = (X1, X2, X3), then

• the graph • ← • → • represents the situation where X1 and X3 are dependent, but, given the value of X2 (= conditionally), they are independent. The dependence occurs through X2

• the graph • → • ← • represents the situation where X1 and X3 are independent, but X2 depends on both. If X2 is observed, this gives information on both X1 and X3, so X1 and X3 are conditionally dependent. (By observation of X2, we learn about both X1 and X3)

These models are used, for instance, in studies of causality, and are popular

in several (other) domains of statistical learning
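The second graph (• → • ← •) can be illustrated with a small numerical sketch. The conditional table below is a hypothetical choice made only for illustration: X1 and X3 are independent fair coins, and X2 equals 1 with probability 0.9 when X1 and X3 differ, and 0.1 otherwise. The joint pmf is built with the directed-graph factorisation, after which marginal independence and conditional dependence are checked.

```python
from itertools import product

def joint_collider(x1, x2, x3):
    """Joint pmf of X1 -> X2 <- X3, built as f(x1) * f(x3) * f(x2 | x1, x3)."""
    p_x1 = 0.5                               # X1 has no parents: marginal
    p_x3 = 0.5                               # X3 has no parents: marginal
    p_x2_is_1 = 0.9 if x1 != x3 else 0.1     # hypothetical conditional table
    p_x2 = p_x2_is_1 if x2 == 1 else 1.0 - p_x2_is_1
    return p_x1 * p_x3 * p_x2

states = list(product([0, 1], repeat=3))

def prob(event):
    """Probability of the set of states for which event(state) is true."""
    return sum(joint_collider(*s) for s in states if event(s))

# Marginally, X1 and X3 are independent.
p1 = prob(lambda s: s[0] == 1)
p3 = prob(lambda s: s[2] == 1)
p13 = prob(lambda s: s[0] == 1 and s[2] == 1)

# Given X2 = 1, they are dependent.
p2 = prob(lambda s: s[1] == 1)
p1_given_2 = prob(lambda s: s[0] == 1 and s[1] == 1) / p2
p3_given_2 = prob(lambda s: s[2] == 1 and s[1] == 1) / p2
p13_given_2 = prob(lambda s: s == (1, 1, 1)) / p2

print(p13, p1 * p3)                          # equal: independence
print(p13_given_2, p1_given_2 * p3_given_2)  # different: conditional dependence
```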


2.2.2 Gibbs Random Field (GRF)

Let X be a multivariate random variable of dimension p, and let E be a set of edges defined on V = {1, . . . , p}.

Unlike in MRF, the edges in a GRF are not defined on the basis of a conditional probability. They are used to define the global probability, as follows:

A clique (or complete subset) is defined as

$$C \subset V \text{ is a clique} \Leftrightarrow \forall i \in C : C \subset \{i\} \cup \partial i$$

The set of cliques is denoted as $\mathcal{C} = \{C \subset V \mid \forall i \in C : C \subset \{i\} \cup \partial i\}$

A probability distribution that can be decomposed into factors associated with the cliques is called a Gibbs Random Field (GRF)

$$f_X(x) \text{ is a GRF} \Leftrightarrow f_X(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} f_C(x_C) = \frac{1}{Z} \exp\left(-\sum_{C \in \mathcal{C}} H_C(x_C)\right)$$

The functions $H_C(x_C)$ are (up to a constant) the negative logarithms of $f_C(x_C)$. They are called clique potentials. The normalizing constant Z is called a partition function.
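The clique definition can be applied literally to a small graph. The sketch below is a brute-force enumeration (exponential in p, so only meant for tiny examples); the helper names `neighbourhoods` and `cliques` are illustrative, and the empty set, which satisfies the definition trivially, is skipped.

```python
from itertools import combinations

def neighbourhoods(vertices, edges):
    """Map each vertex i to its neighbourhood: all j with {i, j} in E."""
    nb = {i: set() for i in vertices}
    for i, j in edges:
        nb[i].add(j)
        nb[j].add(i)
    return nb

def cliques(vertices, edges):
    """All non-empty C with C contained in {i} | nb[i] for every i in C."""
    vertices = list(vertices)
    nb = neighbourhoods(vertices, edges)
    found = []
    for size in range(1, len(vertices) + 1):
        for C in combinations(vertices, size):
            if all(set(C) <= {i} | nb[i] for i in C):
                found.append(frozenset(C))
    return found

# Markov chain 1 - 2 - 3 - 4: singletons and adjacent pairs only.
chain_cliques = cliques([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])

# Triangle: the full vertex set is a clique as well.
triangle_cliques = cliques([1, 2, 3], [(1, 2), (2, 3), (1, 3)])

print(sorted(sorted(C) for C in chain_cliques))
print(sorted(sorted(C) for C in triangle_cliques))
```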


Gibbs Random Field - further discussion

Use of GRF’s

• GRF’s can be used to define a joint probability on an undirected graph

• MRF’s represent local, conditional probabilities

• The Hammersley-Clifford theorem (slide 69) establishes the connection between GRF's and MRF's

Examples of GRF’s

• In principle any multidimensional probability distribution can be seen as

a GRF. In general, all components are conditionally dependent, and the

cliques are all subsets of V . All clique potentials are zero, except for C = V ,

whose potential is HV (x) = − log(fX(x)).

• Ising model (see slide 67)


Example of GRF: Ising model

A two-dimensional lattice $\{(i, j) \mid 0 \le i \le m, 0 \le j \le n\}$ (see slide 62) can be equipped with a neighbourhood system by defining for each internal site

$$\partial(i, j) = \{(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)\}$$

The cliques are then singletons and (horizontal and vertical) pairs of sites

$$\mathcal{C} = \big\{\{(i, j)\}\big\} \cup \big\{\{(i, j), (i + 1, j)\}\big\} \cup \big\{\{(i, j), (i, j + 1)\}\big\}$$

In the case where the observations are binary, say $X_{(i,j)} \in \{-1, 1\}$, a popular GRF model is the Ising model

$H_C(x_C) = \tau \cdot x_{C,1} \cdot x_{C,2}$ for the pairs and $H_s(x_s) = \gamma \cdot x_s$ for the singletons.

The pair potentials express the interaction between adjacent sites, while the singleton potentials express a drift towards one of the two states.
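To make the clique potentials concrete, here is a minimal sketch that sums the Ising potentials over all singleton and pair cliques of a small lattice and computes the partition function by brute force. The function names are illustrative; the sign convention follows the text, $f_X(x) \propto \exp(-\sum_C H_C(x_C))$, and the brute-force sum is only feasible for very small lattices.

```python
import math
from itertools import product

def ising_energy(x, tau, gamma):
    """Sum of all clique potentials H_C for a lattice x (list of rows, entries +1/-1)."""
    m, n = len(x), len(x[0])
    h = 0.0
    for i in range(m):
        for j in range(n):
            h += gamma * x[i][j]                    # singleton clique {(i, j)}
            if i + 1 < m:
                h += tau * x[i][j] * x[i + 1][j]    # vertical pair clique
            if j + 1 < n:
                h += tau * x[i][j] * x[i][j + 1]    # horizontal pair clique
    return h

def partition_function(m, n, tau, gamma):
    """Z = sum over all 2^(m*n) configurations of exp(-energy)."""
    z = 0.0
    for flat in product([-1, 1], repeat=m * n):
        x = [list(flat[r * n:(r + 1) * n]) for r in range(m)]
        z += math.exp(-ising_energy(x, tau, gamma))
    return z

print(ising_energy([[1, 1], [1, 1]], tau=-1.0, gamma=0.0))
print(partition_function(2, 2, tau=0.0, gamma=0.0))
```

The number of terms in Z doubles with every added site, which is why samplers that avoid computing Z are attractive for this model.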


2.2.3 The Hammersley-Clifford Theorem: conditions

MRFs are defined by conditional probabilities, based on a neighbourhood sys-

tem.

GRFs are defined by a joint probability, decomposed into clique potentials.

The Hammersley-Clifford Theorem states that under mild conditions, both def-

initions are equivalent, i.e., a MRF is also a GRF and vice versa.

Two important conditions: existence of joint pdf + positivity

Existence of fX(x): See slide 77

Positivity condition

A probability distribution is said to satisfy the positivity condition if

$\forall i = 1, \ldots, p : f_{X_i}(x_i) > 0$ implies that for $x = (x_1, \ldots, x_i, \ldots, x_p)$ we have $f_X(x) > 0$

A distribution that violates this condition is the uniform distribution on the unit disk: $f_X(0.9, 0.8) = 0$ although $f_{X_1}(0.9) > 0$ and $f_{X_2}(0.8) > 0$


The Hammersley-Clifford Theorem

Theorem

If fX(x) exists and satisfies the positivity condition, then X is a MRF

with neighbourhood system ∂ if and only if it is a GRF whose cliques

C follow from the neighbourhood system ∂.


⇐: GRF → MRF

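Suppose X is a GRF with cliques $\mathcal{C}$ based on neighbourhood system $\partial$. Further denote $I = \{1, \ldots, p\}$, and $i\partial i = \{i\} \cup \partial i$. Let $\mathcal{C}_i = \{C \in \mathcal{C} \mid i \in C\}$ be the cliques that contain site i.

Then

$$P(X_i = x_i \mid X_{I \setminus \{i\}} = x_{I \setminus \{i\}}) = \frac{P(X_i = x_i, X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}{P(X_{I \setminus \{i\}} = x_{I \setminus \{i\}})} = \frac{P(X_i = x_i, X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}{\sum_{y_i} P(X_i = y_i, X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}$$

$$= \frac{\prod_{C \in \mathcal{C}_i} f_C(x_i, x_{C \cap \partial i}) \cdot \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_C)}{\sum_{y_i} \prod_{C \in \mathcal{C}_i} f_C(y_i, x_{C \cap \partial i}) \cdot \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_C)}$$

$$= \frac{\prod_{C \in \mathcal{C}_i} f_C(x_i, x_{C \cap \partial i})}{\sum_{y_i} \prod_{C \in \mathcal{C}_i} f_C(y_i, x_{C \cap \partial i})}$$

$$= \frac{\prod_{C \in \mathcal{C}_i} f_C(x_i, x_{C \cap \partial i})}{\sum_{y_i} \prod_{C \in \mathcal{C}_i} f_C(y_i, x_{C \cap \partial i})} \cdot \frac{\sum_{y_{I \setminus i\partial i}} \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_{C \cap \partial i}, y_{C \setminus \partial i})}{\sum_{y_{I \setminus i\partial i}} \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_{C \cap \partial i}, y_{C \setminus \partial i})}$$

$$= \frac{\sum_{y_{I \setminus i\partial i}} \prod_{C \in \mathcal{C}} f_C(x_{C \cap i\partial i}, y_{C \setminus i\partial i})}{\sum_{y_{I \setminus \partial i}} \prod_{C \in \mathcal{C}} f_C(x_{C \cap \partial i}, y_{C \setminus \partial i})}$$

$$= \frac{P(X_{i\partial i} = x_{i\partial i})}{P(X_{\partial i} = x_{\partial i})} = P(X_i = x_i \mid X_{\partial i} = x_{\partial i})$$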


The construction of a GRF out of a MRF

For the other direction (from MRF to GRF) we need a few auxiliary definitions and

results.

Let $g : \mathbb{R}^p \to \mathbb{R} : x \mapsto g(x)$ be a given function and let $o \in \mathbb{R}^p$ be a reference state for which $g(o) > 0$. Then define for each $A \subset I = \{1, \ldots, p\}$ the function

$G_A(x) = g(u(x))$ where $u : \mathbb{R}^p \to \mathbb{R}^p$ and $u_i = x_i$ if $i \in A$ and $u_i = o_i$ if $i \in I \setminus A$

Further define

$$H_A(x) = \sum_{B \subseteq A} (-1)^{\#(A \setminus B)} G_B(x)$$

Then we have the following results

• $H_\emptyset(x)$ is a constant: $H_\emptyset(x) = g(o), \forall x$

• $H_A(x)$ does not depend on the components of x with index outside A: if $x_A = y_A$, then $H_A(x) = H_A(y)$

• If one of the components of x with index in A takes the corresponding reference value, then $H_A(x) = 0$: for $A \neq \emptyset$, if $x_i = o_i$ for at least one $i \in A$, then $H_A(x) = 0$


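Proof. Define

$$\mathcal{B}_i = \{B \subset A \mid i \notin B\}, \quad \bar{B} = B \cup \{i\}, \quad \bar{\mathcal{B}}_i = \{\bar{B} = B \cup \{i\} \mid B \in \mathcal{B}_i\},$$

then $\mathcal{B}_i$ and $\bar{\mathcal{B}}_i$ constitute an equal partition of $2^A = \{B \subset A\}$.

For a pair $\{B, \bar{B} = B \cup \{i\}\}$, and for any x with $x_i = o_i$, we have that $G_B(x) = G_{\bar{B}}(x)$, while $\#(A \setminus \bar{B}) = \#(A \setminus B) - 1$, and so

$$H_A(x) = \sum_{B \in \mathcal{B}_i} \left[(-1)^{\#(A \setminus B)} G_B(x) + (-1)^{\#(A \setminus \bar{B})} G_{\bar{B}}(x)\right] = 0$$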

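• (Möbius Inversion) $g(x) = G_I(x) = \sum_{A \subseteq I} H_A(x)$

Proof

$$\sum_{A \subseteq I} H_A(x) = \sum_{A \subseteq I} \sum_{B \subseteq A} (-1)^{\#(A \setminus B)} G_B(x) = \sum_{B \subseteq I} G_B(x) \sum_{A : B \subseteq A \subseteq I} (-1)^{\#(A \setminus B)}$$

(We have switched the order of summations and moved $G_B(x)$ forward)

Denote $D = A \setminus B$, then $B \subseteq A \subseteq I \Leftrightarrow \emptyset \subseteq D \subseteq I \setminus B$, and so we get

$$\sum_{A \subseteq I} H_A(x) = \sum_{B \subseteq I} G_B(x) \sum_{D \subseteq I \setminus B} (-1)^{\#D}$$

Unless $B = I$, the number of subsets $D \subseteq I \setminus B$ is even, and exactly half of those subsets have an even $\#D$, and the other half have an odd $\#D$. Hence all but one of the terms in the outer sum are zero, leading to

$$\sum_{A \subseteq I} H_A(x) = G_I(x) = g(x)$$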
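The three results and the Möbius inversion can be verified numerically for a small p. The sketch below uses assumptions that are not in the slides: indices are 0-based (I = {0, 1, 2}), each component takes values in {0, 1, 2}, the reference state is o = (0, 0, 0), and g is an arbitrary table of random positive numbers.

```python
from itertools import combinations, product
import random

p = 3
I = frozenset(range(p))      # 0-based indices instead of 1..p
o = (0, 0, 0)                # reference state

rng = random.Random(0)
table = {x: rng.uniform(0.5, 2.0) for x in product(range(3), repeat=p)}

def g(x):
    """Arbitrary test function on {0, 1, 2}^3."""
    return table[x]

def subsets(A):
    items = sorted(A)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def G(A, x):
    """G_A(x) = g(u) with u_i = x_i for i in A and u_i = o_i otherwise."""
    return g(tuple(x[i] if i in A else o[i] for i in range(p)))

def H(A, x):
    """H_A(x): alternating sum of G_B(x) over all subsets B of A."""
    return sum((-1) ** len(A - B) * G(B, x) for B in subsets(A))

# Moebius inversion: the H_A sum up to g.
inversion_ok = all(abs(sum(H(A, x) for A in subsets(I)) - g(x)) < 1e-9
                   for x in table)

# H_A(x) vanishes as soon as x_i = o_i for some i in a non-empty A.
vanishing_ok = all(abs(H(A, x)) < 1e-9
                   for x in table for A in subsets(I)
                   if A and any(x[i] == o[i] for i in A))

print(inversion_ok, vanishing_ok)
```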


Proof of Hammersley-Clifford ⇒ MRF → GRF

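Theorem

If $g(x) = -\log f_X(x)$ where $f_X(x)$ is the joint probability distribution of a MRF on x with cliques $\mathcal{C}$, then in the construction above $H_A(x) = 0$ if $A \notin \mathcal{C}$.

Proof

Suppose that $A \notin \mathcal{C}$, then there must be two elements, say $i, j \in A$, so that $i \notin \partial j$ and vice versa.

For the given i, define as before

$$\mathcal{B}_i = \{B \subset A \mid i \notin B\}, \quad \bar{B} = B \cup \{i\}, \quad \bar{\mathcal{B}}_i = \{\bar{B} = B \cup \{i\} \mid B \in \mathcal{B}_i\}$$

Then

$$H_A(x) = \sum_{B \in \mathcal{B}_i} \left[(-1)^{\#(A \setminus B)} G_B(x) + (-1)^{\#(A \setminus \bar{B})} G_{\bar{B}}(x)\right] = \sum_{B \in \mathcal{B}_i} (-1)^{\#(A \setminus \bar{B})} \left[G_{\bar{B}}(x) - G_B(x)\right]$$

Denoting $\bar{u} = (x_{\bar{B}}\, o_{I \setminus \bar{B}})$, we have

$$G_{\bar{B}}(x) = -\log f_X(\bar{u}) = -\log\left[f_{X_{I \setminus \{i\}}}(\bar{u}_{I \setminus \{i\}}) \cdot f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid \bar{u}_{I \setminus \{i\}})\right] = -\log f_{X_{I \setminus \{i\}}}(\bar{u}_{I \setminus \{i\}}) - \log f_{X_i \mid X_{\partial i}}(x_i \mid \bar{u}_{\partial i})$$

Denoting $u = (x_B\, o_{I \setminus B})$, we see that $u$ and $\bar{u}$ differ only in i, so $u_{I \setminus \{i\}} = \bar{u}_{I \setminus \{i\}}$, and so we can write

$$G_B(x) = -\log f_{X_{I \setminus \{i\}}}(u_{I \setminus \{i\}}) - \log f_{X_i \mid X_{\partial i}}(o_i \mid u_{\partial i})$$

The difference between both is then

$$G_{\bar{B}}(x) - G_B(x) = -\log f_{X_i \mid X_{\partial i}}(x_i \mid u_{\partial i}) + \log f_{X_i \mid X_{\partial i}}(o_i \mid u_{\partial i})$$

The common term that was annihilated depended on index j, but what remains does not, as $j \notin \partial i$. Hence none of the terms in $H_A(x)$ depends on the value of $x_j$, and so $H_A(x) = H_A(y)$, where $y_\ell = x_\ell$ for $\ell \neq j$ and $y_j = o_j$. We have seen that for such an argument $H_A(y) = 0$, from which the proof follows.

The proof assumes positivity because the annihilations that take place are implicitly based on ratios of probabilities (differences of log-probabilities), which are all assumed to be nonzero.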


Importance of Hammersley-Clifford in MCMC

The constructive proof of Hammersley-Clifford shows that the conditional probabilities in a Markov model allow us to construct the joint distribution as

$$f_X(x) = \frac{1}{Z} \cdot \exp\left(-\sum_{C \in \mathcal{C}} H_C(x_C)\right)$$

where for a chosen $i \in C$, and a reference state o

$$H_C(x_C) = \sum_{B \subseteq C \mid i \in B} (-1)^{\#(C \setminus B)} \log\left(\frac{f_{X_i \mid X_{\partial i}}(o_i \mid u_{\partial i})}{f_{X_i \mid X_{\partial i}}(x_i \mid u_{\partial i})}\right)$$

where, for each B in the sum, $u_j = x_j$ if $j \in B$ and $u_j = o_j$ if $j \notin B$. The partition function Z follows from the choices of o and $i \in C$.

Hammersley-Clifford thus states that conditional probabilities in a Markov model are sufficient to define the joint probability of a random vector.

This is unlike marginal probabilities: they do not uniquely fix the joint probability (as they contain no information about the dependence structure)


A construction without cliques

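In some applications (such as the one we will need), the clique potentials are just an intermediate result. It is possible to construct the joint distribution directly from the conditional distributions, however, without proving that it factorizes into clique potential functions.

Simplified theorem (no cliques)

$$\frac{f_X(x)}{f_X(o)} = \prod_{i=1}^{p} \frac{f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}{f_{X_i \mid X_{I \setminus \{i\}}}(o_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}$$

Or, otherwise stated

$$f_X(x) \propto \prod_{i=1}^{p} \frac{f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}{f_{X_i \mid X_{I \setminus \{i\}}}(o_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}$$

Proof

We start from the right-hand side

$$\prod_{i=1}^{p} \frac{f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}{f_{X_i \mid X_{I \setminus \{i\}}}(o_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})} = \prod_{i=1}^{p} \frac{f_X(o_{\{1,\ldots,i-1\}}\, x_{\{i,\ldots,p\}}) / f_{X_{I \setminus \{i\}}}(o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}{f_X(o_{\{1,\ldots,i\}}\, x_{\{i+1,\ldots,p\}}) / f_{X_{I \setminus \{i\}}}(o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}$$

Within each factor, the marginal densities $f_{X_{I \setminus \{i\}}}$ cancel. Every numerator $f_X(o_{\{1,\ldots,i-1\}}\, x_{\{i,\ldots,p\}})$ then cancels against the denominator of the previous factor, leaving us with the first numerator, $f_X(x)$, and the last denominator, $f_X(o)$, which is exactly the expression on the left-hand side.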
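The simplified theorem can be checked on a small discrete example. The sketch below assumes a hypothetical strictly positive joint pmf on {0, 1}^3 (random weights, 0-based indices, reference state o = (0, 0, 0)); it computes the full conditionals from the joint and then rebuilds the ratio f(x)/f(o) from those conditionals only.

```python
from itertools import product
import random

rng = random.Random(1)
p = 3
states = list(product([0, 1], repeat=p))
weights = {s: rng.uniform(0.5, 1.5) for s in states}
total = sum(weights.values())
f = {s: w / total for s, w in weights.items()}   # strictly positive joint pmf

def conditional(i, value, state):
    """f(X_i = value | all other components as in state); state[i] is ignored."""
    num = f[state[:i] + (value,) + state[i + 1:]]
    den = sum(f[state[:i] + (v,) + state[i + 1:]] for v in (0, 1))
    return num / den

def ratio_from_conditionals(x, o):
    """Right-hand side of the simplified theorem."""
    r = 1.0
    for i in range(p):
        mixed = o[:i] + (x[i],) + x[i + 1:]      # (o_1..o_{i-1}, . , x_{i+1}..x_p)
        r *= conditional(i, x[i], mixed) / conditional(i, o[i], mixed)
    return r

o = (0, 0, 0)
max_error = max(abs(ratio_from_conditionals(x, o) - f[x] / f[o]) for x in states)
print(max_error)   # close to zero, up to rounding
```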


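Note on the existence of a joint distribution

Note: Hammersley-Clifford does not guarantee existence of the joint distribution, but if it exists, it is well defined by the conditional probabilities.

Example: Consider $X_1 \mid X_2 \sim \exp(\lambda X_2)$ and $X_2 \mid X_1 \sim \exp(\lambda X_1)$, then according to the construction above, we find that

$$f_{(X_1, X_2)}(x_1, x_2) \propto \frac{f_{X_1 \mid X_2}(x_1 \mid x_2)}{f_{X_1 \mid X_2}(o_1 \mid x_2)} \cdot \frac{f_{X_2 \mid X_1}(x_2 \mid o_1)}{f_{X_2 \mid X_1}(o_2 \mid o_1)} = \frac{\lambda x_2 e^{-\lambda x_2 \cdot x_1}}{\lambda x_2 e^{-\lambda x_2 \cdot o_1}} \cdot \frac{\lambda o_1 e^{-\lambda o_1 \cdot x_2}}{\lambda o_1 e^{-\lambda o_1 \cdot o_2}} \propto e^{-\lambda x_2 \cdot x_1}$$

The function $\exp(-\lambda x_2 x_1)$ has no finite integral on $[0, \infty[ \times [0, \infty[$, and therefore it cannot be normalized to be a (2D) density function.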
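The divergence can be made explicit by integrating over $x_1$ first (a short worked step, assuming $\lambda > 0$):

```latex
\[
\int_0^\infty \int_0^\infty e^{-\lambda x_1 x_2}\, dx_1\, dx_2
= \int_0^\infty \frac{1}{\lambda x_2}\, dx_2
= \infty
\]
```

The inner integral equals $1/(\lambda x_2)$, which is integrable neither near 0 nor near infinity.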


From HC to Markov Chain Monte Carlo

• Sample from the conditional distributions in a MRF X (= any multivariate random variable)

• This creates a sequence of samples $\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3, \ldots$ that form a Markov chain of Markov Random Fields


2.3 MCMC samplers for integration

2.3.1 The Gibbs sampler

Suppose X is a p-dimensional random vector, and we can sample from the conditional densities $f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{I \setminus \{i\}}) = f_{X_i \mid X_{\partial i}}(x_i \mid x_{\partial i})$

Then we construct the following sampler

Set initial values $\mathbf{x}_0 = (x_{0,1}, \ldots, x_{0,p})$

for n = 1, 2, . . .

    for i = 1, . . . , p

        Draw $X_{n,i} \sim f_{X_i \mid X_{I \setminus \{i\}}}(x \mid x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p})$

The Gibbs sampler consists of loops defined by conditional distributions. Therefore, the sampler is based on the description of $f_X(x)$ as a Markov random field. Moreover, the sequence can be seen as a Markov Chain.

So, the Gibbs sampler does NOT rely on the description of $f_X(x)$ as a Gibbs random field. GRF will be at the basis of the Metropolis-Hastings sampler on slide 90
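The two nested loops can be sketched in a few lines for a discrete state space. This is a minimal illustration, not a general-purpose implementation: the function `gibbs_sampler`, the two-component target table and the seed are all hypothetical choices. The conditional of component i is obtained by evaluating the (possibly unnormalised) joint at every candidate value with the other components fixed.

```python
import random
from collections import Counter

def gibbs_sampler(f, values, x0, n_iter, rng):
    """Systematic-scan Gibbs sampler for a discrete joint pmf f (tuple -> weight)."""
    x = list(x0)
    chain = []
    for _ in range(n_iter):                  # outer loop over n
        for i in range(len(x)):              # inner loop over components
            weights = []
            for v in values:
                x[i] = v
                weights.append(f(tuple(x)))  # proportional to f(X_i = v | rest)
            x[i] = rng.choices(values, weights=weights)[0]
        chain.append(tuple(x))               # state after a full sweep
    return chain

# Hypothetical target with positively dependent components.
target = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

chain = gibbs_sampler(lambda x: target[x], [0, 1], (0, 0), 20000,
                      random.Random(42))
freq = {s: c / len(chain) for s, c in Counter(chain).items()}
for s in sorted(target):
    print(s, target[s], round(freq.get(s, 0.0), 3))
```

The empirical frequencies of the states after each sweep should be close to the target probabilities.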


Invariant distribution

On slide 81, we prove:

The joint distribution fX(x) is invariant under the loops of a Gibbs-sampler

We consider the sequence of states after each outer loop (i.e., iterations over

n), not the inner loops (over the vector components).

We consider the case of a discrete state space.

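Lemma: The transition probabilities over the outer loops satisfy

$$f_{\mathbf{X}_{n+1} \mid \mathbf{X}_n}(\mathbf{x} \mid \mathbf{v}) = \prod_{i=1}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p})$$

Proof (discrete case)

This follows from the chain rule

$$P\left(\bigcap_{i=1}^{p} A_i \,\middle|\, B\right) = \prod_{i=1}^{p} P\left(A_i \,\middle|\, \bigcap_{j=1}^{i-1} A_j \cap B\right)$$

where in our case $A_i = \{X_{n+1;i} = x_i\}$ and $B = \{\mathbf{X}_n = \mathbf{v}\}$ □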


Invariant distribution: proof

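We now consider the case of a discrete state space, and suppose that $f_{\mathbf{X}_n}(\mathbf{x}) = f_X(\mathbf{x})$, then

$$f_{\mathbf{X}_{n+1}}(\mathbf{x}) = \sum_{\mathbf{v}} f_{\mathbf{X}_{n+1} \mid \mathbf{X}_n}(\mathbf{x} \mid \mathbf{v}) \cdot f_{\mathbf{X}_n}(\mathbf{v})$$

$$= \sum_{\mathbf{v}} \prod_{i=1}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p}) \cdot f_X(\mathbf{v})$$

$$= \sum_{v_p} \cdots \sum_{v_1} f_X(\mathbf{v}) \cdot f_{X_1 \mid X_{I \setminus \{1\}}}(x_1 \mid v_{2,\ldots,p}) \cdot \prod_{i=2}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p})$$

$$= \sum_{v_p} \cdots \sum_{v_2} f_{X_{2,\ldots,p}}(v_{2,\ldots,p}) \cdot f_{X_1 \mid X_{I \setminus \{1\}}}(x_1 \mid v_{2,\ldots,p}) \cdot \prod_{i=2}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p})$$

$$= \sum_{v_p} \cdots \sum_{v_2} f_X(x_1\, v_{2,\ldots,p}) \cdot \prod_{i=2}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p})$$

$$= \cdots = f_X(\mathbf{x})$$

In the expressions above, we used that

$$\sum_{v_1} f_X(\mathbf{v}) = f_{X_{2,\ldots,p}}(v_{2,\ldots,p})$$

NB: The notation $X_{2,\ldots,p}$ refers to the components of X, not to successive Markov Chain realisations like in $\mathbf{X}_n$.

We then used $f_{X_{2,\ldots,p}}(v_{2,\ldots,p}) \cdot f_{X_1 \mid X_{I \setminus \{1\}}}(x_1 \mid v_{2,\ldots,p}) = f_X(x_1\, v_{2,\ldots,p})$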


Reversibility

The proof of the invariance property of $f_X(\mathbf{x})$ w.r.t. the Gibbs sampler established a global balance equation, not a detailed balance equation. A detailed balance equation is necessary for reversibility.

The Gibbs sampler as a whole is not reversible, meaning

$$f_{\mathbf{X}_{n-1} \mid \mathbf{X}_n}(\mathbf{x}_{n-1} \mid \mathbf{x}_n) \neq f_{\mathbf{X}_{n+1} \mid \mathbf{X}_n}(\mathbf{x}_{n-1} \mid \mathbf{x}_n)$$

The probability that we come from $\mathbf{x}_{n-1}$, given that we are in $\mathbf{x}_n$, differs from the probability that we arrive in $\mathbf{x}_{n-1}$, given that we are in $\mathbf{x}_n$.

Each substep (inner loop) on its own is reversible. That is, if we have generated a new ith component $x_i$, we could “undo” that step (“undo” in the probabilistic sense, that is). In order to undo the complete Gibbs iteration step, the substeps have to be followed in reverse order.

One can prove that a reversible Gibbs sampler can be constructed by randomizing the order of substeps.
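Both statements (global balance for the full sweep, detailed balance only after randomisation) can be checked exactly on a tiny example. The sketch below assumes a hypothetical joint pmf on {0, 1}^2; `sweep_kernel` implements the product formula of the lemma for one systematic sweep, and `random_scan_kernel` is one simple way of randomising, assumed here for illustration: a single uniformly chosen component is updated per step.

```python
f = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # hypothetical joint pmf
states = list(f)
p = 2

def cond(i, x):
    """Full conditional of component i, evaluated at the full state x."""
    den = sum(f[x[:i] + (v,) + x[i + 1:]] for v in (0, 1))
    return f[x] / den

def sweep_kernel(v, x):
    """P(v -> x) for one systematic sweep: product over i of the conditionals."""
    prob = 1.0
    for i in range(p):
        prob *= cond(i, x[:i + 1] + v[i + 1:])   # (x_1..x_i, v_{i+1}..v_p)
    return prob

def random_scan_kernel(v, x):
    """P(v -> x) when one uniformly chosen component is updated."""
    prob = 0.0
    for i in range(p):
        if x[:i] + x[i + 1:] == v[:i] + v[i + 1:]:   # only component i may differ
            prob += cond(i, x) / p
    return prob

def global_balance(kernel):
    return all(abs(sum(f[v] * kernel(v, x) for v in states) - f[x]) < 1e-12
               for x in states)

def detailed_balance(kernel):
    return all(abs(f[v] * kernel(v, x) - f[x] * kernel(x, v)) < 1e-12
               for v in states for x in states)

print(global_balance(sweep_kernel), detailed_balance(sweep_kernel))
print(global_balance(random_scan_kernel), detailed_balance(random_scan_kernel))
```

In this example the systematic sweep leaves f invariant without satisfying detailed balance, whereas the random-scan version satisfies both.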


Convergence

Under a mild assumption (positivity of $f_X(\mathbf{x})$), the Gibbs sampler creates a Markov chain for which $\mathbf{X}_n \xrightarrow{\text{dist}} \mathbf{X} \sim f_X$

If the Gibbs sampler Markov chain is irreducible and recurrent, then for any integrable function $h(\mathbf{x})$ we have

$$\frac{1}{M} \sum_{n=1}^{M} h(\mathbf{X}_n) \xrightarrow{P} E[h(\mathbf{X})]$$


Foundations for MCMC

MCMC is used for sampling from multidimensional random variables. It has two aspects

• Sampling proceeds through conditional probabilities/densities

• The subsequent samples are dependent → Markov Chain

We have to make sure that

• Conditionals define the correct joint distribution in a unique way: Hammersley-Clifford

• Markov chain convergence takes over the role that the law of large numbers plays for independent samples

– The target joint distribution is invariant under the Gibbs sampler Markov Chain

– The chain converges to the invariant distribution

– Although convergence is a limit property, all generated samples of a Gibbs sampler can be used in estimating the expected value of h(X).


An example from Bayesian statistics

A Hidden Markov Random Field (HMM - Hidden Markov Model)

Suppose that we have the following graphical model for observations Y

    Y   •   •   •   •   •   •
        |   |   |   |   |   |
    M   •   •   •   •   •   •
        |   |   |   |   |   |
    X   • − • − • − • − • − •

• We observe Y, where $Y_i$ and $Y_j$ are dependent, but conditioned on the hidden or latent states $X_i$ and $X_j$ they are independent.

• The observation consists of two parts: the real signal (expression) M and the noise Y − M. Goal: inference on $f_{M \mid Y}(m \mid y)$

• The latent state is a binary label: $X_i = +1$ means that $M_i$ is probably large, $X_i = -1$ means that $M_i$ is probably small.

• Large values of $M_i$ are clustered


A formalisation of the graphical model

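Suppose $X \in \{-1, 1\}^p \sim \text{Ising}(\tau, \gamma)$, that is

$$P(X = x) = \frac{1}{T} \cdot \exp\left[-\tau \sum_{i=2}^{p} x_i x_{i-1}\right] \cdot \exp\left[-\gamma \sum_{i=1}^{p} x_i\right]$$

with partition function

$$T = \sum_{x \in \{-1, 1\}^p} \exp\left[-\tau \sum_{i=2}^{p} x_i x_{i-1}\right] \cdot \exp\left[-\gamma \sum_{i=1}^{p} x_i\right]$$

We observe $Y_i = M_i + V_i$, with $V_i$ independent normal observational errors with zero mean and common variance $\sigma^2$ and $M_i$ a mixture:

$$M_i = \frac{1 - X_i}{2} \cdot R_i + \frac{1 + X_i}{2} \cdot S_i \quad \text{with } R_i \sim N(0, \kappa^2) \text{ and } S_i \sim N(0, K^2)$$

and all these are independent.

The hyperparameters $\gamma, \tau, K^2, \kappa^2, \sigma^2$ are assumed to be known.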


Bayesian inference 1: posterior law of total probability

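We want to know $E(M \mid Y)$

The posterior total probability is

$$f_{M_i \mid Y}(m \mid y) = f_{M_i \mid X_i = -1; Y}(m \mid y) \cdot P(X_i = -1 \mid Y = y) + f_{M_i \mid X_i = 1; Y}(m \mid y) \cdot P(X_i = 1 \mid Y = y)$$

The only dependence between components of Y lies in the Hidden Gibbs/Ising random field, so

$$f_{M_i \mid X_i = \pm 1; Y}(m \mid y) = f_{M_i \mid X_i = \pm 1; Y_i}(m \mid y_i)$$

Filling in leads to

$$f_{M_i \mid Y}(m \mid y) = f_{M_i \mid X_i = -1; Y_i}(m \mid y_i) \cdot P(X_i = -1 \mid Y = y) + f_{M_i \mid X_i = 1; Y_i}(m \mid y_i) \cdot P(X_i = 1 \mid Y = y)$$

For $X_i = -1$, $Y_i = R_i + V_i \sim N(0, \sigma^2 + \kappa^2)$.

Hence $\text{cov}(M_i, Y_i) = \text{cov}(R_i, R_i + V_i) = \text{var}(R_i) + 0 = \kappa^2$

And from properties of the multivariate normal distribution (see slides Chapter 1, page 29) we know

$$(M_i \mid Y_i = y, X_i = -1) = (R_i \mid Y_i = y) \sim N\left(\frac{\kappa^2}{\kappa^2 + \sigma^2} \cdot y, \frac{\kappa^2 \sigma^2}{\kappa^2 + \sigma^2}\right)$$

The same holds for $X_i = 1$, replacing $\kappa^2$ by $K^2$.

This leads to

$$E(M_i \mid Y = y) = y_i \cdot \left[\frac{\kappa^2}{\kappa^2 + \sigma^2} \cdot P(X_i = -1 \mid Y = y) + \frac{K^2}{K^2 + \sigma^2} \cdot P(X_i = 1 \mid Y = y)\right]$$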
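The resulting posterior mean is a weighted shrinkage of the observation. A minimal sketch follows; the function name and the parameter values in the usage lines are hypothetical, and the posterior label probability $P(X_i = 1 \mid Y = y)$ is taken as a given input (in practice it would be estimated, for instance from Gibbs sampler output).

```python
def posterior_mean(y_i, prob_plus, kappa2, K2, sigma2):
    """E(M_i | Y = y) as a mixture of two linear shrinkage rules.

    prob_plus is P(X_i = 1 | Y = y); P(X_i = -1 | Y = y) is its complement.
    """
    shrink_small = kappa2 / (kappa2 + sigma2)   # factor for label X_i = -1
    shrink_large = K2 / (K2 + sigma2)           # factor for label X_i = +1
    return y_i * (shrink_small * (1.0 - prob_plus) + shrink_large * prob_plus)

# With kappa2 << sigma2 << K2: nearly full shrinkage for label -1,
# hardly any shrinkage for label +1.
print(posterior_mean(2.0, prob_plus=0.0, kappa2=0.1, K2=9.0, sigma2=1.0))
print(posterior_mean(2.0, prob_plus=1.0, kappa2=0.1, K2=9.0, sigma2=1.0))
```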


Bayesian inference 2: posterior label probabilities

We still need $P(X_i = -1 \mid Y = y)$. We compute the marginal posterior probabilities of $X_i$ from the joint posterior $P(X = x \mid Y = y)$

A Gibbs sampler for this posterior probability would draw from

$$P(X_i = x_{n,i} \mid X_{I \setminus \{i\}} = x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p}; Y = y)$$

$$= P(X_i = x_{n,i} \mid X_{I \setminus \{i\}} = x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p}; Y_i = y_i)$$

$$= \frac{P(X_i = x_{n,i} \mid X_{I \setminus \{i\}} = x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p}) \cdot f_{Y_i \mid X}(y_i \mid x_{n;1,\ldots,i-1}\, x_{n,i}\, x_{n-1;i+1,\ldots,p})}{f_{Y_i \mid X_{I \setminus \{i\}}}(y_i \mid x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p})}$$

$$= P(X_i = x_{n,i} \mid X_{I \setminus \{i\}} = x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p}) \cdot \frac{f_{Y_i \mid X_i}(y_i \mid x_{n,i})}{f_{Y_i \mid X_i}(y_i \mid 1) \cdot P(X_i = 1) + f_{Y_i \mid X_i}(y_i \mid -1) \cdot P(X_i = -1)}$$

This expression has three components

1. Conditional probabilities

$Y_i \mid X_i = -1 \sim N(0, \kappa^2 + \sigma^2)$ and $Y_i \mid X_i = 1 \sim N(0, K^2 + \sigma^2)$

2. Marginal probabilities

The prior marginal probabilities $P(X_i = 1)$ and $P(X_i = -1)$ have to be computed from (Markov Chain) Monte Carlo sampling of the prior model.

3. The transition probabilities (see next slide)


Transition probabilities

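We know (from the proof of Hammersley-Clifford)

$$P(X_i = x_i \mid X_{I \setminus \{i\}} = x_{I \setminus \{i\}}) = P(X_i = x_i \mid X_{\partial i} = x_{\partial i}) = \frac{\exp\left(-\sum_{C \mid i \in C} H_C(x_C)\right)}{\sum_{y_i} \exp\left(-\sum_{C \mid i \in C} H_C(y_i\, x_{C \setminus \{i\}})\right)}$$

which is in our case

$$P(X_i = x_i \mid X_{i-1} = x_{i-1}, X_{i+1} = x_{i+1}) = \frac{\exp\left(-\tau (x_i x_{i-1} + x_i x_{i+1})\right) \cdot \exp(-\gamma x_i)}{\sum_{y_i \in \{-1, 1\}} \exp\left(-\tau (y_i x_{i-1} + y_i x_{i+1})\right) \exp(-\gamma y_i)}$$

$$= \frac{\exp\left(-\tau (x_i x_{i-1} + x_i x_{i+1})\right) \cdot \exp(-\gamma x_i)}{\exp\left(-\tau (x_{i-1} + x_{i+1})\right) \exp(-\gamma) + \exp\left(+\tau (x_{i-1} + x_{i+1})\right) \exp(\gamma)}$$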
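The last expression translates directly into code. The function below is a sketch for an internal site of the one-dimensional Ising chain; the name `ising_conditional` and the parameter values in the usage lines are illustrative. Passing 0 for a missing neighbour at the two end points of the chain is an assumption made here for convenience: it simply removes that pair potential.

```python
import math

def ising_conditional(x_i, left, right, tau, gamma):
    """P(X_i = x_i | X_{i-1} = left, X_{i+1} = right) for the 1D Ising model."""
    def weight(v):
        # exp(-tau * (v * left + v * right)) * exp(-gamma * v)
        return math.exp(-tau * v * (left + right) - gamma * v)
    return weight(x_i) / (weight(1) + weight(-1))

# No interaction and no drift: both labels are equally likely.
print(ising_conditional(1, 1, 1, tau=0.0, gamma=0.0))

# With the sign convention of the text, tau < 0 favours equal neighbours.
print(ising_conditional(1, 1, 1, tau=-1.0, gamma=0.0))
```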


2.3.2 Metropolis-Hastings sampler

Gibbs-sampler

1. Based on conditional probabilities in multidimensional random vector ⇒Markov random field

2. If vector components are highly correlated, conditional sampling leads to

values that are close to old ones: slow move through range of possible

values, hence slow convergence

Metropolis-Hastings sampler (MH)

1. Local update of previous sample: Markov Chain of samples (like Gibbs

sampler)

2. Based on joint probabilities ⇒ Gibbs random field

A GRF is defined through clique potentials (slide 65); evaluating the joint probability then requires the computation of the normalising constant (partition function). MH is able to sample from $f_X(x)$ if it is known only up to such a constant; cf. rejection sampling, slide 47

3. One or more dimensions (↔ Gibbs sampler is always for random vectors)

4. Uses rejection sampling: new sample has to be accepted


A proposal/transition distribution

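Given a state $\mathbf{X}_n$, a possible new state $\mathbf{X}_n'$ is generated from a distribution $q(\mathbf{x} \mid \mathbf{X}_n = \mathbf{x}_n)$

In principle, this proposal distribution can be any good choice. This distribution should be easy to work with. It typically describes local updates.

The new state is accepted if

$$\mathbf{X}_{n+1} = \mathbf{X}_n' \Leftrightarrow U \le \frac{f_X(\mathbf{X}_n') \cdot q(\mathbf{X}_n \mid \mathbf{X}_n')}{f_X(\mathbf{X}_n) \cdot q(\mathbf{X}_n' \mid \mathbf{X}_n)}$$

where $U \sim \text{uniform}[0, 1]$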


The acceptance probability

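Given a proposal $\mathbf{X}_n' = \mathbf{x}_n'$ the probability that it is accepted equals

$$\alpha(\mathbf{x}_n'; \mathbf{x}_n) = \min\left(1, \frac{f_X(\mathbf{x}_n') \cdot q(\mathbf{x}_n \mid \mathbf{x}_n')}{f_X(\mathbf{x}_n) \cdot q(\mathbf{x}_n' \mid \mathbf{x}_n)}\right)$$

Remark: If the distribution has the form $f_X(\mathbf{x}) = \frac{1}{Z} \exp[-H(\mathbf{x})]$ then the acceptance probability does not depend on Z. Often Z is very hard to find (integration/summation over all possible configurations).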
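The cancellation of Z is easy to see in code. In the sketch below, `acceptance_probability` is an illustrative helper in which `q(a, b)` stands for $q(a \mid b)$; the target $H(x) = x^2/2$ and the flat symmetric proposal are assumptions chosen only to keep the example short.

```python
import math

def acceptance_probability(f, q, x_new, x_old):
    """alpha(x_new; x_old) for a target f known up to a constant; q(a, b) = q(a | b)."""
    ratio = (f(x_new) * q(x_old, x_new)) / (f(x_old) * q(x_new, x_old))
    return min(1.0, ratio)

def f_unnormalised(x):
    return math.exp(-0.5 * x * x)                 # exp(-H(x)), Z left out

def f_normalised(x):
    return f_unnormalised(x) / math.sqrt(2.0 * math.pi)   # here Z happens to be known

def q_symmetric(a, b):
    return 1.0                                    # any symmetric q cancels as well

a_unnorm = acceptance_probability(f_unnormalised, q_symmetric, 1.0, 0.0)
a_norm = acceptance_probability(f_normalised, q_symmetric, 1.0, 0.0)
print(a_unnorm, a_norm)    # both equal exp(-0.5), up to rounding
```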


Transition probabilities from one state to the next

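The transition probability becomes (case of discrete states)

• For $\mathbf{x}_{n+1} \neq \mathbf{x}_n$:

$$P(\mathbf{X}_{n+1} = \mathbf{x}_{n+1} \mid \mathbf{X}_n = \mathbf{x}_n) = P(\mathbf{X}_n' = \mathbf{x}_{n+1} \mid \mathbf{X}_n = \mathbf{x}_n) \cdot P(\mathbf{x}_{n+1} \text{ accepted} \mid \mathbf{X}_n = \mathbf{x}_n, \mathbf{X}_n' = \mathbf{x}_{n+1}) = q(\mathbf{x}_{n+1} \mid \mathbf{x}_n) \cdot \alpha(\mathbf{x}_{n+1}; \mathbf{x}_n)$$

• The probability that the proposed state (whatever the proposal is) will be rejected, given that the current state is $\mathbf{X}_n = \mathbf{x}_n$, equals

$$r(\mathbf{x}_n) := P(\text{rejected} \mid \mathbf{X}_n = \mathbf{x}_n) = \sum_{\mathbf{x}} P(\mathbf{X}_n' = \mathbf{x} \mid \mathbf{X}_n = \mathbf{x}_n) \cdot P(\mathbf{x} \text{ rejected} \mid \mathbf{X}_n = \mathbf{x}_n, \mathbf{X}_n' = \mathbf{x})$$

$$= \sum_{\mathbf{x}} q(\mathbf{x} \mid \mathbf{x}_n) \cdot (1 - \alpha(\mathbf{x}; \mathbf{x}_n)) = 1 - \sum_{\mathbf{x}} q(\mathbf{x} \mid \mathbf{x}_n) \cdot \alpha(\mathbf{x}; \mathbf{x}_n) =: 1 - a(\mathbf{x}_n)$$

• For $\mathbf{x}_{n+1} = \mathbf{x}_n$ we obtain

$$P(\mathbf{X}_{n+1} = \mathbf{x}_n \mid \mathbf{X}_n = \mathbf{x}_n) = q(\mathbf{x}_n \mid \mathbf{x}_n) \cdot \alpha(\mathbf{x}_n; \mathbf{x}_n) + (1 - a(\mathbf{x}_n))$$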


Equilibrium distribution

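The objective distribution $f_X(\mathbf{x})$ is an invariant distribution of a Metropolis-Hastings sampler

Proof

Denote the transition probabilities $P_{\mathbf{x}\mathbf{y}} = P(\mathbf{X}_{n+1} = \mathbf{y} \mid \mathbf{X}_n = \mathbf{x})$

It holds that $P_{\mathbf{x}\mathbf{y}} = q(\mathbf{y} \mid \mathbf{x}) \cdot \alpha(\mathbf{y}; \mathbf{x}) + \delta(\mathbf{x}, \mathbf{y}) \cdot (1 - a(\mathbf{x}))$

where $\delta(\mathbf{x}, \mathbf{y})$ is the Kronecker-delta.

We consider $P_{\mathbf{x}\mathbf{y}} \cdot f_X(\mathbf{x}) = q(\mathbf{y} \mid \mathbf{x}) \cdot \alpha(\mathbf{y}; \mathbf{x}) \cdot f_X(\mathbf{x}) + \delta(\mathbf{x}, \mathbf{y}) \cdot (1 - a(\mathbf{x})) \cdot f_X(\mathbf{x})$

We have

$$\alpha(\mathbf{y}; \mathbf{x}) \cdot q(\mathbf{y} \mid \mathbf{x}) \cdot f_X(\mathbf{x}) = \min\left(1, \frac{f_X(\mathbf{y}) \cdot q(\mathbf{x} \mid \mathbf{y})}{f_X(\mathbf{x}) \cdot q(\mathbf{y} \mid \mathbf{x})}\right) \cdot q(\mathbf{y} \mid \mathbf{x}) \cdot f_X(\mathbf{x})$$

$$= \min\left(q(\mathbf{y} \mid \mathbf{x}) \cdot f_X(\mathbf{x}),\; f_X(\mathbf{y}) \cdot q(\mathbf{x} \mid \mathbf{y})\right)$$

$$= \min\left(\frac{f_X(\mathbf{x}) \cdot q(\mathbf{y} \mid \mathbf{x})}{f_X(\mathbf{y}) \cdot q(\mathbf{x} \mid \mathbf{y})}, 1\right) \cdot q(\mathbf{x} \mid \mathbf{y}) \cdot f_X(\mathbf{y})$$

$$= \alpha(\mathbf{x}; \mathbf{y}) \cdot q(\mathbf{x} \mid \mathbf{y}) \cdot f_X(\mathbf{y})$$

The Kronecker-delta term is only active if $\mathbf{x} = \mathbf{y}$, so formally, one can always write

$$\delta(\mathbf{x}, \mathbf{y}) \cdot (1 - a(\mathbf{x})) \cdot f_X(\mathbf{x}) = \delta(\mathbf{y}, \mathbf{x}) \cdot (1 - a(\mathbf{y})) \cdot f_X(\mathbf{y})$$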


Conclusion of the proof

We may conclude that Pxy · fX(x) = Pyx · fX(y)

This is a detailed balance equation. Not only is the objective distribution in-

variant under the Metropolis-Hastings sampler, but also

The Metropolis-Hastings sampler is reversible
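For a finite state space the whole transition matrix can be written down and the detailed balance equation verified exactly. The sketch below uses a hypothetical four-state target, known only through unnormalised weights, and a deliberately asymmetric proposal matrix `q` with `q[x][y]` standing for $q(y \mid x)$; both are illustrative choices.

```python
weights = [1.0, 2.0, 3.0, 4.0]          # target up to a constant
n = len(weights)

# Asymmetric proposal: q[x][y] = q(y | x), every row sums to one.
q = [[0.10, 0.60, 0.20, 0.10],
     [0.30, 0.10, 0.50, 0.10],
     [0.20, 0.20, 0.20, 0.40],
     [0.25, 0.25, 0.25, 0.25]]

def alpha(y, x):
    """Acceptance probability alpha(y; x) of proposal y from current state x."""
    return min(1.0, (weights[y] * q[y][x]) / (weights[x] * q[x][y]))

def transition(x, y):
    """P_xy = q(y|x) * alpha(y; x) + delta(x, y) * (1 - a(x))."""
    move = q[x][y] * alpha(y, x)
    if x != y:
        return move
    a_x = sum(q[x][z] * alpha(z, x) for z in range(n))
    return move + (1.0 - a_x)

P = [[transition(x, y) for y in range(n)] for x in range(n)]

rows_ok = all(abs(sum(row) - 1.0) < 1e-12 for row in P)
detailed_ok = all(abs(weights[x] * P[x][y] - weights[y] * P[y][x]) < 1e-12
                  for x in range(n) for y in range(n))
invariant_ok = all(abs(sum(weights[x] * P[x][y] for x in range(n)) - weights[y]) < 1e-12
                   for y in range(n))
print(rows_ok, detailed_ok, invariant_ok)
```

The checks use the unnormalised weights throughout, which is legitimate because both sides of the balance equations scale with the same constant.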


A special case: the original Metropolis sampler

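Suppose that the proposal distribution is symmetric in the sense that

$$q(\mathbf{x} \mid \mathbf{y}) = q(\mathbf{y} \mid \mathbf{x})$$

This is realized by choosing $\mathbf{Y} = \mathbf{X} + \boldsymbol{\eta}$ where $\boldsymbol{\eta}$ has a zero mean symmetric distribution $g(\boldsymbol{\eta})$, hence $q(\mathbf{y} \mid \mathbf{x}) = g(\mathbf{y} - \mathbf{x})$.

Then the acceptance probability for a proposal $\mathbf{y}$ given a current state $\mathbf{x}$ becomes

$$\alpha(\mathbf{y}; \mathbf{x}) = \min\left(1, \frac{f_X(\mathbf{y}) \cdot q(\mathbf{x} \mid \mathbf{y})}{f_X(\mathbf{x}) \cdot q(\mathbf{y} \mid \mathbf{x})}\right) = \min\left(1, \frac{f_X(\mathbf{y})}{f_X(\mathbf{x})}\right)$$

This was the original procedure, proposed by Metropolis. It was later refined by Hastings for arbitrary proposal distributions.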
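A random-walk Metropolis sampler along these lines can be sketched as follows. The one-dimensional target (a standard normal, specified only through its unnormalised log-density), the Gaussian increment, the step size and the seed are all assumptions made for illustration. Working with log-densities avoids numerical underflow; the test $\log U \le \log f_X(y) - \log f_X(x)$ is equivalent to $U \le f_X(y)/f_X(x)$.

```python
import math
import random

def metropolis(log_f, x0, n_iter, step, rng):
    """Random-walk Metropolis: proposal y = x + eta, eta ~ N(0, step^2)."""
    x = x0
    chain = []
    accepted = 0
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, step)            # symmetric proposal
        log_u = math.log(1.0 - rng.random())    # log of a uniform on (0, 1]
        if log_u <= log_f(y) - log_f(x):        # accept with prob. min(1, f(y)/f(x))
            x = y
            accepted += 1
        chain.append(x)                         # a rejected proposal repeats x
    return chain, accepted / n_iter

def log_target(x):
    return -0.5 * x * x                         # standard normal, constant omitted

chain, rate = metropolis(log_target, 0.0, 50000, 2.0, random.Random(7))
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
print(round(mean, 2), round(var, 2), round(rate, 2))
```

The sample mean and variance should be close to 0 and 1, the moments of the target.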


An example of a local update

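Consider $q(\mathbf{y} \mid \mathbf{x}) = f_{X_i \mid X_{I \setminus \{i\}}}(y_i \mid x_{I \setminus \{i\}})$ if $y_{I \setminus \{i\}} = x_{I \setminus \{i\}}$ (and otherwise $q(\mathbf{y} \mid \mathbf{x}) = 0$)

Configurations with $y_{I \setminus \{i\}} \neq x_{I \setminus \{i\}}$ have probability (density) zero.

It then holds

$$f_X(\mathbf{x}) \cdot q(\mathbf{y} \mid \mathbf{x}) = f_X(\mathbf{x}) \cdot f_{X_i \mid X_{I \setminus \{i\}}}(y_i \mid x_{I \setminus \{i\}}) = f_X(\mathbf{x}) \cdot \frac{f_X(x_{I \setminus \{i\}}\, y_i)}{f_{X_{I \setminus \{i\}}}(x_{I \setminus \{i\}})}$$

and, keeping in mind that $y_{I \setminus \{i\}} = x_{I \setminus \{i\}}$,

$$f_X(\mathbf{y}) \cdot q(\mathbf{x} \mid \mathbf{y}) = f_X(\mathbf{y}) \cdot f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid y_{I \setminus \{i\}}) = f_X(\mathbf{y}) \cdot \frac{f_X(y_{I \setminus \{i\}}\, x_i)}{f_{X_{I \setminus \{i\}}}(y_{I \setminus \{i\}})}$$

$$= f_X(y_i\, x_{I \setminus \{i\}}) \cdot \frac{f_X(\mathbf{x})}{f_{X_{I \setminus \{i\}}}(x_{I \setminus \{i\}})} = f_X(\mathbf{x}) \cdot q(\mathbf{y} \mid \mathbf{x})$$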


An example of a local update: Gibbs sampler

From the previous slide, it follows

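1. The acceptance probability $\alpha(\mathbf{y}; \mathbf{x}) = \min\left(1, \dfrac{f_X(\mathbf{y}) \cdot q(\mathbf{x} \mid \mathbf{y})}{f_X(\mathbf{x}) \cdot q(\mathbf{y} \mid \mathbf{x})}\right) = 1$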

2. The process is reversible

In fact, this is one step of a Gibbs sampler.

1. In a general Metropolis-Hastings sampler, there is a free proposal, which is

evaluated: the evaluation uses the joint distribution

2. In the specific case of the Gibbs sampler, the proposal uses the conditional

distribution and there is no evaluation afterwards (so no joint distribution)


Convergence

We have seen that the objective distribution is invariant under a Metropolis-

Hastings sampler.

This is not enough for good convergence.

Indeed, the case of one step of a Gibbs sampler illustrates that updates may

be too local: in this case, only one component of the vector is subject to pos-

sible change. That implies that many states x are unreachable. The Markov

Chain is then reducible.

Irreducibility is obtained if q(y|x) > 0 for all pairs (x,y).

As this condition is sometimes too restrictive for every sampler separately, one may consider a combination of different proposal distributions, e.g., a sequence of one-component-at-a-time Metropolis-Hastings samplers (such as the Gibbs sampler).


Choice of the proposal distribution

The speed of convergence in a Metropolis-Hastings sampler depends on the

correlation between subsequent samples.

High correlation→ slow convergence

Subsequent samples should be as independent as possible. (Fully indepen-

dent samples are optimal, in the sense that the limiting distribution is reached

instantaneously, but they are often difficult to realize or difficult to sample from)

Inter-sample dependence is governed by two competing objectives

• Acceptance probability: low acceptance probability means high probability

that two subsequent samples are identical, hence, high correlation.

Acceptance probability is enhanced by proposal distributions with small

variance, i.e., a distribution that favours very local updates.

• Correlation between current state and proposed state. This source of cor-

relation is reduced by proposals that favour large updates.
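The trade-off can be observed in a small experiment. The sketch below runs a random-walk Metropolis sampler on a standard normal target with three proposal standard deviations and reports the acceptance rate and the lag-1 autocorrelation of the chain. The target, the step sizes 0.1, 2.5 and 50, the chain length and the seed are hypothetical choices made only for illustration.

```python
import math
import random

def log_target(x):
    return -0.5 * x * x                          # standard normal, constant omitted

def run_chain(step, n_iter=20000, seed=3):
    """Random-walk Metropolis with Gaussian increments of standard deviation step."""
    rng = random.Random(seed)
    x, chain, accepted = 0.0, [], 0
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, step)
        if math.log(1.0 - rng.random()) <= log_target(y) - log_target(x):
            x, accepted = y, accepted + 1
        chain.append(x)
    return chain, accepted / n_iter

def lag1_autocorrelation(chain):
    """Sample correlation between subsequent states of the chain."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((c - mean) ** 2 for c in chain) / n
    cov = sum((chain[k] - mean) * (chain[k + 1] - mean) for k in range(n - 1)) / n
    return cov / var

results = {}
for step in (0.1, 2.5, 50.0):
    chain, rate = run_chain(step)
    results[step] = (rate, lag1_autocorrelation(chain))
    print(step, round(rate, 3), round(results[step][1], 3))
```

Very small steps are accepted almost always but barely move; very large steps are mostly rejected, so the chain repeats itself. In both cases subsequent samples are highly correlated, and an intermediate step size gives the lowest autocorrelation of the three.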

