Math 5040 Markov chain Monte Carlo methods


Transcript of Math 5040 Markov chain Monte Carlo methods

Page 1: Math 5040 Markov chain Monte Carlo methods

Math 5040: Markov chain Monte Carlo methods

S. Ethier

References:
1. Sheldon Ross, Introduction to Probability Models, Section 4.9.
2. Gregory Lawler, Introduction to Stochastic Processes, Section 7.3.
3. Persi Diaconis, The Markov Chain Monte Carlo Revolution, Bulletin of the AMS, 2008.

Page 2

Let h be a positive function on a large finite set S. How do we simulate an S-valued random variable X with distribution

π(x) = P(X = x) = h(x) / ∑_{y∈S} h(y)?  (1)

Typically, we do not know the value of the sum in the denominator. For example, h could be identically 1, but we might not know, or be able to count, the number of elements in S.

The idea of MCMC is to find a Markov chain in S with stationary distribution π. Simulate the Markov chain for sufficiently many steps for it to be approximately in equilibrium. The proportion of time spent in state i will then be approximately π(i).
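To make this concrete, here is a small sketch I am adding (it is not in the original notes): a hypothetical three-state chain is run for many steps, and the occupation frequencies are compared with the stationary distribution obtained by iterating π ↦ πP. The matrix P below is made up for illustration.

```python
import random

# A hypothetical 3-state transition matrix (each row sums to 1).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

def stationary(P, iters=5000):
    """Approximate the stationary distribution by iterating pi -> pi P."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(len(P)))
              for j in range(len(P))]
    return pi

def occupation_frequencies(P, steps=200_000, seed=0):
    """Run the chain and record the proportion of time spent in each state."""
    rng = random.Random(seed)
    counts = [0] * len(P)
    state = 0
    for _ in range(steps):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        counts[state] += 1
    return [c / steps for c in counts]

pi = stationary(P)
freq = occupation_frequencies(P)
# For a long run, freq should be close to pi.
```

For 200,000 steps the two vectors typically agree to two decimal places, which is the sense in which occupation time approximates π.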

Page 3

Example

Let S be the set of all 25 × 25 matrices of 0s and 1s with no adjacent 1s in the same row or column. Let M be uniformly distributed over S. We want to estimate the expected proportion of 1s in M by simulation. The first problem is to construct the Markov chain.

Consider the following one-step transition. Starting at M ∈ S, choose an entry (i, j) at random from the 25² possibilities. If M(i, j) = 1, then change the entry at (i, j) to 0. If M(i, j) = 0, then change the entry at (i, j) to 1 if the resulting matrix belongs to S; otherwise leave it unchanged. Then P(M₁, M₂) = 1/25² if M₁ and M₂ are elements of S that differ in exactly one entry, and P(M₁, M₁) = j/25², where j is the number of zero entries of M₁ that cannot be changed to a 1.

This MC is irreducible and aperiodic with P symmetric, hencedoubly stochastic. So the uniform distribution is stationary.

Page 4

K = 25; M = ConstantArray[0, {K, K}];
steps = 1000000; ones = 0; sum = 0;
For[n = 1, n <= steps, n++,
 i = Floor[K*RandomReal[]] + 1;
 j = Floor[K*RandomReal[]] + 1;
 If[M[[i, j]] == 1, M[[i, j]] = 0; ones--,
  If[(i == 1 || M[[Max[i - 1, 1], j]] == 0) &&
    (i == K || M[[Min[i + 1, K], j]] == 0) &&
    (j == 1 || M[[i, Max[j - 1, 1]]] == 0) &&
    (j == K || M[[i, Min[j + 1, K]]] == 0),
   M[[i, j]] = 1; ones++]];
 sum += ones];

Print[N[sum/(steps*K*K)]];

0.230537

Page 5

Remark. Instead of sampling the Markov chain at every step, we might sample it every 2500 steps, say; then the observations would be closer to independent.

How would we estimate the size of the set of 25 × 25 matrices of 0s and 1s with no adjacent 1s in the same row or column? This is a tricky problem, but see Lawler and Coyle's Lectures on Contemporary Probability (1999) for a suggested solution. It is known that the number |S_N| of N × N matrices of 0s and 1s with no adjacent 1s in the same row or column satisfies

β = lim_{N→∞} |S_N|^{1/N²},

where 1.50304 ≤ β ≤ 1.50351; roughly, |S_N| ≈ β^{N²}.
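For very small N, |S_N| can be counted by brute force; the sketch below is my addition (it says nothing about the large-N asymptotics, but it shows exactly what is being counted). The known values are |S_1| = 2, |S_2| = 7, |S_3| = 63.

```python
from itertools import product

def count_hard_squares(N):
    """Count N x N 0-1 matrices with no two horizontally or vertically
    adjacent 1s, by enumerating all 2^(N^2) matrices."""
    count = 0
    for bits in product((0, 1), repeat=N * N):
        M = [bits[i * N:(i + 1) * N] for i in range(N)]
        ok_rows = all(M[i][j] + M[i][j + 1] < 2
                      for i in range(N) for j in range(N - 1))
        ok_cols = all(M[i][j] + M[i + 1][j] < 2
                      for i in range(N - 1) for j in range(N))
        if ok_rows and ok_cols:
            count += 1
    return count

for N in (1, 2, 3):
    c = count_hard_squares(N)
    print(N, c, round(c ** (1 / N**2), 4))
# The N-th roots drift slowly toward the limit beta ~ 1.503.
```

Brute force is hopeless beyond N of about 5, which is precisely why MCMC-based estimates are needed for the 25 × 25 case.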

Page 6

Reversibility

The method works well when the function h is identically 1, but in the general case we need the concept of reversibility. Consider a Markov chain X₀, X₁, X₂, ... with transition matrix P and stationary initial distribution π; hence it is a stationary process. Now reverse the process: X_n, X_{n−1}, X_{n−2}, .... It is Markovian with transition probabilities

Q(i, j) = P(X_n = j | X_{n+1} = i)
        = P(X_n = j) P(X_{n+1} = i | X_n = j) / P(X_{n+1} = i)
        = π(j) P(j, i) / π(i).

If Q(i, j) = P(i, j) for all i, j, or equivalently,

π(i) P(i, j) = π(j) P(j, i) for all i, j,  (2)

then the Markov chain is said to be reversible.

Page 7

The equations

π(i) P(i, j) = π(j) P(j, i) for all i, j

are called the detailed balance equations. Given a transition matrix P with a unique stationary distribution π, if we can find x(i) ≥ 0 such that

x(i) P(i, j) = x(j) P(j, i) for all i, j,

and ∑_i x(i) = 1, then summing over i we get

x(j) = ∑_i x(i) P(i, j),

so x(i) = π(i) by uniqueness of the stationary distribution.
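This implication can be checked numerically. The following sketch (my addition, using a small made-up reversible chain: a reflecting walk on {0, 1, 2}) verifies that a probability vector satisfying detailed balance is stationary:

```python
# Reflecting random walk on {0, 1, 2}: a reversible birth-death chain.
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]

# Candidate x proportional to (1, 2, 1), normalized to sum to 1.
x = [0.25, 0.5, 0.25]
n = len(P)

# Detailed balance: x(i) P(i, j) == x(j) P(j, i) for all i, j.
balanced = all(abs(x[i] * P[i][j] - x[j] * P[j][i]) < 1e-12
               for i in range(n) for j in range(n))

# Summing over i then gives stationarity: x(j) == sum_i x(i) P(i, j).
xP = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
print(balanced, xP)  # True [0.25, 0.5, 0.25]
```

Detailed balance is strictly stronger than stationarity, but it is far easier to check: each equation involves only one pair of states.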

Page 8

Example

Recall the random walk on a connected finite graph. There are weights w_ij associated with the edge from i to j. If we define

P(i, j) = w_ij / ∑_k w_ik,

where the sum extends over all neighbors k of vertex i, we have an irreducible finite Markov chain, so there is a unique stationary distribution. What is it? Check the detailed balance equations. Use w_ij = w_ji to derive

π(i) = ∑_j w_ij / ∑_i ∑_j w_ij.
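As a numerical check (my addition, with made-up symmetric weights on a four-vertex graph), this π satisfies detailed balance and hence is stationary:

```python
# Hypothetical symmetric weights on a connected 4-vertex graph.
w = [[0, 2, 0, 1],
     [2, 0, 3, 0],
     [0, 3, 0, 4],
     [1, 0, 4, 0]]
n = len(w)

row = [sum(w[i]) for i in range(n)]        # sum_k w_ik over neighbors of i
total = sum(row)                           # sum_i sum_j w_ij
P = [[w[i][j] / row[i] for j in range(n)] for i in range(n)]
pi = [row[i] / total for i in range(n)]    # the claimed stationary distribution

# pi(i) P(i, j) = w_ij / total is symmetric in i and j because w_ij = w_ji,
# so detailed balance holds; summing over i gives pi P = pi.
balanced = all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < 1e-12
               for i in range(n) for j in range(n))
piP = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
```

The key point, visible in the comment above, is that π(i)P(i, j) collapses to w_ij/total, on which the symmetry assumption acts directly.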

Page 9

How do we find a Markov chain for general h? Here is the Hastings–Metropolis algorithm. Let Q be any irreducible transition matrix on S and define P to be the transition matrix

P(i, j) = Q(i, j) α(i, j)                              if i ≠ j,
P(i, i) = Q(i, i) + ∑_{k: k≠i} Q(i, k)(1 − α(i, k)),  (3)

where

α(i, j) = min( h(j) Q(j, i) / [h(i) Q(i, j)], 1 ).  (4)

Interpretation: Instead of making a transition from i to j with probability Q(i, j), we "censor" that transition with probability 1 − α(i, j). That is, with probability α(i, j) the proposed transition from i to j is accepted, and with probability 1 − α(i, j) the proposed transition from i to j is rejected and the chain stays at i.
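Definitions (3) and (4) translate directly into code. In this sketch (my addition; h and the proposal Q are made up, and Q is deliberately asymmetric), P is assembled from Q and α, and the distribution proportional to h is checked to be stationary:

```python
h = [1.0, 3.0, 6.0]            # unnormalized target, known only up to scale
Q = [[0.2, 0.5, 0.3],          # any irreducible proposal chain on S
     [0.4, 0.1, 0.5],
     [0.3, 0.3, 0.4]]
n = len(h)

def alpha(i, j):
    """Acceptance probability (4); assumes Q(i, j) > 0 for i != j here."""
    return min(h[j] * Q[j][i] / (h[i] * Q[i][j]), 1.0)

# The Hastings-Metropolis chain (3): probability lost to censored
# off-diagonal moves is returned to the diagonal.
P = [[Q[i][j] * alpha(i, j) if i != j
      else Q[i][i] + sum(Q[i][k] * (1.0 - alpha(i, k))
                         for k in range(n) if k != i)
      for j in range(n)] for i in range(n)]

pi = [x / sum(h) for x in h]
rows_ok = all(abs(sum(P[i]) - 1.0) < 1e-12 for i in range(n))
piP = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
# rows_ok should be True, and piP should agree with pi.
```

Note that the normalizing constant of π never enters α: only the ratio h(j)/h(i) is needed, which is the whole point of the construction.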

Page 10

If α(i, j) = h(j)Q(j, i)/[h(i)Q(i, j)], then α(j, i) = 1 and

h(i) P(i, j) = h(i) Q(i, j) α(i, j) = h(i) Q(i, j) · h(j) Q(j, i) / [h(i) Q(i, j)]
             = h(j) Q(j, i) α(j, i) = h(j) P(j, i).  (5)

Similarly if α(i, j) = 1. So the detailed balance equations hold for P. In particular, the unique stationary distribution is proportional to the function h.

Page 11

Example

Let S be the set of all K × K matrices of 0s and 1s. (Physicists often take matrices of −1s and 1s instead, to indicate spin orientation.) Let g be a symmetric function on {0, 1}² (i.e., g(0, 1) = g(1, 0)). Let β be a positive parameter. Define the energy of the matrix M to be

E = ∑_{(i,j)∼(i′,j′)} g(M(i, j), M(i′, j′)),  (6)

where (i, j) ∼ (i′, j′) means that (i, j) and (i′, j′) are nearest neighbors, that is, |i − i′| + |j − j′| = 1. We assume that the probability distribution of interest on S gives equal weight to all matrices with the same energy, so we let

h(M) = e^{−βE}.  (7)

Page 12

The previous example is the limiting case with g(0, 0) = g(0, 1) = g(1, 0) = 0, g(1, 1) = 1, and β = ∞. Another important example is the Ising model, with g(0, 0) = g(1, 1) = 0 and g(0, 1) = 1. (Configurations, i.e. matrices, with many pairs of neighboring spins pointing in opposite directions have high energy.)

Starting with M₁, choose an entry at random from the K² possibilities and let M₂ be the matrix obtained by changing the chosen entry. If h(M₂)/h(M₁) ≥ 1, then return matrix M₂. If h(M₂)/h(M₁) = q < 1, then with probability q return the matrix M₂ and with probability 1 − q return the matrix M₁. This is precisely the Hastings–Metropolis algorithm.
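Here is a Python sketch of this accept/reject step (my addition; the notes' own code is in Mathematica). The full energy is recomputed at every step for clarity, though in practice one would compute only the local change around the flipped entry. The 8 × 8 size and β = 0.5 are illustrative choices.

```python
import math
import random

def energy(M, g):
    """E = sum of g over nearest-neighbor pairs, as in (6)."""
    K = len(M)
    E = 0.0
    for i in range(K):
        for j in range(K):
            if i + 1 < K:
                E += g(M[i][j], M[i + 1][j])
            if j + 1 < K:
                E += g(M[i][j], M[i][j + 1])
    return E

def metropolis_step(M, beta, g, rng):
    """Flip a random entry; accept with probability min(1, h(M2)/h(M1)),
    where h(M) = exp(-beta * E)."""
    K = len(M)
    i, j = rng.randrange(K), rng.randrange(K)
    E_old = energy(M, g)
    M[i][j] ^= 1                       # propose flipping the entry
    q = math.exp(-beta * (energy(M, g) - E_old))
    if q < 1.0 and rng.random() >= q:
        M[i][j] ^= 1                   # reject: undo the flip

def g_ising(a, b):
    """Ising energy: each unlike nearest-neighbor pair costs 1."""
    return 1 if a != b else 0

rng = random.Random(1)
M = [[0] * 8 for _ in range(8)]
for _ in range(1000):
    metropolis_step(M, beta=0.5, g=g_ising, rng=rng)
```

Since every proposed matrix is again a 0-1 matrix, the proposal chain here is symmetric, and the acceptance probability reduces to min(1, h(M₂)/h(M₁)) exactly as on the slide.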

Page 13

Gibbs sampler

There is a widely used version of the Hastings–Metropolis algorithm called the Gibbs sampler. Here we want to simulate the distribution of an n-dimensional random vector X = (X₁, X₂, ..., X_n). Again, its probability density π(x) is specified only up to a constant multiple. We assume that we can simulate a random variable with density equal to the conditional density of X_i, given the values of X_j for all j ≠ i.

Start from x = (x₁, ..., x_n). Choose a coordinate i at random. Then choose a value x according to the conditional density of X_i, given that X_j = x_j for all j ≠ i. Consider the state y = (x₁, ..., x_{i−1}, x, x_{i+1}, ..., x_n) as a candidate for a transition. Apply Hastings–Metropolis with ...

Page 14

Q(x, y) = (1/n) P(X_i = x | X_j = x_j for all j ≠ i) = π(y) / [n P(X_j = x_j for all j ≠ i)].

The acceptance probability is

α(x, y) = min( π(y) Q(y, x) / [π(x) Q(x, y)], 1 )
        = min( π(y) π(x) / [π(x) π(y)], 1 )
        = 1.

Hence the candidate state is always accepted.
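This cancellation can be checked by brute force on a small discrete example. In the sketch below (my addition; the joint distribution is made up), Q is the Gibbs proposal, and the ratio π(y)Q(y, x)/[π(x)Q(x, y)] is verified to be 1 for every admissible move:

```python
from itertools import product

# A made-up unnormalized distribution on {0, 1, 2}^2.
h = {(a, b): (a + 1) * (b + 2) + a * b for a, b in product(range(3), repeat=2)}
Z = sum(h.values())
pi = {s: v / Z for s, v in h.items()}
n = 2  # number of coordinates

def gibbs_Q(x, y):
    """Proposal density Q(x, y) for y differing from x in exactly one
    coordinate i: (1/n) times the conditional probability of y[i]
    given the remaining coordinates of x."""
    diff = [k for k in range(n) if x[k] != y[k]]
    if len(diff) != 1:
        return 0.0
    i = diff[0]
    section = [s for s in pi if all(s[k] == x[k] for k in range(n) if k != i)]
    return (pi[y] / sum(pi[s] for s in section)) / n

# The Hastings-Metropolis ratio is identically 1, because the conditioning
# event (and hence the denominator of Q) is the same for x and y.
ratios = [pi[y] * gibbs_Q(y, x) / (pi[x] * gibbs_Q(x, y))
          for x in pi for y in pi if gibbs_Q(x, y) > 0]
```

Every entry of `ratios` equals 1 up to rounding, so the candidate is always accepted, exactly as the algebra above shows.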

Page 15

The method also applies when the random variables arecontinuous.

Example

Suppose we want to simulate n points in the unit disk, each pair at least d apart. Take n = 35 and d = 1/4.

We can use the Gibbs sampler. Initialize as follows.

[Figure: the initial configuration of the 35 points, laid out on a grid inside the unit disk; both axes run from −1 to 1.]

Page 16

At each step, choose one of the n points at random. Then simulate a random point in the unit disk (rejection method) until we get one that is at distance at least d from each of the other points. Replace the chosen point with the new one.

n = 35; d = 1/4;
coords = ConstantArray[0, {n, 2}];
coords = {{-.75, 0.}, {-.5, 0.}, {-.25, 0.}, {0., 0.},
  {.25, 0.}, {.5, 0.}, {.75, 0.}, {-.75, .25}, {-.5, .25},
  {-.25, .25}, {0., .25}, {.25, .25}, {.5, .25}, {.75, .25},
  {-.75, -.25}, {-.5, -.25}, {-.25, -.25}, {0., -.25},
  {.25, -.25}, {.5, -.25}, {.75, -.25}, {-.75, .5},
  {-.5, .5}, {-.25, .5}, {0., .5}, {.25, .5}, {.5, .5},
  {.75, .5}, {-.75, -.5}, {-.5, -.5}, {-.25, -.5},
  {0., -.5}, {.25, -.5}, {.5, -.5}, {.75, -.5}};
Show[ContourPlot[x^2 + y^2 == 1, {x, -1, 1}, {y, -1, 1}],
 ListPlot[coords]]

For[run = 1, run <= 1000, run++,
 m = IntegerPart[n RandomReal[]] + 1;
 For[try = 1, try <= 1000, try++,
  x = 2 RandomReal[] - 1; y = 2 RandomReal[] - 1;
  If[x^2 + y^2 < 1, flag = 1;
   For[i = 1, i <= n, i++,
    If[Sqrt[(coords[[i, 1]] - x)^2 + (coords[[i, 2]] - y)^2] <= d
      && i != m, flag = 0]];
   If[flag == 1, coords[[m, 1]] = x; coords[[m, 2]] = y;
    try = 1001]]]];

Show[ContourPlot[x^2 + y^2 == 1, {x, -1, 1}, {y, -1, 1}],
 ListPlot[coords]]

Page 17

Output:

[Figure: the resulting configuration of the 35 points in the unit disk after the run; both axes run from −1 to 1.]

Page 18

Example

Self-avoiding walks (SAWs). (From the Lawler–Coyle book.) A SAW of length n in Z^d is a sequence ω = (ω₀, ω₁, ..., ω_n) with ω₀ = 0, ω_i ∈ Z^d, |ω_i − ω_{i−1}| = 1, and ω_i ≠ ω_j whenever i ≠ j. SAWs are models for polymer chains in chemistry.

Let Ω_n be the set of all SAWs of length n. What is |Ω_n|? Notice that

d^n ≤ |Ω_n| ≤ 2d(2d − 1)^{n−1}.  (8)

In fact, since

|Ω_{n+m}| ≤ |Ω_n| |Ω_m|,  (9)

it can be shown that

lim_{n→∞} |Ω_n|^{1/n} = β,  (10)

where d ≤ β ≤ 2d − 1.
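For small n these counts can be verified by direct enumeration; the depth-first sketch below is my addition. In Z² the first few values are |Ω₁| = 4, |Ω₂| = 12, |Ω₃| = 36, |Ω₄| = 100, consistent with bound (8) and with submultiplicativity (9).

```python
def count_saws(n, d=2):
    """Count self-avoiding walks of length n in Z^d by depth-first search."""
    steps = []
    for k in range(d):
        for s in (1, -1):
            e = [0] * d
            e[k] = s
            steps.append(tuple(e))
    origin = (0,) * d

    def extend(last, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for e in steps:
            nxt = tuple(a + b for a, b in zip(last, e))
            if nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.discard(nxt)
        return total

    return extend(origin, {origin}, n)

counts = [count_saws(n) for n in range(1, 5)]
print(counts)  # [4, 12, 36, 100]
```

The recursion branches at most 2d − 1 ways after the first step (it never immediately backtracks, since the previous site is in `visited`), which is exactly the upper bound in (8).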

Page 19

There are a number of interesting questions that one can ask about SAWs. For example, what is the asymptotic probability that two independent n-step SAWs, when combined end to end, form a 2n-step SAW? For another example, how large are SAWs; that is, how does E[|ω_n|²] behave for large n?

These questions are too hard even for specialists to answer. But perhaps we can get some sense of what is going on by simulating a random SAW. Notice that the obvious rejection method is extremely inefficient if n is large.

Let us assume from now on that d = 2. We use MCMC. We want to simulate the uniform distribution on Ω_n for some specified n. Our approach is based on what is called the pivot algorithm.

Page 20

Let O denote the set of orthogonal transformations of the plane that map Z² onto Z². These include the rotations by π/2, π, and 3π/2, and the reflections about the coordinate axes and about the lines y = x and y = −x. (We exclude the identity transformation from O, so |O| = 7.) Consider the following MC in Ω_n: Starting with ω = (ω₀, ω₁, ..., ω_n), choose a number at random from {0, 1, ..., n − 1} and call it k. Choose a transformation T at random from O. Consider the walk obtained by fixing the first k steps of the walk but performing the transformation T on the remaining part of the walk, using ω_k as the origin for the transformation. This gives us a new path, which may or may not be a SAW. If it is, return it as the new path; if it isn't, return the original path. This gives us a MC on Ω_n. It is irreducible and aperiodic with a symmetric transition matrix P. So the limiting distribution is uniform.

Page 21

L = 100;
saw = ConstantArray[0, {L + 1, 2}];
saw1 = ConstantArray[0, {L + 1, 2}];
saw2 = ConstantArray[0, {L + 1, 2}];
saw3 = ConstantArray[0, {L + 1, 2}];
For[n = 0, n <= L, n++,
 saw[[n + 1, 1]] = n; saw[[n + 1, 2]] = 0];
(* initial saw is a straight line to the right *)

For[step = 1, step <= 2000, step++,
 k = Floor[L RandomReal[]]; m = Floor[7 RandomReal[]] + 1;
 For[n = 0, n <= k, n++,
  saw1[[n + 1, 1]] = saw[[n + 1, 1]]; saw1[[n + 1, 2]] = saw[[n + 1, 2]]];
 (* first k steps of saw are unchanged in saw1 *)

 For[n = k, n <= L, n++,
  saw2[[n - k + 1, 1]] = saw[[n + 1, 1]] - saw[[k + 1, 1]];
  saw2[[n - k + 1, 2]] = saw[[n + 1, 2]] - saw[[k + 1, 2]]];
 (* remaining steps of saw are saved, after shifting origin, in saw2 *)
 (* Now we transform saw2 *)

 If[m == 1, For[j = 0, j <= L - k, j++,
   saw3[[j + 1, 1]] = -saw2[[j + 1, 2]]; saw3[[j + 1, 2]] = saw2[[j + 1, 1]]]];
 (* rotation by pi/2 *)

 If[m == 2, For[j = 0, j <= L - k, j++,
   saw3[[j + 1, 1]] = -saw2[[j + 1, 1]]; saw3[[j + 1, 2]] = -saw2[[j + 1, 2]]]];
 (* rotation by pi *)

 If[m == 3, For[j = 0, j <= L - k, j++,
   saw3[[j + 1, 1]] = saw2[[j + 1, 2]]; saw3[[j + 1, 2]] = -saw2[[j + 1, 1]]]];
 (* rotation by 3pi/2 *)

 If[m == 4, For[j = 0, j <= L - k, j++,
   saw3[[j + 1, 1]] = saw2[[j + 1, 1]]; saw3[[j + 1, 2]] = -saw2[[j + 1, 2]]]];
 (* reflection about x-axis *)

 If[m == 5, For[j = 0, j <= L - k, j++,
   saw3[[j + 1, 1]] = -saw2[[j + 1, 1]]; saw3[[j + 1, 2]] = saw2[[j + 1, 2]]]];
 (* reflection about y-axis *)

 If[m == 6, For[j = 0, j <= L - k, j++,
   saw3[[j + 1, 1]] = saw2[[j + 1, 2]]; saw3[[j + 1, 2]] = saw2[[j + 1, 1]]]];
 (* reflection about y=x *)

 If[m == 7, For[j = 0, j <= L - k, j++,
   saw3[[j + 1, 1]] = -saw2[[j + 1, 2]]; saw3[[j + 1, 2]] = -saw2[[j + 1, 1]]]];
 (* reflection about y=-x, i.e. (x,y) -> (-y,-x); the transcribed source
    repeated the pi/2 rotation here, which was evidently a typo *)

 For[n = k, n <= L, n++,
  saw1[[n + 1, 1]] = saw[[k + 1, 1]] + saw3[[n - k + 1, 1]];
  saw1[[n + 1, 2]] = saw[[k + 1, 2]] + saw3[[n - k + 1, 2]]];
 (* insert transformed saw segment *)

 flag = 1;
 For[n = 0, n <= L, n++, For[nn = n + 1, nn <= L, nn++,
   If[saw1[[n + 1, 1]] == saw1[[nn + 1, 1]] &&
     saw1[[n + 1, 2]] == saw1[[nn + 1, 2]], flag = 0]]];
 (* check whether saw1 is a saw *)

 If[flag == 1, For[n = 0, n <= L, n++,
   saw[[n + 1, 1]] = saw1[[n + 1, 1]];
   saw[[n + 1, 2]] = saw1[[n + 1, 2]]]]];
(* if it is, return saw1, otherwise do nothing *)

Print[saw]
ListPlot[saw, Joined -> True, AspectRatio -> 1]

Page 22

[Figure: ListPlot of the SAW of length 100 after 2000 pivot steps; both axes run from roughly 0 to 20.]

Page 23

Code-breaking example.

Reference: Diaconis article: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf

A substitution code uses a map f : code space → {usual alphabet, space, comma, period, digits, etc.}, which is unknown. The first step is to download a standard e-text and count the frequencies of the various one-step transitions, thereby getting an estimate of the one-step transition matrix M between consecutive letters. (This is not the MC that we will simulate.) Then we define the plausibility of a particular f by

h(f) = ∏_i M(f(s_i), f(s_{i+1})),

where s₁, s₂, ... is the coded message. Ideally, we would like to find the f that maximizes this function. Instead we will apply MCMC to simulate from the distribution π(f) := h(f) / ∑_g h(g).

Page 24

Now suppose our symbol space has m elements and our alphabet space has n ≥ m elements. Then let S be the set of one-to-one functions f, so |S| = (n)_m = n(n − 1) · · · (n − m + 1), which is large if m = n = 40, say. (We ignore the distinction between upper case and lower case letters.)

What will our MC in S be? Let the proposal chain Q(f, f∗) be determined by requiring that a transition from f correspond to a random switch of two symbols. This Q is symmetric, and we apply the Hastings–Metropolis algorithm to get P.

Start with a preliminary guess, f. Change to f∗ by a random transposition of the values f assigns to two symbols. If h(f∗)/h(f) ≥ 1, accept f∗. If q := h(f∗)/h(f) < 1, accept f∗ with probability q. Otherwise stay with f.
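A compact Python sketch of this chain (my addition; the alphabet, message, and reference matrix in the test below are made up for illustration). Log-plausibility is used so that long messages do not underflow; accepting when exp(new − cur) exceeds a uniform draw is the same probability-q coin flip as above.

```python
import math
import random

def log_plausibility(decoded, logM, index):
    """log h(f) = sum of log M(f(s_i), f(s_{i+1})) over consecutive symbols."""
    return sum(logM[index[a]][index[b]] for a, b in zip(decoded, decoded[1:]))

def mcmc_decrypt(cipher, alphabet, logM, steps=2000, seed=0):
    """Random-transposition Hastings-Metropolis chain on the code maps f."""
    rng = random.Random(seed)
    letters = list(alphabet)
    index = {c: k for k, c in enumerate(letters)}
    f = dict(zip(letters, letters))            # preliminary guess: identity
    decode = lambda g: "".join(g[c] for c in cipher)
    cur = log_plausibility(decode(f), logM, index)
    for _ in range(steps):
        a, b = rng.sample(letters, 2)          # random switch of two symbols
        f[a], f[b] = f[b], f[a]                # propose f*
        new = log_plausibility(decode(f), logM, index)
        if new >= cur or rng.random() < math.exp(new - cur):
            cur = new                          # accept f*
        else:
            f[a], f[b] = f[b], f[a]            # reject: stay with f
    return f
```

With logM estimated from a long reference text, as on the previous page, the accepted maps drift toward high-plausibility decodings while the occasional downhill acceptance keeps the chain out of local maxima.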

Page 25

where s_i runs over consecutive symbols in the coded message. Functions f which have high values of Pl(f) are good candidates for decryption. Maximizing f's were searched for by running the following Markov chain Monte Carlo algorithm:

• Start with a preliminary guess, say f.

• Compute Pl(f).

• Change to f∗ by making a random transposition of the values f assigns to two symbols.

• Compute Pl(f∗); if this is larger than Pl(f), accept f∗.

• If not, flip a Pl(f∗)/Pl(f) coin; if it comes up heads, accept f∗.

• If the coin toss comes up tails, stay at f.

The algorithm continues, trying to improve the current f by making random transpositions. The coin tosses allow it to go to less plausible f's, and keep it from getting stuck in local maxima.

Of course, the space of f's is huge (40! or so). Why should this Metropolis random walk succeed? To investigate this, Marc tried the algorithm out on a problem to which he knew the answer. Figure 2 shows a well-known section of Shakespeare's Hamlet.

Figure 2

The text was scrambled at random and the Monte Carlo algorithm was run. Figure 3 shows sample output.

Figure 3

After 100 steps, the message is a mess. After two thousand steps, the decrypted message makes sense. It stays essentially the same as further steps are tried. I find it remarkable that a few thousand steps of this simple optimization procedure work so well. Over the past few years, friends in math and computer science



Page 27

The Markov Chain Monte Carlo Revolution

Persi Diaconis

Abstract

The use of simulation for high dimensional intractable computations has revolutionized applied mathematics. Designing, improving and understanding the new tools leads to (and leans on) fascinating mathematics, from representation theory through micro-local analysis.

1 Introduction

Many basic scientific problems are now routinely solved by simulation: a fancy "random walk" is performed on the system of interest. Averages computed from the walk give useful answers to formerly intractable problems. Here is an example drawn from course work of Stanford students Marc Coram and Phil Beineke.

Example 1 (Cryptography). Stanford's Statistics Department has a drop-in consulting service. One day, a psychologist from the state prison system showed up with a collection of coded messages. Figure 1 shows part of a typical example.

Figure 1

The problem was to decode these messages. Marc guessed that the code was a simple substitution cipher, each symbol standing for a letter, number, punctuation mark or space. Thus, there is an unknown function f

f : code space → usual alphabet.

One standard approach to decrypting is to use the statistics of written English to guess at probable choices for f, try these out, and see if the decrypted messages make sense.

To get the statistics, Marc downloaded a standard text (e.g., War and Peace) and recorded the first-order transitions: the proportion of consecutive text symbols from x to y. This gives a matrix M(x, y) of transitions. One may then associate a plausibility to f via

Pl(f) = ∏_i M(f(s_i), f(s_{i+1}))

Departments of Mathematics and Statistics, Stanford University

courses have designed homework problems around this example [17]. Students are usually able to successfully decrypt messages from fairly short texts; in the prison example, about a page of code was available.

The algorithm was run on the prison text. A portion of the final result is shown in Figure 4. It gives a useful decoding that seemed to work on additional texts.

Figure 4

I like this example because a) it is real, b) there is no question the algorithm found the correct answer, and c) the procedure works despite the implausible underlying assumptions. In fact, the message is in a mix of English, Spanish and prison jargon. The plausibility measure is based on first-order transitions only. A preliminary attempt with single-letter frequencies failed. To be honest, several practical details have been omitted: we allowed an unspecified "?" symbol in the deviation (with transitions to and from "?" being initially uniform). The display in Figure 4 was 'cleaned up' by a bit of human tinkering. I must also add that the algorithm described has a perfectly natural derivation as Bayesian statistics. The decoding function f is a parameter in a model specifying the message as the output of a Markov chain with known transition matrix M(x, y). With a uniform prior on f, the plausibility function is proportional to the posterior distribution. The algorithm is finding the mode of the posterior.

In the rest of this article, I explain Markov chains and the Metropolis algorithm more carefully in Section 2. A closely related Markov chain on permutations is analyzed in Section 3. The arguments use symmetric function theory, a bridge between combinatorics and representation theory.

A very different example (hard discs in a box) is introduced in Section 4. The tools needed for study are drawn from analysis, micro-local techniques (Section 5) along with functional inequalities (Nash and Sobolev inequalities).

Throughout, emphasis is on analysis of iterates of self-adjoint operators using the spectrum. There are many other techniques used in modern probability. A brief overview, together with pointers on how a beginner can learn more, is in Section 6.

2 A Brief Treatise on Markov Chains

2.1 A finite case

Let X be a finite set. A Markov chain is defined by a matrix K(x, y) with K(x, y) ≥ 0 and ∑_y K(x, y) = 1 for each x. Thus each row is a probability measure, so K can direct a kind of random walk: from x, choose y with probability K(x, y); from y choose z with probability K(y, z), and so on. We refer to the outcomes X₀ = x, X₁ = y, X₂ = z, ... as a run of the chain starting at x. From the definitions, P(X₁ = y | X₀ = x) = K(x, y) and P(X₁ = y, X₂ = z | X₀ = x) = K(x, y)K(y, z). From this, P(X₂ = z | X₀ = x) = ∑_y K(x, y)K(y, z), and so on. The nth power of the matrix has (x, y) entry P(X_n = y | X₀ = x).
