Mixing Dana Randall Georgia Tech A tutorial on Markov chains ( Slides at: randall )

Post on 18-Dec-2015


Mixing

Dana Randall, Georgia Tech

A tutorial on Markov chains

( Slides at: www.math.gatech.edu/~randall )

Outline

Fundamentals for designing a Markov chain

Bounding running times (convergence rates)

Connections to statistical physics

Main Q: What do typical elements look like?

Random sampling can be used to:
• Determine properties of “typical” elements
• Evaluate thermodynamic properties (such as free energy, entropy, …)
• Estimate the cardinality of the set

“Markov chain Monte Carlo”

Markov chains for sampling

Given: A large set (matchings, colorings, independent sets, …)

Andrei Andreyevich Markov 1856-1922

Markov chains

Sampling using Markov chains

State space Ω ( |Ω| ~ c^n )

Step 1. Connect the state space.

E.g., if Ω = indep. sets of a graph G, connect I and I’ iff |I ⊕ I’| = 1 (they differ in exactly one vertex).

Basics of Markov chains

Transitions P: Random walk on H, the graph connecting Ω.

Starting at x:
 - Pick a neighbor y.
 - Move to y with prob. P(x,y) = 1/∆ (∆ = max deg in H).
 - With all remaining prob. stay at x.

Def’n: A MC is ergodic if it is:
• irreducible - for all x,y ∈ Ω, ∃ t: P^t(x,y) > 0 (connected);
• aperiodic - for all x,y ∈ Ω, g.c.d. { t : P^t(x,y) > 0 } = 1 (not bipartite).

(P^t(x,y) is the “t step” transition prob.)

The stationary distribution

Thm: Any finite, ergodic MC converges to a unique stationary distribution π.

Thm: If π satisfies the detailed balance condition

π(x) P(x,y) = π(y) P(y,x) for all x,y ∈ Ω,

then π is the stationary distribution.

So, P symmetric ⇒ π is uniform.
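As a small check of the claim that a symmetric P has a uniform stationary distribution, the sketch below builds the 1/∆ walk on the independent sets of a tiny graph. The example graph (a path on 4 vertices) and the helper names `independent_sets` and `walk_matrix` are assumptions made for illustration; they do not come from the slides.

```python
from itertools import combinations

def independent_sets(n, edges):
    """All independent sets of the graph on vertices 0..n-1, as frozensets."""
    sets = []
    for r in range(n + 1):
        for cand in combinations(range(n), r):
            chosen = set(cand)
            if not any(u in chosen and v in chosen for u, v in edges):
                sets.append(frozenset(cand))
    return sets

def walk_matrix(states):
    """P(x,y) = 1/Delta for neighbors (sets differing in one vertex); the rest stays at x."""
    index = {s: i for i, s in enumerate(states)}
    nbrs = [[index[t] for t in states if len(s ^ t) == 1] for s in states]
    delta = max(len(nb) for nb in nbrs)  # max degree of the connecting graph H
    m = len(states)
    P = [[0.0] * m for _ in range(m)]
    for i, nb in enumerate(nbrs):
        for j in nb:
            P[i][j] = 1.0 / delta
        P[i][i] = 1.0 - len(nb) / delta
    return P

# Hypothetical example graph: the path 0 - 1 - 2 - 3.
states = independent_sets(4, [(0, 1), (1, 2), (2, 3)])
P = walk_matrix(states)
m = len(states)
uniform = [1.0 / m] * m
# One step of the chain started from the uniform distribution.
after_one_step = [sum(uniform[i] * P[i][j] for i in range(m)) for j in range(m)]
```

Here `after_one_step` should agree with `uniform` up to rounding, which is what stationarity of the uniform distribution means.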

Sampling from non-uniform distributions

Q: What if we want to sample from some other distribution?

E.g., for λ > 0, sample ind. set I w/ prob: π(I) = λ^|I| / Z, where Z = ∑_J λ^|J|.

Step 2. Carefully define the transition probabilities.

The Metropolis Algorithm

Propose a move from x to y as before, but accept with probability min (1, π(y)/π(x))

(with remaining probability stay at x).

(MRRTT ’53)

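Check detailed balance: if π(x) ≥ π(y), then P(x,y) = (1/∆) · π(y)/π(x) and P(y,x) = 1/∆, so

π(x) P(x,y) = π(y)/∆ = π(y) P(y,x).

For independent sets: for x = I and y = I ∪ {v},

π(y)/π(x) = (λ^(|I|+1)/Z) / (λ^|I|/Z) = λ,

so add v with prob. min(1, λ) and remove v with prob. min(1, λ^-1).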
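The Metropolis rule can be checked numerically on a toy state space. The sketch below is only an illustration: the 5-cycle, the weights, and the function name `metropolis_matrix` are assumptions, not part of the talk. It builds the transition matrix (propose a neighbor with probability 1/∆, accept with probability min(1, π(y)/π(x))) and tests detailed balance against the normalized weights.

```python
def metropolis_matrix(neighbors, weight):
    """Metropolis chain: propose each neighbor w.p. 1/Delta, accept w.p. min(1, w(y)/w(x))."""
    m = len(neighbors)
    delta = max(len(nb) for nb in neighbors)
    P = [[0.0] * m for _ in range(m)]
    for x in range(m):
        for y in neighbors[x]:
            P[x][y] = (1.0 / delta) * min(1.0, weight[y] / weight[x])
        P[x][x] = 1.0 - sum(P[x])  # all remaining probability: stay at x
    return P

# Hypothetical example: a 5-cycle with unnormalized target weights.
neighbors = [[(x - 1) % 5, (x + 1) % 5] for x in range(5)]
weight = [1.0, 2.0, 4.0, 0.5, 3.0]
Z = sum(weight)
pi = [w / Z for w in weight]
P = metropolis_matrix(neighbors, weight)

# Detailed balance: pi(x) P(x,y) == pi(y) P(y,x) for every pair.
balanced = all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) < 1e-12
               for x in range(5) for y in range(5))
# Stationarity follows: one step from pi gives pi again.
pi_next = [sum(pi[x] * P[x][y] for x in range(5)) for y in range(5)]
```

Only ratios of weights are used, so the normalizing constant Z is never needed by the chain itself; it appears here only to verify the result.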

Q: But for how long do we walk?

Basics continued…

Step 1. Connect the state space.

Step 2. Carefully define the transition probabilities.

Starting at any state x0, take a random walk for some number of steps . . . and output the final state (from π?).

Step 3. Bound the mixing time.

This tells us the number of steps to take.

The mixing rate

Def’n: The total variation distance is

||P^t, π|| = max_{x∈Ω} (1/2) ∑_{y∈Ω} |P^t(x,y) - π(y)|.

Def’n: Given ε, the mixing time is

τ(ε) = min { t : ||P^t’, π|| < ε, ∀ t’ ≥ t }.

A Markov chain is rapidly mixing if τ(ε) is poly(n, log(ε^-1)).
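For a chain small enough to write down, the total variation distance and the mixing time τ(ε) can be computed directly from powers of P. The sketch below assumes a hypothetical example (the lazy walk on a 6-cycle) and hypothetical helper names; it relies on the standard fact that the worst-case distance to stationarity never increases with t, so the first t below ε is the mixing time.

```python
def step(rows, P):
    """Multiply the current rows of P^t by P, giving the rows of P^(t+1)."""
    m = len(P)
    return [[sum(row[k] * P[k][j] for k in range(m)) for j in range(m)] for row in rows]

def tv_to_stationary(rows, pi):
    """max over start x of (1/2) * sum_y |P^t(x,y) - pi(y)|."""
    return max(0.5 * sum(abs(p - q) for p, q in zip(row, pi)) for row in rows)

def mixing_time(P, pi, eps, t_max=10_000):
    """Smallest t whose worst-case total variation distance is below eps."""
    m = len(P)
    rows = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]  # P^0
    for t in range(t_max + 1):
        if tv_to_stationary(rows, pi) < eps:
            return t
        rows = step(rows, P)
    raise RuntimeError("did not mix within t_max steps")

# Hypothetical example: lazy walk on a 6-cycle (stay w.p. 1/2, else step left or right).
m = 6
P = [[0.0] * m for _ in range(m)]
for x in range(m):
    P[x][x] = 0.5
    P[x][(x + 1) % m] += 0.25
    P[x][(x - 1) % m] += 0.25
pi = [1.0 / m] * m
tau = mixing_time(P, pi, eps=0.01)
```

This brute-force approach only works when |Ω| is tiny; for |Ω| ~ c^n the matrix cannot be stored, which is why the analytic techniques in the rest of the talk are needed.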

Spectral gap

Let 1 = λ_1 > |λ_2| ≥ … ≥ |λ_|Ω|| be the eigenvalues of P.

Def’n: Gap(P) = 1 - |λ_2| is the spectral gap.

Mixing rate ⇔ Spectral gap

Thm: (Alon, Alon-Milman, Sinclair)

τ(ε) ≤ (1/Gap(P)) log ( 1/(π* ε) ),

τ(ε) ≥ (|λ_2| / (2 Gap(P))) log ( 1/(2ε) ),

where π* = min_x π(x).
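To make the spectral-gap bounds concrete, here is a rough numerical sketch. It assumes a symmetric P (so π is uniform and the eigenvalue-1 direction is the all-ones vector), takes π* to be min_x π(x) as in the usual statement of the theorem, and estimates |λ_2| by power iteration; the function name and the 6-cycle example are hypothetical.

```python
import math

def second_eigenvalue_abs(P, iters=2000):
    """|lambda_2| of a symmetric stochastic P: power iteration orthogonal to all-ones."""
    m = len(P)
    v = [float(i + 1) for i in range(m)]  # arbitrary start, not parallel to all-ones
    estimate = 0.0
    for _ in range(iters):
        mean = sum(v) / m
        v = [a - mean for a in v]                     # project out the eigenvalue-1 direction
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
        w = [sum(P[i][j] * v[j] for j in range(m)) for i in range(m)]
        estimate = math.sqrt(sum(a * a for a in w))   # ||P v|| tends to |lambda_2|
        v = w
    return estimate

# Hypothetical example: lazy walk on a 6-cycle.
m = 6
P = [[0.0] * m for _ in range(m)]
for x in range(m):
    P[x][x] = 0.5
    P[x][(x + 1) % m] += 0.25
    P[x][(x - 1) % m] += 0.25

lam2 = second_eigenvalue_abs(P)
gap = 1.0 - lam2
eps = 0.01
pi_star = 1.0 / m
upper = (1.0 / gap) * math.log(1.0 / (pi_star * eps))       # upper bound on tau(eps)
lower = (lam2 / (2.0 * gap)) * math.log(1.0 / (2.0 * eps))  # lower bound on tau(eps)
```

For this chain the eigenvalues are (1 + cos(2πk/6))/2, so the gap should come out close to 1/4. A non-symmetric reversible chain would need to be symmetrized first, which this sketch does not do.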


Outline for rest of talk

Techniques:

•Coupling

•Flows and paths

•Indirect methods

Problems:

•Walk on the hypercube

•Colorings

•Matchings

•Independent sets

•Connections with statistical physics: - problems - algorithms - physical insights

Coupling

Simulate 2 processes:
• Start at any x0 and y0.
• Couple moves, but each simulates the MC.
• Once they agree, they move in sync (xt = yt ⇒ xt+1 = yt+1).

Def’n: A coupling is a MC on Ω x Ω:
1) Each process Xt, Yt is a faithful copy of the original MC,
2) If Xt = Yt, then Xt+1 = Yt+1.

The coupling time T is:

T = max_{x,y} ( E [ T^{x,y} ] ), where T^{x,y} = min { t : Xt = Yt | X0 = x, Y0 = y }.

Thm: τ(ε) ≤ T e ln ε^-1. (Aldous ’81)

Ex1: Walk on the hypercube

MCCUBE:
• Start at v0 = (0,0,…,0).
• Repeat:
 - Pick i ∈ [n], b ∈ {0,1}.
 - Set vi = b.

Symmetric, ergodic ⇒ π is uniform.

Mixing time? Use coupling (both chains use the same i and b):

x0 = 0 1 1 0 0 1    y0 = 1 1 1 0 0 0
i=2, b=0: x1 = 0 0 1 0 0 1    y1 = 1 0 1 0 0 0
i=6, b=1: x2 = 0 0 1 0 0 1    y2 = 1 0 1 0 0 1
. . .
i=1, b=1: xt = 1 0 1 1 1 0    yt = 1 0 1 1 1 0

The chains agree once every coordinate has been picked, so T = n log n (coupon collecting)

⇒ τ(ε) = O( n ln (n ε^-1) ).
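The coupon-collecting estimate for the hypercube coupling can be checked by simulation. This is a minimal sketch under stated assumptions: both copies use the same random (i, b) at every step, the start states are all-zeros and all-ones (so every coordinate must be hit), and n = 8, the seed, and the trial count are arbitrary choices for illustration.

```python
import random

def coupling_time(n, x, y, rng):
    """Run two coupled MCCUBE copies (same i and b in both) until they agree."""
    x, y = list(x), list(y)
    t = 0
    while x != y:
        i = rng.randrange(n)   # coordinate to refresh
        b = rng.randrange(2)   # new bit value, shared by both chains
        x[i] = b
        y[i] = b
        t += 1
    return t

rng = random.Random(0)  # fixed seed, only so the sketch is reproducible
n = 8
x0 = [0] * n
y0 = [1] * n            # worst case: every coordinate disagrees
trials = 2000
mean_T = sum(coupling_time(n, x0, y0, rng) for _ in range(trials)) / trials

# Coupon collector: expected time to pick all n coordinates is n * H_n, about n ln n.
expected = n * sum(1.0 / k for k in range(1, n + 1))
```

The empirical `mean_T` should land close to `expected`, since the two copies agree exactly when every coordinate has been chosen at least once.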


Ex 2: Colorings

Given: A graph G (max deg d), k > 1.

Goal: Find a random k-coloring of G.

MCCOL: (Single point replacement)
• Starting at some k-coloring C0
• Repeat:
 - With prob 1/2 do nothing (the “lazy” chain).
 - Pick v ∈ V, c ∈ [k];
 - Recolor v with c, if possible.

Note: If k ≥ d + 1, colorings exist. (Greedy)

If k ≥ d + 2, then the state space is connected. (Since MCCOL is also symmetric, π is uniform.)
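A direct transcription of MCCOL might look like the following sketch. The graph (a 6-cycle, d = 2), the number of colors, the step count, and the function names are assumptions chosen for illustration; the starting coloring comes from the greedy procedure mentioned in the note.

```python
import random

def mccol_step(coloring, adj, k, rng):
    """One lazy single-site step: w.p. 1/2 do nothing, else try to recolor a random vertex."""
    if rng.random() < 0.5:
        return
    v = rng.randrange(len(adj))
    c = rng.randrange(k)
    if all(coloring[u] != c for u in adj[v]):  # "if possible": c is unused on N(v)
        coloring[v] = c

def greedy_coloring(adj, k):
    """A starting proper coloring; always succeeds when k >= d + 1."""
    coloring = [None] * len(adj)
    for v in range(len(adj)):
        used = {coloring[u] for u in adj[v]}
        coloring[v] = next(c for c in range(k) if c not in used)
    return coloring

# Hypothetical example: a 6-cycle (d = 2) with k = 6 >= 3d colors.
n, k = 6, 6
adj = [[(v - 1) % n, (v + 1) % n] for v in range(n)]
rng = random.Random(1)
coloring = greedy_coloring(adj, k)
for _ in range(10_000):
    mccol_step(coloring, adj, k, rng)
is_proper = all(coloring[u] != coloring[v] for v in range(n) for u in adj[v])
```

Every move keeps the coloring proper, so the walk never leaves Ω; how many steps are enough is exactly the mixing-time question.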

Path Coupling

Coupling: Show for all x,y ∈ Ω, E[ ∆(dist(x,y)) ] ≤ 0.

Path coupling: Show for all u,v s.t. dist(u,v) = 1, that E[ ∆(dist(u,v)) ] ≤ 0. [Bubley, Dyer, Greenhill ’97-8]

Consider a shortest path: x = z0, z1, z2, . . . , zr = y, with dist(zi,zi+1) = 1 and dist(x,y) = r. Then

E[ ∆(dist(x,y)) ] ≤ ∑_i E[ ∆(dist(zi,zi+1)) ] ≤ 0.

Path coupling for MCCOL

Thm: MCCOL is rapidly mixing if k ≥ 3d. (Jerrum ‘95)

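Pf: Use path coupling: dist(x,y) = 1, i.e. x and y differ only at one vertex w.

Cases (the chain picks vertex v and color c):
 - v = w, c not among the colors on N(w): ∆dist = -1;
 - v ∈ N(w), c ∈ {x(w), y(w)}: ∆dist = +1 (or 0);
 - o.w.: ∆dist = 0.

E[∆dist] ≤ (1/2nk) ( (k-d)(-1) + 2d(+1) ) = (1/2nk) (3d-k) ≤ 0.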

Summary: Coupling

Pros: Can yield very easy proofs

Cons: Demands a lot from the chain

Extensions:
 - Careful coupling (k ≥ 2d) (Jerrum ’95)
 - Change the MC (Luby-R-Sinclair ’95)
 - “Macromoves”: burn in (Dyer-Frieze ’01, Molloy ’02); non-Markovian couplings (Hayes-Vigoda ’03)


Conductance and flows

Conductance (Jerrum-Sinclair ’88)

Φ = min_{S ⊆ Ω, π(S) ≤ 1/2} Φ(S), where

Φ(S) = ∑_{s∈S, s’∈S^C} π(s) P(s,s’) / ∑_{s∈S} π(s).

Thm: Φ²/2 ≤ Gap(P) ≤ 2Φ.

Min cut ⇔ Max flow

Canonical paths (Sinclair ’92)

Γ: Make |Ω|² canonical paths γ_xy: from x ∈ Ω to y ∈ Ω, x ≠ y, carrying π(x)π(y) units of flow.

Capacity of e = (u,v): Q(e) = π(u) P(u,v) = π(v) P(v,u).

The congestion of these paths is:

ρ(Γ) = max_e (1/Q(e)) ∑_{γ_xy ∋ e} π(x) π(y),

ρ̄ = min_Γ ρ(Γ) l(Γ) ( l(Γ) is the max path length ).

Thm: Starting from x, τ(ε) ≤ ρ̄ log (ε π(x))^-1.
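On a tiny chain the conductance Φ = min over S with π(S) ≤ 1/2 of (flow out of S)/π(S) can be found by enumerating every cut, and the bound Φ²/2 ≤ Gap(P) ≤ 2Φ checked directly. The sketch below is illustrative only: the lazy 6-cycle is a hypothetical example, and its gap is taken from the known eigenvalues (1 + cos(2πk/m))/2 of the lazy cycle walk rather than computed numerically.

```python
import math
from itertools import combinations

def conductance(P, pi):
    """Brute-force Phi: minimize (flow leaving S) / pi(S) over all S with pi(S) <= 1/2."""
    m = len(P)
    best = float("inf")
    for r in range(1, m):
        for S in combinations(range(m), r):
            mass = sum(pi[s] for s in S)
            if mass > 0.5 + 1e-12:
                continue
            inside = set(S)
            flow = sum(pi[s] * P[s][t] for s in S for t in range(m) if t not in inside)
            best = min(best, flow / mass)
    return best

# Hypothetical example: lazy walk on a 6-cycle.
m = 6
P = [[0.0] * m for _ in range(m)]
for x in range(m):
    P[x][x] = 0.5
    P[x][(x + 1) % m] += 0.25
    P[x][(x - 1) % m] += 0.25
pi = [1.0 / m] * m

phi = conductance(P, pi)
# Eigenvalues of the lazy cycle walk are (1 + cos(2*pi*k/m)) / 2 for k = 0..m-1.
lam2 = max(abs(0.5 + 0.5 * math.cos(2 * math.pi * k / m)) for k in range(1, m))
gap = 1.0 - lam2
```

The minimizing cut here splits the cycle into two arcs of three vertices, giving Φ = 1/6. The enumeration is exponential in |Ω|, so it is a sanity check rather than a method.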

Ex 3: Back to the hypercube

- Define a canonical path from s to t (fix the differing bits from left to right):

s = 0 1 1 0 0 1
u = 1 1 1 0 0 1
v = 1 1 0 0 0 1
t = 1 1 0 0 0 0

- Bound the number of paths through (u,v) ∈ E, using the complementary pair:

u’ = 0 1 0 0 0 0
v’ = 0 1 1 0 0 0

- The complementary pair (u’,v’) determines (s,t), so |{ γ_xy ∋ e }| = 2^(n-1).

ρ(Γ) = max_e ∑_{γ_xy ∋ e} π(x) π(y) / Q(e) = 2^(n-1) 2^(-2n) / ( 2^(-n) (1/2n) ) = n,

and l = n ⇒ ρ̄ = Õ(n²).


Ex 4: Sampling matchings

MCMATCH: Starting at M0, repeat: Pick e = (u,v) ∈ E
 - If e ∈ M, remove e;
 - If u and v unmatched in M, add e;
 - If u matched (by e’) and v unmatched (or vice versa), add e and remove e’;
 - Otherwise do nothing.

Thm: Coupling won’t work! (Kumar-Ramesh ’99)
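The three MCMATCH moves (remove, add, slide) can be sketched as below. The representation of a matching as a set of sorted vertex pairs, the 4-cycle example, and the names `mcmatch_step` and `is_matching` are assumptions for illustration.

```python
import random

def mcmatch_step(matching, edges, rng):
    """One MCMATCH move on `matching`, a set of edges stored as sorted tuples."""
    e = edges[rng.randrange(len(edges))]
    u, v = e
    if e in matching:
        matching.remove(e)                   # e in M: remove it
        return
    at_u = [f for f in matching if u in f]   # edge of M covering u, if any
    at_v = [f for f in matching if v in f]   # edge of M covering v, if any
    if not at_u and not at_v:
        matching.add(e)                      # both endpoints free: add e
    elif bool(at_u) != bool(at_v):
        matching.remove((at_u or at_v)[0])   # exactly one endpoint matched by e':
        matching.add(e)                      # remove e' and add e
    # otherwise both endpoints are matched by other edges: do nothing

def is_matching(matching):
    """True if no vertex is covered by more than one edge."""
    covered = [w for f in matching for w in f]
    return len(covered) == len(set(covered))

# Hypothetical example: the 4-cycle 0 - 1 - 2 - 3 - 0.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
rng = random.Random(2)
matching = set()
always_valid = True
sizes_seen = set()
for _ in range(5000):
    mcmatch_step(matching, edges, rng)
    always_valid = always_valid and is_matching(matching)
    sizes_seen.add(len(matching))
```

The slide move changes the matching without changing its size, which is what lets the chain move between matchings of the same cardinality.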

Mixing time of MCMATCH

[Figure: a canonical path from matching s to matching t passing through the transition (u,v).]

Paths using (u,v) are determined by u’ . . . as before.


Ex 5: Independent Sets

Goal: Given λ, sample ind. set I with prob: π(I) = λ^|I|/Z, Z = ∑_J λ^|J|.

MCIND: Starting at I0, Repeat:
 - Pick v ∈ V and b ∈ {0,1};
 - If v ∈ I, b=0, remove v w.p. min (1, λ^-1);
 - If v ∉ I, b=1, add v w.p. min (1, λ) if possible;
 - O.w. do nothing.
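MCIND is short enough to simulate and compare against the exact distribution on a toy graph. In the sketch below the graph (a path on 3 vertices), λ = 2, the seed, and the step count are all assumptions for illustration; for this graph Z = 1 + 3λ + λ², so the exact mean size of a sample is (3λ + 2λ²)/Z.

```python
import random

def mcind_step(ind_set, adj, lam, rng):
    """One MCIND step: pick v and b, then remove w.p. min(1, 1/lam) or add w.p. min(1, lam)."""
    v = rng.randrange(len(adj))
    b = rng.randrange(2)
    if v in ind_set and b == 0:
        if rng.random() < min(1.0, 1.0 / lam):
            ind_set.remove(v)
    elif v not in ind_set and b == 1:
        # "if possible": adding v must keep the set independent
        if all(u not in ind_set for u in adj[v]) and rng.random() < min(1.0, lam):
            ind_set.add(v)

# Hypothetical example: the path 0 - 1 - 2 with lam = 2.
rng = random.Random(3)
adj = [[1], [0, 2], [1]]
lam = 2.0
ind_set = set()
steps = 200_000
total_size = 0
always_independent = True
for _ in range(steps):
    mcind_step(ind_set, adj, lam, rng)
    total_size += len(ind_set)
    always_independent = always_independent and all(
        u not in ind_set for v in ind_set for u in adj[v])
mean_size = total_size / steps

# Exact values: independent sets are {}, {0}, {1}, {2}, {0,2}.
Z = 1 + 3 * lam + lam ** 2
exact_mean = (3 * lam + 2 * lam ** 2) / Z
```

The time average `mean_size` should be close to `exact_mean`, though the samples along one trajectory are correlated, so the agreement is only approximate.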

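Slow mixing of MCIND (large λ)

[Figure: independent sets on the n x n grid arranged by #R/#B (occupied even vs. odd sites), from 0 through 1 to ∞, with a cut separating S from S^C.]

Large λ ⇒ there is a “bad cut,” . . . so MCIND is slowly mixing.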

Summary: Flows

Pros: Offers a combinatorial approach to mixing; especially useful for proving slow mixing.

Cons: Requires global knowledge of the chain to spread out paths.

Extensions: Balanced flows (Morris-Sinclair ’99)

MCMC -- Major highlights:
 - The permanent (Jerrum-Sinclair-Vigoda ’02)
 - Volume of a convex polytope (Dyer-Frieze-Kannan ’89, +… )


Comparison (Diaconis, Saloff-Coste ’93)

Two chains on the same state space: P (unknown mixing time) and P̃ (known).

For each edge (x,y) ∈ P̃, make a path γ_xy using edges in P.

Let Γ(z,w) be the set of paths γ_xy using (z,w).

Thm: Gap(P) ≥ (1/A) Gap(P̃), where

A = max_e (1/Q(e)) ∑_{γ_xy ∋ e} |γ_xy| π(x) P̃(x,y).

(S, S^C) cannot be a bad cut in P if it isn’t in P̃.

Comparison, aka . . . “The (Adjacency) Matrix Reloaded”

Disjoint decomposition (Madras-R. ’96, Martin-R. ’00)

Partition Ω into disjoint pieces A1, A2, . . . , A6.

Restrictions: Pi is P restricted to Ai (e.g., P3 on A3).

Projection: P̄ on the states a1, a2, . . . , a6, with

π̄(ai) = π(Ai),

P̄(ai,aj) = (1/π(Ai)) ∑_{x∈Ai, y∈Aj} π(x) P(x,y).

Thm: Gap(P) ≥ (1/2) Gap(P̄) (min_i Gap(Pi)).

Ex 6: MCIND on small ind. sets

For G=(V,E): Let Ω = ind. sets of G; Ωk = ind. sets of size k.

Thm: MCIND is rapidly mixing on ∪_{k=0}^{K} Ωk, where K = |V|/2(∆+1).

* Consider first the “swap” chain:

MCSWAP: Starting at I0, Repeat:
 - Pick (u,v,b) ∈ V x V x {0,1,2};
 - If b=0 and u ∈ I, remove u w.p. min (1, λ^-1);
 - If b=1 and u ∉ I, add u w.p. min (1, λ) if possible;
 - If b=2, remove u and add v (if possible);
 - O.w. do nothing.

Ind. sets w/bounded size (cont.)

Thm: MCIND is rapidly mixing on ∪_{k=0}^{K} Ωk, where K = |V|/2(∆+1).

Decompose ∪_k Ωk (first for MCSWAP):
 - Restrictions: Ω0, Ω1, Ω2, . . . , ΩK-1, ΩK
 - Projection: a0, a1, a2, . . . , aK-1, aK

|Ωk| is logconcave, . . . so the projection P̄ is rapidly mixing.

The Restrictions of MCSWAP

Thm: MCSWAP is rapidly mixing on Ωk, k < K. (Bubley-Dyer ’97)

Thm: MCSWAP is rapidly mixing on ∪_{k=0}^{K} Ωk. (Decomposition)

Cor: MCIND is rapidly mixing on ∪_{k=0}^{K} Ωk. (Comparison)

Summary: Indirect methods

Pros: Offer a top down approach; allow hybrid methods to be used.

Cons: Can increase the complexity.

Extensions:
 - Comparison thm for log-Sobolev (Diaconis-Saloff-Coste ’96)
 - Comparison for Glauber dynamics (R.-Tetali ’98)
 - Decomposition for log-Sobolev (Jerrum-Son-Tetali-Vigoda ’02)


Why Statistical Physics?

• They have a need for sampling
• Use many interesting heuristics
• Great intuition
• Experts on “large data sets”

Microscopic details ⇒ Macroscopic behavior (i.e., phase transitions)

Models from statistical physics

• Potts model (3-colorings)
• Hardcore model (Independent sets)
• Dimer model (Matchings)
• Ising model (Min cut)

Models (cont.)

Independent sets: π(I) = λ^|I|/Z

Matchings: π(M) = λ^|M|/Z

Ising model: π(σ) = λ^|E=(σ)|/Z, where E=(σ) = { (u,v) ∈ E : σ(u) = σ(v) }

(E = E= ∪ E≠)

Models: (The physics perspective)

Given: A physical system Ω = {σ}.

Define: A Gibbs measure as follows:

π(σ) = e^(-βH(σ)) / Z,

where H(σ) is the Hamiltonian, β = 1/kT is the inverse temperature, and Z = ∑_σ e^(-βH(σ)) is the normalizing constant or partition function.

Independent sets: H(σ) = -|I|. If λ = e^β then π(σ) = λ^|I| / Z.

Ising model: H(σ) = -∑_{(u,v)∈E} σu σv. If λ = e^(2β) then π(σ) = λ^|E=| / Z.
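The step from the Ising Hamiltonian to λ = e^(2β) can be spelled out as a short derivation. This is a sketch under the assumption that spins take values σ_u ∈ {+1, -1}, with E_=(σ) and E_≠(σ) denoting the edges whose endpoints agree or disagree:

```latex
\begin{align*}
-\beta H(\sigma) &= \beta \sum_{(u,v)\in E} \sigma_u \sigma_v
  = \beta\bigl(|E_{=}(\sigma)| - |E_{\neq}(\sigma)|\bigr)
  = \beta\bigl(2|E_{=}(\sigma)| - |E|\bigr), \\
\pi(\sigma) &= \frac{e^{-\beta H(\sigma)}}{Z}
  = \frac{e^{-\beta |E|}\,\bigl(e^{2\beta}\bigr)^{|E_{=}(\sigma)|}}{Z}
  \;\propto\; \lambda^{|E_{=}(\sigma)|}, \qquad \lambda = e^{2\beta}.
\end{align*}
```

The factor e^(-β|E|) does not depend on σ, so it is absorbed into the normalizing constant.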

Physics perspective (cont.)

Q: What about on the infinite lattice? Use conditional probabilities:

But there can be boundary effects !!!

Phase transitions: Ind. sets

High temperature (T → ∞): ∂ (boundary) effects die out

Low temperature (T → 0): long range effects

[Figure: temperature axis from T∞ to T0, with the two regions separated at Tc.]

Tc indicates a “phase transition.”

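Slow mixing of MCIND revisited

[Figure: independent sets on the n x n grid arranged by #R/#B, from 0 through 1 to ∞, with the cut between S and S^C.]

π(Si) = ∑_{s∈Si} π(s) = ∑_{s∈Si} e^(-βH(s))/Z

(“Entropy term”: the number of s in Si. “Energy term”: the weight e^(-βH(s))/Z.)

Group by # of “fault lines”: SR, S1, S2, S3, . . . , SB

Fault lines are vacant paths of width 2 from top to bottom (or left to right).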

“Peierls Argument”

For fixed path length l, define a map S1 → SB x 2^(n/2) x 3^l:

1. Identify horizontal or vertical fault line. (σ ∈ S1)

2. Shift right of fault by 1 and flip colors.

3. Remove rt column; add points along fault line, if possible. (σ’ ∈ SB)

Peierls Argument cont.

At most 2^(n/2) 3^l configurations of S1 map to each configuration of SB, and the image has ≥ l - n/2 more points. So

π(S1) = ∑_{σ∈S1} π(σ)

≤ ∑_l ∑_{σ∈SB} π(σ) 2^(n/2) 3^l λ^(n/2-l)

≤ π(SB) 2^(n/2) 3^n λ^(-n/2) (poly(n))

≤ π(SB) (18/λ)^(n/2) (poly(n)),

which is exponentially small if λ > 18

(and similarly for S2, S3, …).

Conclusions

Techniques:
• Coupling: can be easy when it works
• Flows: requires global knowledge of chain; very useful for slow mixing
• Indirect methods: top down approach; often increases complexity
• Connection to physics: can offer tremendous insights

Open problems:
• Sampling 4,5,6-colorings on the grid.
• Sampling perfect matchings on non-bipartite graphs.
• Sampling acyclic orientations in a graph.
• Sampling configurations of the Potts model (a generalization of Ising, but with more colors).
• How can we further exploit phase transitions? Other physical intuition?