Transcript of: Directed and Undirected Graphical Models (mlg.eng.cam.ac.uk/adrian/MLSALT4-AW2-models.pdf)

Page 1:

Directed and Undirected Graphical Models

Adrian Weller

MLSALT4 LectureFeb 26, 2016

With thanks to David Sontag (NYU) and Tony Jebara (Columbia)for use of many slides and illustrations

For more information, seehttp://mlg.eng.cam.ac.uk/adrian/

1 / 29

Page 2:

High level overview of our 3 lectures

1. LP relaxations for MAP inference (last week)

2. Directed and undirected graphical models (today)

3. Junction tree algorithm for exact inference, belief propagation,variational methods for approximate inference (Monday)

Further reading / viewing:

Murphy, Machine Learning: a Probabilistic Perspective

Barber, Bayesian Reasoning and Machine Learning

Bishop, Pattern Recognition and Machine Learning

Koller and Friedman, Probabilistic Graphical Modelshttps://www.coursera.org/course/pgm

Wainwright and Jordan, Graphical Models, Exponential Families,and Variational Inference

2 / 29

Page 3:

Background: the marginal polytope (all valid marginals)

[Figure: the marginal polytope for a three-node binary MRF, showing two vertices µ and µ′ (the stacked node and edge indicator vectors for two global assignments), their midpoint ½(µ + µ′), and the region of valid marginal probabilities (Wainwright & Jordan, '03).]

Figure 2-1: Illustration of the marginal polytope for a Markov random field with three nodes that have states in {0, 1}. The vertices correspond one-to-one with global assignments to the variables in the MRF. The marginal polytope is alternatively defined as the convex hull of these vertices, where each vertex is obtained by stacking the node indicator vectors and the edge indicator vectors for the corresponding assignment.

2.2 The Marginal Polytope

At the core of our approach is an equivalent formulation of inference problems in terms of an optimization over the marginal polytope. The marginal polytope is the set of realizable mean vectors µ that can arise from some joint distribution on the graphical model:

M(G) = {µ ∈ R^d | ∃ θ ∈ R^d s.t. µ = E_{Pr(x;θ)}[φ(x)]}   (2.7)

Said another way, the marginal polytope is the convex hull of the φ(x) vectors, one for each assignment x ∈ χ^n to the variables of the Markov random field. The dimension d of φ(x) is a function of the particular graphical model. In pairwise MRFs where each variable has k states, each variable assignment contributes k coordinates to φ(x) and each edge assignment contributes k² coordinates to φ(x). Thus, φ(x) will be of dimension k|V| + k²|E|.

We illustrate the marginal polytope in Figure 2-1 for a binary-valued Markov random field on three nodes. In this case, φ(x) is of dimension 2 · 3 + 2² · 3 = 18. The figure shows two vertices corresponding to the assignments x = (1, 1, 0) and x′ = (0, 1, 0). The vector φ(x) is obtained by stacking the node indicator vectors for each of the three nodes, and then the edge indicator vectors for each of the three edges. φ(x′) is analogous. There should be a total of 8 vertices (the 2-dimensional sketch is inaccurate in this respect), one for each assignment to the MRF.
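As a concrete sanity check (not part of the slides), the following minimal Python sketch builds the indicator vector φ(x) for this three-node binary MRF and confirms that it has dimension 2 · 3 + 2² · 3 = 18, with one vertex per joint assignment; the node/edge lists and function names are illustrative.

    # Sketch: build phi(x) for the three-node binary MRF of Figure 2-1
    # and check its dimension is 18, with 2^3 = 8 vertices.
    import itertools
    import numpy as np

    nodes = [0, 1, 2]
    edges = [(0, 1), (1, 2), (0, 2)]
    k = 2  # binary variables

    def phi(x):
        """Stack node indicators (k per node) and edge indicators (k^2 per edge)."""
        vec = []
        for i in nodes:
            ind = np.zeros(k)
            ind[x[i]] = 1.0
            vec.append(ind)
        for (i, j) in edges:
            ind = np.zeros((k, k))
            ind[x[i], x[j]] = 1.0
            vec.append(ind.ravel())
        return np.concatenate(vec)

    vertices = [phi(x) for x in itertools.product(range(k), repeat=len(nodes))]
    print(len(vertices), vertices[0].shape)  # 8 vertices, each of dimension 18
    # Any convex combination of these vertices is a valid marginal vector mu.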

Any point inside the marginal polytope corresponds to the vector of node and edge marginals for some graphical model with the same sufficient statistics. By construction, the



3 / 29

Page 4:

Stylized illustration of polytopes

[Figure: nested relaxations of the marginal polytope]

marginal polytope M = L_n: global consistency
. . .
triplet polytope L_3: triplet consistency
local polytope L_2: pair consistency

More accurate ↔ Less accurate

More computationally intensive ↔ Less computationally intensive

4 / 29

Page 5:

When is MAP inference (relatively) easy?

Tree

5 / 29

Page 6:

Key ideas for graphical models

1 Represent the world as a collection of random variables X1, . . . , Xn with joint distribution p(X1, . . . , Xn)

2 Learn the distribution from data

3 Perform inference (typically MAP or marginal)

6 / 29

Page 7:

Challenges

1 Represent the world as a collection of random variables X1, . . . , Xn with joint distribution p(X1, . . . , Xn)

How can we compactly describe this joint distribution?
Directed graphical models (Bayesian networks)
Undirected graphical models (Markov random fields, factor graphs)

2 Learn the distribution from data

Maximum likelihood estimation, other methods?
How much data do we need?
How much computation does it take?

3 Perform inference (typically MAP or marginal)

Exact inference: Junction Tree Algorithm
Approximate inference (belief propagation, variational methods...)

7 / 29

Page 8:

How can we compactly describe the joint distribution?

Example: Medical diagnosis

Binary variable for each symptom (e.g. “fever”, “cough”, “fast breathing”, “shaking”, “nausea”, “vomiting”)

Binary variable for each disease (e.g. “pneumonia”, “flu”, “common cold”, “bronchitis”, “tuberculosis”)

Diagnosis is performed by inference in the model:

p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

One famous model, Quick Medical Reference (QMR-DT), has 600 diseases and 4000 symptoms

8 / 29

Page 9:

Representing the distribution

Naively, we could represent the distribution with a big table of probabilities for every possible outcome

How many outcomes are there in QMR-DT? 2^4600

Learning of the distribution would require a huge amount of data

Inference of conditional probabilities, e.g.

p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

would require summing over exponentially many values

Moreover, gives no way to make predictions with previously unseen observations

We need structure

9 / 29

Page 10:

Structure through independence

If X1, . . . ,Xn are independent, then

p(x1, . . . , xn) = p(x1)p(x2) · · · p(xn)

For binary variables, probabilities for 2^n outcomes can be described by how many parameters?

n

However, this is not a very useful model – observing a variable Xi cannot influence our predictions of Xj

Instead: if X1, . . . , Xn are conditionally independent given Y, denoted as Xi ⊥ X−i | Y, then

p(y, x1, . . . , xn) = p(y) ∏_{i=1}^n p(xi | y)

This is a simple yet powerful model

10 / 29

Page 11: Directed and Undirected Graphical Modelsmlg.eng.cam.ac.uk/adrian/MLSALT4-AW2-models.pdf · 2. Directed and undirected graphical models (today) 3. Junction tree algorithm for exact

Structure through independence

If X1, . . . ,Xn are independent, then

p(x1, . . . , xn) = p(x1)p(x2) · · · p(xn)

For binary variables, probabilities for 2n outcomes can be describedby how many parameters? n

However, this is not a very useful model – observing a variable Xi

cannot influence our predictions of Xj

Instead: if X1, . . . ,Xn are conditionally independent given Y ,denoted as Xi ⊥ X−i | Y , then

p(y , x1, . . . , xn) = p(y)n∏

i=1

p(xi | y)

This is a simple yet powerful model

10 / 29

Page 12:

Example: naive Bayes for classification

Classify e-mails as spam (Y = 1) or not spam (Y = 0)
Let i ∈ {1, . . . , n} index the words in our vocabulary
Xi = 1 if word i appears in an e-mail, and 0 otherwise
E-mails are drawn according to some distribution p(Y, X1, . . . , Xn)

Suppose that the words are conditionally independent given Y. Then,

p(y, x1, . . . , xn) = p(y) ∏_{i=1}^n p(xi | y)

Easy to learn the model with maximum likelihood. Predict with:

p(Y = 1 | x1, . . . , xn) = p(Y = 1) ∏_{i=1}^n p(xi | Y = 1) / [ ∑_{y∈{0,1}} p(Y = y) ∏_{i=1}^n p(xi | Y = y) ]

Is conditional independence a reasonable assumption?

A model may be “wrong” but still useful
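A minimal sketch of this classifier (not from the slides): maximum likelihood fitting by counting and the prediction formula above, on toy word-indicator data. The variable names are illustrative, and Laplace smoothing is added so unseen words do not produce zero probabilities.

    # Sketch: maximum likelihood naive Bayes, following p(y, x) = p(y) * prod_i p(x_i | y).
    import numpy as np

    X = np.array([[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]])  # word indicators
    y = np.array([1, 1, 0, 0])                                   # 1 = spam

    def fit_nb(X, y, alpha=1.0):
        p_y = np.array([np.mean(y == 0), np.mean(y == 1)])
        # p(x_i = 1 | y) for each word i and class y, with smoothing
        p_x_given_y = np.stack([
            (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
            for c in (0, 1)
        ])
        return p_y, p_x_given_y

    def predict_spam_prob(x, p_y, p_x_given_y):
        # p(Y=1 | x) = p(Y=1) prod_i p(x_i|Y=1) / sum_y p(Y=y) prod_i p(x_i|Y=y)
        joint = [p_y[c] * np.prod(np.where(x == 1, p_x_given_y[c], 1 - p_x_given_y[c]))
                 for c in (0, 1)]
        return joint[1] / sum(joint)

    p_y, p_x_given_y = fit_nb(X, y)
    print(predict_spam_prob(np.array([1, 1, 0]), p_y, p_x_given_y))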

11 / 29


Page 14:

Directed graphical models = Bayesian networks

A Bayesian network is specified by a directed acyclic graph

DAG = (V, E⃗) with:

1 One node i ∈ V for each random variable Xi

2 One conditional probability distribution (CPD) per node, p(xi | xPa(i)), specifying the variable's probability conditioned on its parents' values

The DAG corresponds 1-1 with a particular factorization of the joint distribution:

p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))
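A small sketch of this factorization (an illustrative example, not from the slides): evaluating the joint of a hypothetical binary network A → B → C as a product of CPD table lookups.

    # Sketch: evaluate p(x) = prod_i p(x_i | x_Pa(i)) for a toy binary network A -> B -> C.
    # CPDs are stored as tables indexed by the parents' values.
    p_A = {(): [0.6, 0.4]}                       # p(A)
    p_B = {(0,): [0.7, 0.3], (1,): [0.2, 0.8]}   # p(B | A)
    p_C = {(0,): [0.9, 0.1], (1,): [0.5, 0.5]}   # p(C | B)

    parents = {"A": (), "B": ("A",), "C": ("B",)}
    cpds = {"A": p_A, "B": p_B, "C": p_C}

    def joint(assignment):
        """Probability of a full assignment, e.g. {"A": 1, "B": 0, "C": 1}."""
        prob = 1.0
        for var, pa in parents.items():
            pa_vals = tuple(assignment[p] for p in pa)
            prob *= cpds[var][pa_vals][assignment[var]]
        return prob

    print(joint({"A": 1, "B": 0, "C": 1}))  # 0.4 * 0.2 * 0.1 = 0.008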

12 / 29

Page 15:

Bayesian networks are generative models

[Figure: naive Bayes as a Bayesian network. The label Y is the parent of the features X1, X2, X3, . . . , Xn.]

Evidence is denoted by shading in a node

Can interpret Bayesian network as a generative process. For example, to generate an e-mail, we

1 Decide whether it is spam or not spam, by sampling y ∼ p(Y)
2 For each word i = 1 to n, sample xi ∼ p(Xi | Y = y)
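A minimal sketch of this ancestral sampling procedure (the parameter values below are made up for illustration): sample the label first, then each word indicator given the label.

    # Sketch: ancestral sampling from the naive Bayes model described above.
    import numpy as np

    rng = np.random.default_rng(0)
    p_spam = 0.3                                    # p(Y = 1), illustrative value
    p_word_given_y = np.array([[0.05, 0.10, 0.02],  # p(X_i = 1 | Y = 0)
                               [0.40, 0.30, 0.25]]) # p(X_i = 1 | Y = 1)

    def sample_email():
        y = rng.random() < p_spam                   # step 1: spam or not spam
        x = rng.random(p_word_given_y.shape[1]) < p_word_given_y[int(y)]  # step 2
        return int(y), x.astype(int)

    print([sample_email() for _ in range(3)])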

13 / 29

Page 16:

Bayesian network structure ⇒ conditional independencies

Generalizing the earlier example, can show that a variable is independent of its non-descendants given its parents

Common parent – fixing B decouples A and C (tail to tail: A ← B → C)

Cascade – knowing B decouples A and C (head to tail: A → B → C)

V-structure – knowing C couples A and B (head to head: A → C ← B)

This important phenomenon is called explaining away. Here p(A, B, C) = p(A)p(B)p(C | A, B)

14 / 29

Page 17:

D-separation (“directed separation”) in Bayesian networks

Bayes Ball Algorithm to determine whether X ⊥ Z | Y by looking at graph d-separation

Look to see if there is an active path between X and Z when variables Y are observed:

[Figures (a) and (b): three-node structures on X, Y, Z showing when the path between X and Z through Y is active or blocked, depending on whether Y is observed.]

15 / 29

Page 18:

D-separation (“directed separation”) in Bayesian networks

Bayes Ball Algorithm to determine whether X ⊥ Z | Y by looking at graph d-separation

Look to see if there is an active path between X and Z when variables Y are observed:

[Figures (a) and (b): further three-node structures on X, Y, Z showing when the path between X and Z through Y is active or blocked, depending on whether Y is observed.]

16 / 29

Page 19:

D-separation (“directed separation”) in Bayesian networks

Bayes Ball Algorithm to determine whether X ⊥ Z | Y by looking at graph d-separation

Look to see if there is an active path between X and Z when variables Y are observed:

[Figures (a) and (b): three-node structures on X, Y, Z with Y unobserved and observed.]

If no such path, then X and Z are d-separated with respect to Y

d-separation reduces statistical independencies (hard) to connectivity in graphs (easy)

Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query
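The Bayes ball rules themselves are not reproduced here. As a rough sketch of how d-separation can be tested programmatically, the code below uses the equivalent criterion "X and Z are d-separated by Y iff they are separated by Y in the moral graph of the ancestral subgraph of X ∪ Y ∪ Z"; the graph encoding (a dict mapping each node to its parents) and function names are illustrative.

    # Sketch: d-separation via the ancestral / moral-graph criterion.
    from collections import deque

    def ancestors(dag, nodes):
        seen = set(nodes)
        stack = list(nodes)
        while stack:
            v = stack.pop()
            for p in dag.get(v, set()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    def d_separated(dag, X, Z, Y):
        keep = ancestors(dag, set(X) | set(Z) | set(Y))
        # Moralize: connect co-parents, then drop edge directions.
        adj = {v: set() for v in keep}
        for child in keep:
            pa = [p for p in dag.get(child, set()) if p in keep]
            for p in pa:
                adj[p].add(child); adj[child].add(p)
            for i in range(len(pa)):
                for j in range(i + 1, len(pa)):
                    adj[pa[i]].add(pa[j]); adj[pa[j]].add(pa[i])
        # BFS from X avoiding observed nodes Y; d-separated iff Z is unreachable.
        frontier = deque(set(X) - set(Y))
        reached = set(frontier)
        while frontier:
            v = frontier.popleft()
            for w in adj[v]:
                if w not in reached and w not in Y:
                    reached.add(w)
                    frontier.append(w)
        return not (reached & set(Z))

    # Example: cascade X -> Y -> Z. X and Z are d-separated given {Y}, not given {}.
    dag = {"X": set(), "Y": {"X"}, "Z": {"Y"}}
    print(d_separated(dag, {"X"}, {"Z"}, {"Y"}), d_separated(dag, {"X"}, {"Z"}, set()))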

17 / 29

Page 20:

D-separation example 1

[Figure: a Bayesian network over X1, . . . , X6, shown twice with different nodes shaded to illustrate d-separation.]

18 / 29

Page 21:

D-separation example 2

[Figure: the same Bayesian network over X1, . . . , X6, shown twice with different nodes shaded to illustrate d-separation.]

19 / 29

Page 22:

The 2011 Turing Award (Judea Pearl) was for Bayesian networks

20 / 29

Page 23:

Example: hidden Markov model (HMM)

[Figure: HMM as a Bayesian network. Hidden chain Y1 → Y2 → · · · → Y6, with each Yt emitting an observation Xt.]

Frequently used for speech recognition and part-of-speech tagging

Joint distribution factors as:

p(y, x) = p(y1) p(x1 | y1) ∏_{t=2}^T p(yt | yt−1) p(xt | yt)

p(y1) is the initial distribution of the starting state
p(yt | yt−1) is the transition probability between hidden states
p(xt | yt) is the emission probability

What are the conditional independencies here?

Many, e.g. Y1 ⊥ {Y3, . . . ,Y6} | Y2
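A minimal sketch evaluating this factorized joint for a toy HMM (two hidden states, two observation symbols; the parameter values are made up for illustration).

    # Sketch: evaluate p(y, x) = p(y1) p(x1|y1) prod_t p(yt|yt-1) p(xt|yt).
    import numpy as np

    pi = np.array([0.6, 0.4])             # initial distribution p(y1)
    A = np.array([[0.7, 0.3],             # transition p(yt | yt-1)
                  [0.2, 0.8]])
    B = np.array([[0.9, 0.1],             # emission p(xt | yt)
                  [0.3, 0.7]])

    def hmm_joint(y, x):
        prob = pi[y[0]] * B[y[0], x[0]]
        for t in range(1, len(y)):
            prob *= A[y[t - 1], y[t]] * B[y[t], x[t]]
        return prob

    print(hmm_joint([0, 0, 1, 1], [0, 1, 1, 0]))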

21 / 29


Page 25:

Summary

A Bayesian network specifies the global distribution by a DAG and local conditional probability distributions (CPDs) for each node

Can interpret as a generative model, where variables are sampled in topological order

Examples: naive Bayes, hidden Markov models (HMMs), latent Dirichlet allocation

Conditional independence via d-separation

Compute the probability of any assignment by multiplying CPDs

Maximum likelihood learning of CPDs is easy (decomposes, can estimate each CPD separately)

22 / 29

Page 26:

Undirected graphical models

An alternative representation for a joint distribution is an undirected graphical model

As for directed models, we have one node for each random variable

Rather than CPDs, we specify (non-negative) potential functions over sets of variables associated with cliques C of the graph,

p(x1, . . . , xn) = (1/Z) ∏_{c∈C} φc(xc)

Z is the partition function and normalizes the distribution:

Z = ∑_{x1,...,xn} ∏_{c∈C} φc(xc)

Like a CPD, φc(xc) can be represented as a table, but it is not normalized

Also known as Markov random fields (MRFs) or Markov networks

23 / 29

Page 27:

Undirected graphical models

p(x1, . . . , xn) = (1/Z) ∏_{c∈C} φc(xc),   Z = ∑_{x1,...,xn} ∏_{c∈C} φc(xc)

Simple example (each edge potential function encourages its variables to take the same value):

[Figure: a triangle graph on A, B, C with one edge potential per pair.]

φA,B(a, b) = φB,C(b, c) = φA,C(a, c):

        0    1
   0   10    1
   1    1   10

(each potential is 10 when its two variables agree and 1 when they differ)

p(a, b, c) = (1/Z) φA,B(a, b) · φB,C(b, c) · φA,C(a, c),

where

Z = ∑_{(a,b,c)∈{0,1}³} φA,B(a, b) · φB,C(b, c) · φA,C(a, c) = 2 · 1000 + 6 · 10 = 2060.
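A tiny sketch (not from the slides) that enumerates all 8 assignments of the triangle model above and recovers Z = 2060 by brute force.

    # Sketch: brute-force the partition function of the triangle MRF above.
    # Each pairwise potential is 10 when its two endpoints agree and 1 otherwise.
    import itertools

    def phi(u, v):
        return 10.0 if u == v else 1.0

    Z = sum(phi(a, b) * phi(b, c) * phi(a, c)
            for a, b, c in itertools.product([0, 1], repeat=3))
    print(Z)  # 2 * 1000 + 6 * 10 = 2060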

24 / 29

Page 28:

Markov network structure ⇒ conditional independencies

Let G be the undirected graph where we have one edge for every pair of variables that appear together in a potential

Conditional independence is given by graph separation

[Figure: node sets XA and XC separated by XB.]

XA ⊥ XC | XB if there is no path from a ∈ A to c ∈ C after removing all variables in B
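A minimal sketch of this separation test (not from the slides): remove the nodes in B and check by breadth-first search whether any node of A can still reach a node of C; the adjacency-dict encoding is illustrative.

    # Sketch: conditional independence by graph separation in an undirected graph.
    from collections import deque

    def separated(adj, A, C, B):
        frontier = deque(set(A) - set(B))
        reached = set(frontier)
        while frontier:
            v = frontier.popleft()
            for w in adj.get(v, ()):
                if w not in reached and w not in B:
                    reached.add(w)
                    frontier.append(w)
        return not (reached & set(C))

    # Chain a - b - c: a and c are separated by {b} but not by the empty set.
    adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(separated(adj, {"a"}, {"c"}, {"b"}), separated(adj, {"a"}, {"c"}, set()))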

25 / 29

Page 29:

Markov blanket

A set U is a Markov blanket of X if X ∉ U and if U is a minimal set of nodes such that X ⊥ (𝒳 − {X} − U) | U, where 𝒳 denotes the set of all variables

In undirected graphical models, the Markov blanket of a variable is precisely its neighbors in the graph:

[Figure: a node X with its neighbors shaded.]

In other words, X is independent of the rest of the nodes in the graph given its immediate neighbors

26 / 29


Page 31:

Directed and undirected models are different

With <3 edges,

27 / 29

Page 32:

Example: Ising model

Invented by the physicist Wilhelm Lenz (1920), who gave it as a problem to his student Ernst Ising

Mathematical model of ferromagnetism in statistical mechanics

The spin of an atom is influenced by the spins of atoms nearby on the material:

[Figure: lattice of spins; up arrow = +1, down arrow = -1.]

Each atom has a spin Xi ∈ {−1, +1}, whose value is the direction of the atom's spin

If a spin at position i is +1, what is the probability that the spin at position j is also +1?

Are there phase transitions where spins go from “disorder” to “order”?

28 / 29

Page 33:

Example: Ising model

Each atom has a spin Xi ∈ {−1, +1}, whose value is the direction of the atom's spin

The spin of an atom is influenced by the spins of atoms nearby on the material:

[Figure: lattice of spins; up arrow = +1, down arrow = -1.]

p(x1, · · · , xn) = (1/Z) exp( (1/T) ( ∑_{i<j} wi,j xi xj + ∑_i θi xi ) )

When wi,j > 0, adjacent atoms are encouraged to have the same spin (attractive or ferromagnetic); wi,j < 0 encourages Xi ≠ Xj

Node potentials θi encode the bias of the individual atoms

Varying the temperature T makes the distribution more or less spiky
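A rough sketch of sampling from this distribution by Gibbs sampling on a small grid, assuming nearest-neighbour couplings with a single shared weight w and no node bias (θi = 0); the grid size, w, and T values are illustrative. The single-spin conditional p(xi = +1 | rest) = σ(2 w · (sum of neighbouring spins) / T) follows from the factorization above.

    # Sketch: Gibbs sampling for the Ising distribution above on a small grid.
    import numpy as np

    rng = np.random.default_rng(0)
    n, T, w = 8, 2.0, 1.0
    x = rng.choice([-1, 1], size=(n, n))

    def neighbor_sum(x, i, j):
        s = 0.0
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n:
                s += x[ni, nj]
        return s

    for sweep in range(200):
        for i in range(n):
            for j in range(n):
                # p(x_ij = +1 | rest) = sigmoid(2 * w * sum_of_neighbors / T)
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * w * neighbor_sum(x, i, j) / T))
                x[i, j] = 1 if rng.random() < p_plus else -1

    print(x.mean())  # average magnetization after sampling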

29 / 29

Page 34:

Supplementary material

Extra slides for questions or further explanation

30 / 29

Page 35:

Basic idea

Suppose we have a simple chain A → B → C → D, and we want to compute p(D)

p(D) is a set of values, {p(D = d), d ∈ Val(D)}. The algorithm computes sets of values at a time – an entire distribution

The joint distribution factors as

p(A, B, C, D) = p(A) p(B | A) p(C | B) p(D | C)

In order to compute p(D), we have to marginalize over A, B, C:

p(D) = ∑_{a,b,c} p(A = a, B = b, C = c, D)

31 / 29

Page 36:

How can we perform the sum efficiently?

Our goal is to compute

p(D) = ∑_{a,b,c} p(a, b, c, D) = ∑_{a,b,c} p(a) p(b | a) p(c | b) p(D | c)
     = ∑_c ∑_b ∑_a p(D | c) p(c | b) p(b | a) p(a)

We can push the summations inside to obtain:

p(D) = ∑_c p(D | c) ∑_b p(c | b) ∑_a p(b | a) p(a)

where the innermost term p(b | a) p(a) is ψ1(a, b), and its sum over a is τ1(b)

Let's call ψ1(A, B) = P(A)P(B | A). Then, τ1(B) = ∑_a ψ1(a, B)

Similarly, let ψ2(B, C) = τ1(B)P(C | B). Then, τ2(C) = ∑_b ψ2(b, C)

This procedure is dynamic programming: computation is inside out instead of outside in
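A minimal sketch of this elimination order for a binary chain A → B → C → D, with each CPD held as a small array (the CPD values are illustrative); a brute-force sum over the full joint is included to check the answer.

    # Sketch: the elimination above. psi1(a,b) = p(a) p(b|a); tau1(b) = sum_a psi1(a,b); etc.
    import numpy as np

    p_A = np.array([0.6, 0.4])                 # p(a)
    p_B_A = np.array([[0.7, 0.3], [0.2, 0.8]]) # p(b | a), rows indexed by a
    p_C_B = np.array([[0.9, 0.1], [0.4, 0.6]]) # p(c | b)
    p_D_C = np.array([[0.5, 0.5], [0.1, 0.9]]) # p(d | c)

    psi1 = p_A[:, None] * p_B_A     # psi1(a, b)
    tau1 = psi1.sum(axis=0)         # tau1(b) = sum_a psi1(a, b)
    psi2 = tau1[:, None] * p_C_B    # psi2(b, c)
    tau2 = psi2.sum(axis=0)         # tau2(c) = sum_b psi2(b, c)
    p_D = (tau2[:, None] * p_D_C).sum(axis=0)

    # Brute-force check against the full joint:
    joint = (p_A[:, None, None, None] * p_B_A[:, :, None, None]
             * p_C_B[None, :, :, None] * p_D_C[None, None, :, :])
    print(p_D, joint.sum(axis=(0, 1, 2)))  # the two should match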

32 / 29

Page 37:

Inference in a chain

Generalizing the previous example, suppose we have a chain X1 → X2 → · · · → Xn, where each variable has k states

For i = 1 up to n − 1, compute (and cache)

p(Xi+1) = ∑_{xi} p(Xi+1 | xi) p(xi)

Each update takes k² time (why?)

The total running time is O(nk²)

In comparison, naively marginalizing over all latent variables has complexity O(k^n)

We did inference over the joint without ever explicitly constructing it!
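A small sketch of this forward recursion for a chain with k states: each update is one k × k matrix–vector product, so the total cost is O(nk²). The transition tables here are random, purely for illustration.

    # Sketch: forward marginals on a chain X1 -> X2 -> ... -> Xn with k states.
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 10, 4
    p_x1 = rng.dirichlet(np.ones(k))
    # transitions[i][a, b] = p(X_{i+2} = b | X_{i+1} = a), one table per edge
    transitions = [rng.dirichlet(np.ones(k), size=k) for _ in range(n - 1)]

    marginals = [p_x1]
    for T in transitions:
        marginals.append(marginals[-1] @ T)   # p(x_{i+1}) = sum_{x_i} p(x_{i+1}|x_i) p(x_i)

    print(marginals[-1], marginals[-1].sum())  # p(X_n), sums to 1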

33 / 29

Page 38:

ML learning in Bayesian networks

Maximum likelihood learning: max_θ ℓ(θ; D), where

ℓ(θ; D) = log p(D; θ) = ∑_{x∈D} log p(x; θ)
        = ∑_i ∑_{x̂pa(i)} ∑_{x∈D : xpa(i) = x̂pa(i)} log p(xi | x̂pa(i))

In Bayesian networks, we have the closed form ML solution:

θ^ML_{xi | xpa(i)} = N_{xi, xpa(i)} / ∑_{xi} N_{xi, xpa(i)}

where N_{xi, xpa(i)} is the number of times that the (partial) assignment xi, xpa(i) is observed in the training data

We can estimate each CPD independently because the objective decomposes by variable and parent assignment
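A minimal sketch of this counting estimator for one CPD, on a hypothetical two-node network A → B (so Pa(B) = {A}); the data and variable names are made up for illustration.

    # Sketch: closed-form ML estimation of p(B | A) by counting co-occurrences.
    from collections import Counter

    data = [{"A": 0, "B": 1}, {"A": 0, "B": 0}, {"A": 0, "B": 1},
            {"A": 1, "B": 1}, {"A": 1, "B": 1}]

    counts = Counter((x["A"], x["B"]) for x in data)          # N_{x_B, x_pa(B)}
    totals = Counter(x["A"] for x in data)                    # sum over x_B
    cpd_B_given_A = {(a, b): counts[(a, b)] / totals[a]
                     for (a, b) in counts}
    print(cpd_B_given_A)  # e.g. p(B=1 | A=0) = 2/3, p(B=1 | A=1) = 1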

34 / 29

Page 39:

Parameter learning in Markov networks

How do we learn the parameters of an Ising model?

[Figure: lattice of spins; up arrow = +1, down arrow = -1.]

p(x1, · · · , xn) = (1/Z) exp( ∑_{i<j} wi,j xi xj + ∑_i θi xi )

35 / 29

Page 40:

Bad news for Markov networks

The global normalization constant Z (θ) kills decomposability:

θ^ML = arg max_θ log ∏_{x∈D} p(x; θ)
     = arg max_θ ∑_{x∈D} ( ∑_c log φc(xc; θ) − log Z(θ) )
     = arg max_θ ( ∑_{x∈D} ∑_c log φc(xc; θ) ) − |D| log Z(θ)

The log-partition function prevents us from decomposing the objective into a sum over terms for each potential

Solving for the parameters becomes much more complicated
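To make the point concrete, here is a rough sketch (not from the slides) of the log-likelihood for the small triangle MRF from earlier, with a single shared parameter θ = log φ for agreeing pairs: every evaluation needs log Z(θ), computed here by enumerating all joint configurations, which is exactly the term that couples all potentials together.

    # Sketch: MRF log-likelihood for the triangle model; note the global log Z(theta).
    import itertools
    import numpy as np

    edges = [(0, 1), (1, 2), (0, 2)]

    def log_potential(theta, u, v):
        return theta if u == v else 0.0           # log phi = theta * [u == v]

    def log_likelihood(theta, data):
        configs = list(itertools.product([0, 1], repeat=3))
        log_scores = [sum(log_potential(theta, x[i], x[j]) for i, j in edges)
                      for x in configs]
        log_Z = np.logaddexp.reduce(log_scores)   # couples all parameters together
        return sum(sum(log_potential(theta, x[i], x[j]) for i, j in edges) - log_Z
                   for x in data)

    data = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
    print(log_likelihood(np.log(10.0), data))     # theta = log 10 matches the earlier table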

36 / 29