Modeling Rich Structured Data via Kernel Distribution...


Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012

Le Song

Lecture 19, Nov 1, 2012

Reading: Chap. 8 of C. Bishop's book

Inference in Graphical Models

Conditional Independence Assumptions

Global Markov Assumption

$A \perp B \mid C$, if $\mathrm{sep}_G(A, B; C)$

Local Markov Assumption

$X \perp \mathrm{NonDescendants}_X \mid \mathrm{Pa}_X$

[Figure: $C$ separates $A$ from $B$ in the graph; a node $X$ shown with its parents $\mathrm{Pa}_X$ and its non-descendants]

[Figure: Bayesian networks (BN) and Markov networks (MN) within the space of distributions $P$; moralization maps a BN to an undirected graph, triangulation yields an undirected chordal graph; undirected trees and undirected chordal graphs]

Distribution Factorization

Bayesian Networks (Directed Graphical Models), $I$-map: $I_\ell(G) \subseteq I(P)$

⇔

$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})$

The factors are Conditional Probability Tables (CPTs)

Markov Networks (Undirected Graphical Models), strictly positive $P$, $I$-map: $I(G) \subseteq I(P)$

⇔

$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{i=1}^{m} \Psi_i(D_i)$, with $Z = \sum_{x_1, x_2, \ldots, x_n} \prod_{i=1}^{m} \Psi_i(D_i)$

The factors are maximal-clique potentials; $Z$ is the normalization constant (partition function)
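To make the normalization contrast concrete, here is a minimal sketch (made-up numbers for a single binary pair $A, B$) of how the directed factorization is normalized by construction while the undirected one needs the partition function $Z$:

```python
import numpy as np

# Directed factorization: P(A, B) = P(A) P(B | A), read off CPTs; already normalized.
P_A = np.array([0.3, 0.7])                  # P(A=0), P(A=1)  (made-up numbers)
P_B_given_A = np.array([[0.9, 0.1],         # row a: [P(B=0|A=a), P(B=1|A=a)]
                        [0.4, 0.6]])
P_joint_bn = P_A[:, None] * P_B_given_A     # entry (a, b) = P(A=a, B=b)
assert np.isclose(P_joint_bn.sum(), 1.0)    # no partition function needed

# Undirected factorization: P(A, B) = (1/Z) Psi(A, B) for any non-negative potential;
# the partition function Z must be computed by summing over all configurations.
Psi_AB = np.array([[2.0, 0.5],
                   [1.0, 3.0]])
Z = Psi_AB.sum()
P_joint_mn = Psi_AB / Z
assert np.isclose(P_joint_mn.sum(), 1.0)
```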

Inference in Graphical Models

Graphical models give compact representations of probability distributions $P(X_1, \ldots, X_n)$ (from full $n$-way tables to much smaller tables)

How do we answer queries about 𝑃?

Compute likelihood

Compute conditionals

Compute maximum a posteriori assignment

We use inference as a name for the process of computing answers to such queries

Query Type 1: Likelihood

Most queries involve evidence

Evidence $e$ is an assignment of values to a set $E$ of variables

Evidence consists of observations of some variables

Without loss of generality, $E = \{X_{k+1}, \ldots, X_n\}$

Simplest query: compute the probability of the evidence

$P(e) = \sum_{x_1} \cdots \sum_{x_k} P(x_1, \ldots, x_k, e)$

This is often referred to as computing the likelihood of $e$

[Figure: the evidence nodes $E$ are shaded; the remaining variables are summed over]
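As a reference point for what follows, here is a naive brute-force sketch of the likelihood query (hypothetical helper name, assuming a full joint table is available); this is exactly the exponential enumeration that variable elimination will avoid:

```python
import numpy as np
from itertools import product

def likelihood(joint: np.ndarray, evidence: dict) -> float:
    """P(e): sum the joint table over all assignments consistent with the evidence.
    `evidence` maps a variable's axis index to its observed value."""
    total = 0.0
    for assignment in product(*[range(d) for d in joint.shape]):
        if all(assignment[i] == v for i, v in evidence.items()):
            total += joint[assignment]
    return total

# Example: a random normalized joint over 4 binary variables, evidence X3 = 1, X4 = 0.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()
print(likelihood(joint, {2: 1, 3: 0}))
```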

Query Type 2: Conditional Probability

Often we are interested in the conditional probability distribution of a variable given the evidence

$P(X \mid e) = \frac{P(X, e)}{P(e)} = \frac{P(X, e)}{\sum_x P(X = x, e)}$

It is also called the a posteriori belief in $X$ given evidence $e$

We usually query a subset $Y$ of all variables $\mathcal{X} = \{Y, Z, e\}$ and "don't care" about the remaining $Z$

$P(Y \mid e) = \sum_z P(Y, Z = z \mid e)$

This takes all possible configurations of $Z$ into account

The process of summing out the unwanted variables $Z$ is called marginalization

Query Type 2: Conditional Probability Example

[Figure: two examples; in each, the evidence nodes $E$ are shaded and summed over, and we are interested in the conditionals for the highlighted query variables]

Application of a posteriori Belief

Prediction: what is the probability of an outcome given the starting condition?

The query node is a descendant of the evidence

Diagnosis: what is the probability of a disease/fault given the symptoms?

The query node is an ancestor of the evidence

Learning under partial observations (fill in the unobserved variables)

Information can flow in either direction

Inference can combine evidence from all parts of the network

[Figure: a chain $A \to B \to C$, queried in both directions]

Query Type 3: Most Probable Assignment

Want to find the most probable joint assignment for some variables of interest

Such reasoning is usually performed under some given evidence $e$, ignoring (the values of) the other variables $Z$

Also called the maximum a posteriori (MAP) assignment for $Y$

$\mathrm{MAP}(Y \mid e) = \arg\max_y P(Y = y \mid e) = \arg\max_y \sum_z P(Y = y, Z = z \mid e)$

[Figure: the evidence nodes $E$ are shaded and the nuisance variables are summed over; we are interested in the most probable values for the query variables]

Application of MAP assignment

Classification

Find most likely label, given the evidence

Explanation

What is the most likely scenario, given the evidence

Cautionary note:

The MAP assignment of a variable depends on its context, the set of variables being jointly queried

Example:

MAP of $(X, Y)$?  $(0, 0)$

MAP of $X$ alone?  $1$

X  Y  P(X,Y)
0  0  0.35
0  1  0.05
1  0  0.30
1  1  0.30

X  P(X)
0  0.40
1  0.60
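A small sketch reproducing the cautionary example numerically, showing that the joint MAP and the marginal MAP disagree on the tables above:

```python
import numpy as np

# The joint table from the slide: rows indexed by X, columns by Y.
P_XY = np.array([[0.35, 0.05],
                 [0.30, 0.30]])

# Joint MAP over (X, Y): the largest entry of the joint table.
print(np.unravel_index(P_XY.argmax(), P_XY.shape))   # (0, 0)

# Marginal MAP over X alone: marginalize out Y first, then take the argmax.
P_X = P_XY.sum(axis=1)                               # [0.4, 0.6]
print(int(P_X.argmax()))                             # 1
```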

Complexity of Inference

Computing the a posteriori belief $P(X \mid e)$ in a GM is NP-hard in general

Hardness implies we cannot find a general procedure that works efficiently for arbitrary GMs

For particular families of GMs, we can have provably efficient procedures (e.g., trees)

For some families of GMs, we need to design efficient approximate inference algorithms (e.g., grids)

Approaches to inference

Exact inference algorithms

Variable elimination algorithm

Message-passing algorithm (sum-product, belief propagation algorithm)

The junction tree algorithm

Approximate inference algorithms

Sampling methods/Stochastic simulation

Variational algorithms


Marginalization and Elimination

A metabolic pathway: what is the likelihood that protein $E$ is produced?

Query: $P(E)$

$P(E) = \sum_a \sum_b \sum_c \sum_d P(a, b, c, d, E)$

Using the graphical model, we get

$P(E) = \sum_a \sum_b \sum_c \sum_d P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(E \mid d)$

[Figure: the chain $A \to B \to C \to D \to E$]

Naïve summation needs to enumerate over an exponential number of terms

Elimination in Chains

Rearranging the terms and the summations:

$P(E) = \sum_a \sum_b \sum_c \sum_d P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(E \mid d)$

$= \sum_b \sum_c \sum_d P(c \mid b) P(d \mid c) P(E \mid d) \sum_a P(a) P(b \mid a)$

[Figure: the chain $A \to B \to C \to D \to E$]

Elimination in Chains (cont.)

Now we can perform the innermost summation efficiently:

$P(E) = \sum_b \sum_c \sum_d P(c \mid b) P(d \mid c) P(E \mid d) \sum_a P(a) P(b \mid a)$

$= \sum_b \sum_c \sum_d P(c \mid b) P(d \mid c) P(E \mid d) P(b)$

The innermost summation eliminates one variable from our summation argument, at a local cost.

[Figure: the chain $A \to B \to C \to D \to E$, with the new factor $P(b)$ attached to node $B$]

Summing out $a$ is equivalent to a matrix-vector multiplication, costing $|\mathrm{Val}(A)| \times |\mathrm{Val}(B)|$ operations

Elimination in Chains (cont.)

Rearranging and then summing again, we get

$P(E) = \sum_b \sum_c \sum_d P(c \mid b) P(d \mid c) P(E \mid d) P(b)$

$= \sum_c \sum_d P(d \mid c) P(E \mid d) \sum_b P(c \mid b) P(b)$

$= \sum_c \sum_d P(d \mid c) P(E \mid d) P(c)$

[Figure: the chain $A \to B \to C \to D \to E$, with the intermediate factors $P(b)$ and $P(c)$]

Summing out $b$ is again equivalent to a matrix-vector multiplication, costing $|\mathrm{Val}(B)| \times |\mathrm{Val}(C)|$ operations

Example tables:

P(c|b)   b=0    b=1
c=0      0.15   0.35
c=1      0.85   0.65

P(b)
b=0   0.25
b=1   0.75
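With the tables above, eliminating $b$ is literally one matrix-vector product; a minimal sketch:

```python
import numpy as np

# Tables from the slide: columns of P_c_given_b are indexed by b, rows by c.
P_c_given_b = np.array([[0.15, 0.35],
                        [0.85, 0.65]])
P_b = np.array([0.25, 0.75])

# Eliminating b: P(c) = sum_b P(c|b) P(b), i.e., a matrix-vector product.
P_c = P_c_given_b @ P_b
print(P_c)          # [0.3 0.7]
print(P_c.sum())    # 1.0
```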

Elimination in Chains (cont.)

Eliminate nodes one by one all the way to the end

$P(E) = \sum_d P(E \mid d) P(d)$

Computational complexity for a chain of length $k$:

Each step $\Psi(X_i) = \sum_{x_{i-1}} P(X_i \mid x_{i-1}) P(x_{i-1})$ costs $O(|\mathrm{Val}(X_{i-1})| \times |\mathrm{Val}(X_i)|)$ operations, so the whole chain costs $O(k n^2)$ (with $n$ values per variable)

Compare to the naïve summation $\sum_{x_1} \cdots \sum_{x_{k-1}} P(x_1, \ldots, X_k)$, which costs $O(n^k)$

[Figure: the chain $A \to B \to C \to D \to E$, with the intermediate factors $P(b)$ and $P(c)$]
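A minimal sketch of this front-to-back chain elimination (made-up binary CPTs, hypothetical helper name), which runs in $O(k n^2)$ rather than $O(n^k)$:

```python
import numpy as np

def chain_marginal(prior: np.ndarray, cpts: list) -> np.ndarray:
    """Marginal of the last chain variable; `prior` is P(X1), `cpts[i]` is the
    CPT of the next variable with rows = child value, columns = parent value."""
    message = prior
    for cpt in cpts:
        message = cpt @ message          # sum_{x_i} P(x_{i+1} | x_i) message(x_i)
    return message

# The A -> B -> C -> D -> E chain with made-up binary CPTs.
P_A = np.array([0.6, 0.4])
cpts = [np.array([[0.7, 0.2],            # column a: [P(child=0|a), P(child=1|a)]
                  [0.3, 0.8]]) for _ in range(4)]
print(chain_marginal(P_A, cpts))         # P(E); entries sum to 1
```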

Undirected Chains

[Figure: the undirected chain $A - B - C - D - E$]

Rearrange terms, perform local summation:

$P(E) = \frac{1}{Z} \sum_a \sum_b \sum_c \sum_d \Psi(b, a) \Psi(c, b) \Psi(d, c) \Psi(E, d)$

$= \frac{1}{Z} \sum_b \sum_c \sum_d \Psi(c, b) \Psi(d, c) \Psi(E, d) \sum_a \Psi(b, a)$

$= \frac{1}{Z} \sum_b \sum_c \sum_d \Psi(c, b) \Psi(d, c) \Psi(E, d) \Psi(b)$

The Sum-Product Operation

During inference, we try to compute an expression

Sum-product form: $\sum_{\mathcal{Z}} \prod_{\Psi \in \mathcal{F}} \Psi$

$\mathcal{X} = \{X_1, \ldots, X_n\}$, the set of variables

$\mathcal{F}$, a set of factors such that for each $\Psi \in \mathcal{F}$, $\mathrm{Scope}[\Psi] \subseteq \mathcal{X}$

$\mathcal{Y} \subset \mathcal{X}$, a set of query variables

$\mathcal{Z} = \mathcal{X} - \mathcal{Y}$, the variables to eliminate

The result of eliminating the variables in $\mathcal{Z}$ is a factor

$\tau(\mathcal{Y}) = \sum_{\mathcal{Z}} \prod_{\Psi \in \mathcal{F}} \Psi$

This factor does not necessarily correspond to any probability or conditional probability in the network.

$P(\mathcal{Y}) = \frac{\tau(\mathcal{Y})}{\sum_{\mathcal{Y}} \tau(\mathcal{Y})}$

Inference via Variable Elimination

General Idea

Write query in the form

$P(X_1, e) = \sum_{x_n} \cdots \sum_{x_3} \sum_{x_2} \prod_i P(x_i \mid \mathrm{Pa}_{X_i})$

The sum is ordered to suggest an elimination order

Then iteratively

Move all irrelevant terms outside of innermost sum

Perform innermost sum, getting a new term

Insert the new term into the product

Finally renormalize

$P(X_1 \mid e) = \frac{\tau(X_1, e)}{\sum_{x_1} \tau(x_1, e)}$
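A self-contained sketch of this procedure in the sum-product form above (illustrative helper names, made-up CPTs for the $A \to B \to C \to D \to E$ chain); a real library would also add evidence handling and a smarter elimination order:

```python
import numpy as np
from string import ascii_lowercase

# A factor is a pair (vars, table): a tuple of variable names and an ndarray
# with one axis per variable.

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = tuple(dict.fromkeys(vars1 + vars2))          # ordered union
    idx = {v: ascii_lowercase[i] for i, v in enumerate(out_vars)}
    spec = ("".join(idx[v] for v in vars1) + "," +
            "".join(idx[v] for v in vars2) + "->" +
            "".join(idx[v] for v in out_vars))
    return out_vars, np.einsum(spec, t1, t2)

def marginalize(factor, var):
    """Sum a variable out of a factor."""
    vars_, table = factor
    return tuple(v for v in vars_ if v != var), table.sum(axis=vars_.index(var))

def eliminate(factors, order):
    """Sum-product variable elimination: eliminate variables in `order`."""
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        prod = involved[0]
        for f in involved[1:]:
            prod = multiply(prod, f)
        factors = rest + [marginalize(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# The chain A -> B -> C -> D -> E with made-up binary CPTs, query P(E).
P_A = (("A",), np.array([0.6, 0.4]))
P_B_A = (("A", "B"), np.array([[0.7, 0.3], [0.2, 0.8]]))   # rows: a, cols: b
P_C_B = (("B", "C"), np.array([[0.9, 0.1], [0.4, 0.6]]))
P_D_C = (("C", "D"), np.array([[0.5, 0.5], [0.3, 0.7]]))
P_E_D = (("D", "E"), np.array([[0.8, 0.2], [0.1, 0.9]]))
vars_, table = eliminate([P_A, P_B_A, P_C_B, P_D_C, P_E_D], order=["A", "B", "C", "D"])
print(vars_, table, table.sum())    # ('E',) [...] 1.0
```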

A more complex network

A food web

What is the probability $P(A \mid H)$ that hawks are leaving given that the grass condition is poor?

[Figure: the food-web DAG over the nodes $A, B, C, D, E, F, G, H$]

Example: Variable Elimination

Query: $P(A \mid h)$; need to eliminate $B, C, D, E, F, G, H$

Initial factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) P(h \mid e, f)$

Choose an elimination order: $H, G, F, E, D, C, B$

Step 1: Eliminate $H$

Conditioning (fix the evidence node to its observed value):

$m_h(e, f) = P(H = h \mid e, f)$

[Figure: the food-web DAG before and after removing node $H$]

Example: Variable Elimination

Query: $P(A \mid h)$; need to eliminate $B, C, D, E, F, G$

Initial factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) P(h \mid e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e)\, m_h(e, f)$

Step 2: Eliminate $G$

Compute $m_g(e) = \sum_g P(g \mid e) = 1$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a)\, m_g(e)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a)\, m_h(e, f)$

[Figure: the food-web DAG before and after removing node $G$]

Example: Variable Elimination

Query: $P(A \mid h)$; need to eliminate $B, C, D, E, F$

Initial factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) P(h \mid e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a)\, m_h(e, f)$

Step 3: Eliminate $F$

Compute $m_f(a, e) = \sum_f P(f \mid a)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d)\, m_f(a, e)$

[Figure: the food-web DAG before and after removing node $F$]

Example: Variable Elimination

Query: $P(A \mid h)$; need to eliminate $B, C, D, E$

Initial factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) P(h \mid e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d)\, m_f(a, e)$

Step 4: Eliminate $E$

Compute $m_e(a, c, d) = \sum_e P(e \mid c, d)\, m_f(a, e)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a)\, m_e(a, c, d)$

[Figure: the food-web DAG before and after removing node $E$]

Example: Variable Elimination

Query: $P(A \mid h)$; need to eliminate $B, C, D$

Initial factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) P(h \mid e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d)\, m_f(a, e)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a)\, m_e(a, c, d)$

Step 5: Eliminate $D$

Compute $m_d(a, c) = \sum_d P(d \mid a)\, m_e(a, c, d)$

$\Rightarrow P(a) P(b) P(c \mid b)\, m_d(a, c)$

[Figure: the food-web DAG before and after removing node $D$]

Example: Variable Elimination

Query: $P(A \mid h)$; need to eliminate $B, C$

Initial factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) P(h \mid e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d)\, m_f(a, e)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a)\, m_e(a, c, d)$

$\Rightarrow P(a) P(b) P(c \mid b)\, m_d(a, c)$

Step 6: Eliminate $C$

Compute $m_c(a, b) = \sum_c P(c \mid b)\, m_d(a, c)$

$\Rightarrow P(a) P(b)\, m_c(a, b)$

[Figure: the food-web DAG before and after removing node $C$]

Example: Variable Elimination

Query: $P(A \mid h)$; need to eliminate $B$

Initial factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) P(h \mid e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d)\, m_f(a, e)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a)\, m_e(a, c, d)$

$\Rightarrow P(a) P(b) P(c \mid b)\, m_d(a, c)$

$\Rightarrow P(a) P(b)\, m_c(a, b)$

Step 7: Eliminate $B$

Compute $m_b(a) = \sum_b P(b)\, m_c(a, b)$

$\Rightarrow P(a)\, m_b(a)$

[Figure: the food-web DAG before and after removing node $B$]

Example: Variable Elimination

Query: $P(A \mid h)$; need to renormalize over $A$

Initial factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) P(h \mid e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a)\, m_h(e, f)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d)\, m_f(a, e)$

$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a)\, m_e(a, c, d)$

$\Rightarrow P(a) P(b) P(c \mid b)\, m_d(a, c)$

$\Rightarrow P(a) P(b)\, m_c(a, b)$

$\Rightarrow P(a)\, m_b(a)$

Step 8: Renormalize

$P(a, h) = P(a)\, m_b(a)$; compute $P(h) = \sum_a P(a)\, m_b(a)$

$\Rightarrow P(a \mid h) = \dfrac{P(a)\, m_b(a)}{\sum_a P(a)\, m_b(a)}$

[Figure: the food-web DAG and the remaining query node $A$]

Complexity of variable elimination

Suppose in one elimination step we compute

π‘šπ‘₯ 𝑦1, … , π‘¦π‘˜ = π‘šπ‘₯β€² (π‘₯, 𝑦1, … , π‘¦π‘˜)π‘₯

π‘šπ‘₯β€² π‘₯, 𝑦1, … , π‘¦π‘˜ = π‘šπ‘– π‘₯, 𝑦𝑐𝑖

π‘˜π‘–=1

This requires

π‘˜ βˆ— π‘‰π‘Žπ‘™ 𝑋 βˆ— π‘‰π‘Žπ‘™ π‘Œπ‘π‘–π‘– multiplications

For each value of π‘₯, 𝑦1, … , π‘¦π‘˜, we do k multiplications

π‘‰π‘Žπ‘™ 𝑋 βˆ— π‘‰π‘Žπ‘™ π‘Œπ‘π‘–π‘– additions

For each value of 𝑦1, … , π‘¦π‘˜, we do π‘‰π‘Žπ‘™ 𝑋 additions

Complexity is exponential in the number of variables in the intermediate factor

30

𝑋

𝑦1 π‘¦π‘˜ 𝑦𝑖

Inference in Graphical Models

General form of the inference problem

$P(X_1, \ldots, X_n) \propto \prod_i \Psi(D_i)$

Want to query a set of variables $Y$ given evidence $e$, and "don't care" about a set of variables $Z$

Compute $\tau(Y, e) = \sum_Z \prod_i \Psi(D_i)$ using variable elimination

Renormalize to obtain the conditionals $P(Y \mid e) = \frac{\tau(Y, e)}{\sum_Y \tau(Y, e)}$

Two examples of using the graph structure to order the computation:

Chain: $A \to B \to C \to D \to E$

DAG: the food-web graph over $A, \ldots, H$

From Variable Elimination to Message Passing

Recall that induced dependency during marginalization is captured in elimination cliques

Summation ↔ elimination of a variable; intermediate terms ↔ elimination cliques

Can this lead to a generic inference algorithm?

Chain: Query E

Nice localization in computation:

$P(E) = \sum_a \sum_b \sum_c \sum_d P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(E \mid d)$

$P(E) = \sum_d P(E \mid d) \sum_c P(d \mid c) \sum_b P(c \mid b) \sum_a P(b \mid a) P(a)$

[Figure: the chain $A \to B \to C \to D \to E$ with messages $m_{AB}(b)$, $m_{BC}(c)$, $m_{CD}(d)$, $m_{DE}(E)$ passed along the edges; $P(E) = m_{DE}(E)$]

Chain: Query C

Start elimination away from the query variable

$P(C) = \sum_a \sum_b \sum_d \sum_e P(a) P(b \mid a) P(C \mid b) P(d \mid C) P(e \mid d)$

$P(C) = \left( \sum_d P(d \mid C) \sum_e P(e \mid d) \right) \left( \sum_b P(C \mid b) \sum_a P(b \mid a) P(a) \right)$

[Figure: the chain with messages $m_{AB}(b)$, $m_{BC}(C)$ arriving from the left and $m_{ED}(d)$, $m_{DC}(C)$ arriving from the right; $P(C) = m_{DC}(C)\, m_{BC}(C)$]

Chain: What if I want to query everybody?

$P(B) = \left( \sum_c P(c \mid B) \sum_d P(d \mid c) \sum_e P(e \mid d) \right) \sum_a P(B \mid a) P(a)$

Query $P(A)$, $P(B)$, $P(C)$, $P(D)$, $P(E)$

Computational cost

Each message is $O(K^2)$

Chain length is $L$

Cost for each query is about $O(L K^2)$

For $L$ queries, cost is about $O(L^2 K^2)$

[Figure: the chain with messages $m_{AB}(B)$, $m_{CB}(B)$, $m_{DC}(c)$, $m_{ED}(d)$]

What is shared in these queries?

$P(B) = \left( \sum_c P(c \mid B) \sum_d P(d \mid c) \sum_e P(e \mid d) \right) \sum_a P(B \mid a) P(a)$

$P(E) = \sum_d P(E \mid d) \sum_c P(d \mid c) \sum_b P(c \mid b) \sum_a P(b \mid a) P(a)$

$P(C) = \left( \sum_d P(d \mid C) \sum_e P(e \mid d) \right) \left( \sum_b P(C \mid b) \sum_a P(b \mid a) P(a) \right)$

[Figure: the three queries drawn on the same chain; many of the messages $m_{AB}$, $m_{BC}$, $m_{CD}$, $m_{DE}$, $m_{ED}$, $m_{DC}$, $m_{CB}$ are reused across queries]

The number of unique messages is $2(L - 1)$

Forward-backward algorithm

Compute and cache the $2(L - 1)$ unique messages

At query time, just multiply together the messages from the neighbors

e.g., $P(D) = m_{CD}(D)\, m_{ED}(D)$

Forward pass: messages $m_{AB}(b)$, $m_{BC}(c)$, $m_{CD}(d)$, $m_{DE}(e)$ along the chain $A \to B \to C \to D \to E$

Backward pass: messages $m_{BA}(a)$, $m_{CB}(b)$, $m_{DC}(c)$, $m_{ED}(d)$ in the reverse direction

For all queries, the cost is $O(2 L K^2)$

DAG: Variable elimination

Elimination order H, G, F, E, B, C, D

$P(A) = P(A) \sum_d P(d \mid A) \sum_c \Big( \sum_b P(b) P(c \mid b) \Big) \Big( \sum_e P(e \mid c, d) \big( \sum_g P(g \mid e) \big) \big( \sum_f P(f \mid A) \sum_h P(h \mid e, f) \big) \Big)$

Intermediate messages: $m_H(e, f)$, $m_G(e)$, $m_F(A, e)$, $m_E(A, c, d)$, $m_B(c)$, $m_C(A, d)$, $m_D(A)$

4-way tables created!

DAG: Cliques of size 4 are generated

[Figure: the elimination steps drawn on the food-web graph, one panel per eliminated node, annotated with the messages $m_H(e,f)$, $m_G(e)$, $m_F(A,e)$, $m_E(A,c,d)$, $m_B(c)$, $m_C(A,d)$, $m_D(A)$; a 4-way table is created when $E$ is eliminated]

DAG: A different elimination order

Elimination order G, H, F, B, C, D, E

$P(A) = P(A) \sum_e \Big( \sum_d P(d \mid A) \sum_c P(e \mid c, d) \sum_b P(b) P(c \mid b) \Big) \Big( \sum_f P(f \mid A) \sum_h P(h \mid e, f) \Big) \Big( \sum_g P(g \mid e) \Big)$

Intermediate messages: $m_G(e)$, $m_H(e, f)$, $m_F(A, e)$, $m_B(c)$, $m_C(e, d)$, $m_D(A, e)$, $m_E(A)$

NO 4-way tables!

DAG: No cliques of size 4

[Figure: the elimination steps for the order $G, H, F, B, C, D, E$ drawn on the food-web graph, annotated with the messages $m_G(e)$, $m_H(e,f)$, $m_F(A,e)$, $m_B(c)$, $m_C(d,e)$, $m_D(A,e)$, $m_E(A)$; no clique of size 4 is created]

Any thoughts?

Chains have nice properties

The forward-backward algorithm works

Intermediate results (messages) live on the edges

Can we generalize to other graphs? (trees, loopy graphs?)

How about undirected trees? Is there a forward-backward algorithm?

Loopy graphs are more complicated: different elimination orders result in different computational costs

Can we somehow make loopy graphs behave like trees?

Tree Graphical Models


Undirected tree: a unique path between any pair of nodes

Directed tree: all nodes except the root have exactly one parent

Equivalence of directed and undirected trees

Any undirected tree can be converted to a directed tree by choosing a root node and directing all edges away from it

A directed tree and the corresponding undirected tree make the same conditional independence assertions

The parameterizations are essentially the same

Undirected tree: $P(X) = \frac{1}{Z} \prod_{i \in V} \Psi(X_i) \prod_{(i,j) \in E} \Psi(X_i, X_j)$

Directed tree: $P(X) = P(X_r) \prod_{(i,j) \in E} P(X_j \mid X_i)$

Equivalence: $\Psi(X_r) = P(X_r)$, $\Psi(X_i, X_j) = P(X_j \mid X_i)$, $Z = 1$, and $\Psi(X_i) = 1$ for $i \neq r$

Message passing on trees

Messages are passed along tree edges

$P(X_i, X_j, X_k, X_l, X_f) \propto \Psi(X_i) \Psi(X_j) \Psi(X_k) \Psi(X_l) \Psi(X_f)\, \Psi(X_i, X_j) \Psi(X_k, X_j) \Psi(X_l, X_j) \Psi(X_i, X_f)$

$P(X_f) \propto \Psi(X_f) \sum_{x_i} \Psi(x_i) \Psi(x_i, X_f) \sum_{x_j} \Psi(x_j) \Psi(x_i, x_j) \Big( \sum_{x_k} \Psi(x_k) \Psi(x_k, x_j) \Big) \Big( \sum_{x_l} \Psi(x_l) \Psi(x_l, x_j) \Big)$

[Figure: a tree with leaves $k$, $l$ attached to $j$, and the path $j - i - f$; messages $m_{kj}(X_j)$, $m_{lj}(X_j)$, $m_{ji}(X_i)$, $m_{if}(X_f)$ flow toward the query node $f$]

Sharing messages on trees

Query $f$: messages $m_{kj}(X_j)$, $m_{lj}(X_j)$, $m_{ji}(X_i)$, $m_{if}(X_f)$

Query $j$: messages $m_{kj}(X_j)$, $m_{lj}(X_j)$, $m_{ij}(X_j)$, $m_{fi}(X_i)$

[Figure: the same tree drawn twice; the two queries share the messages $m_{kj}$ and $m_{lj}$, and only the messages along the $f - i - j$ path change direction]

Computational cost for all queries

Query $P(X_k)$, $P(X_l)$, $P(X_j)$, $P(X_i)$, $P(X_f)$

Doing things separately:

Each message is $O(K^2)$

The number of edges is $L$

Cost for each query is about $O(L K^2)$

For $L$ queries, cost is about $O(L^2 K^2)$

[Figure: the example tree with the messages used by one query]

Forward-backward algorithm in trees

Forward: pick one leaf as the root, compute all messages toward it, cache them

Backward: pick another root, compute all messages toward it, cache them

E.g., query $j$: reuse the cached messages $m_{kj}(X_j)$, $m_{lj}(X_j)$, $m_{ij}(X_j)$

[Figure: the tree drawn three times, showing the forward pass, the backward pass, and the query at $j$ reusing cached messages]

Computational saving for trees

Compute the forward and backward messages for each edge, and save them

Compared with doing each query separately:

Each message is $O(K^2)$

The number of edges is $L$

There are $2L$ unique messages

Cost for all queries is about $O(2 L K^2)$

[Figure: the tree with both message directions on each edge: $m_{kj}$, $m_{lj}$, $m_{ji}$, $m_{if}$ and $m_{fi}$, $m_{ij}$, $m_{jk}$, $m_{jl}$]

Message passing algorithm

π‘šπ‘—π‘– 𝑋𝑖 ∝ Ξ¨ 𝑋𝑖 , 𝑋𝑗𝑋𝑗Ψ 𝑋𝑗 π‘šπ‘ π‘— π‘‹π‘—π‘ βˆˆN 𝑗 \i

50

𝑓 𝑖 𝑗

π‘˜

𝑙

π‘šπ‘˜π‘— 𝑋𝑗

π‘šπ‘™π‘— 𝑋𝑗

π‘šπ‘—π‘– 𝑋𝑖

N 𝑗 \i

π‘π‘Ÿπ‘œπ‘‘π‘’π‘π‘‘ π‘œπ‘“ π‘–π‘›π‘π‘œπ‘šπ‘–π‘›π‘” π‘šπ‘’π‘ π‘ π‘Žπ‘”π‘’π‘ 

π‘šπ‘’π‘™π‘‘π‘–π‘π‘™π‘¦ 𝑏𝑦 π‘™π‘œπ‘π‘Žπ‘™ π‘π‘œπ‘‘π‘’π‘›π‘‘π‘–π‘Žπ‘™π‘ 

π‘†π‘’π‘š π‘œπ‘’π‘‘ 𝑋𝑗 𝑋𝑗 can send

message when incoming messages from 𝑁 𝑗 \i arrive
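A minimal sketch of this update on the small tree from the figure (made-up potentials, hypothetical helper names); a practical implementation would cache messages instead of recomputing them recursively:

```python
import numpy as np

# Tree from the slides: edges f - i, i - j, j - k, j - l; K states per variable.
nodes = ["f", "i", "j", "k", "l"]
edges = {("f", "i"), ("i", "j"), ("j", "k"), ("j", "l")}
K = 2
node_pot = {v: np.ones(K) for v in nodes}                          # Psi(X_v), uniform here
edge_pot = {e: np.array([[1.0, 0.5], [0.5, 2.0]]) for e in edges}  # Psi(X_u, X_v), made up

def neighbors(v):
    return [u for e in edges for u in e if v in e and u != v]

def pot(u, v):
    """Edge potential oriented so that axis 0 is u and axis 1 is v."""
    return edge_pot[(u, v)] if (u, v) in edge_pot else edge_pot[(v, u)].T

def message(j, i):
    """m_ji(X_i) = sum_{X_j} Psi(X_i, X_j) Psi(X_j) prod_{s in N(j)\\i} m_sj(X_j)."""
    incoming = node_pot[j].copy()
    for s in neighbors(j):
        if s != i:
            incoming *= message(s, j)
    return pot(j, i).T @ incoming          # sum over X_j

def marginal(v):
    belief = node_pot[v].copy()
    for u in neighbors(v):
        belief *= message(u, v)
    return belief / belief.sum()

print(marginal("f"), marginal("j"))
```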

From Variable Elimination to Message Passing

Recall Variable Elimination Algorithm

Choose an ordering in which the query node 𝑓 is the final node

Eliminate node 𝑖 by removing all potentials containing 𝑖, take sum/product over π‘₯𝑖

Place the resultant factor back

For a Tree graphical model:

Choose query node f as the root of the tree

View tree as a directed tree with edges pointing towards 𝑓

Elimination of each node can be considered as message-passing directly along tree branches, rather than on some transformed graphs

Thus, we can use the tree itself as a data structure for inference

How about general graphs?

Trees are nice

Can just compute two messages for each edge

Order computation along the graph

Associate intermediate results with edges

General graphs are not so clear

Different elimination orders generate different cliques and factor sizes

Computation and intermediate results are not associated with edges

The local-computation view is not so clear

[Figure: the example tree with its messages, and the food-web graph]

Can we make them tree-like, or treat them as trees?

Message passing for loopy graphs

Local message passing for trees guarantees the consistency of local marginals

The computed $P(X_i)$ is the correct one

The computed $P(X_i, X_j)$ is the correct one

…

For loopy graphs, there are no such consistency guarantees for local message passing

[Figure: the example tree with messages $m_{kj}(X_j)$, $m_{lj}(X_j)$, $m_{ji}(X_i)$]

Loopy belief propagation

Inference for loopy graphical models is NP-hard in general

Treat loopy graphs locally as if they were trees

Iteratively estimate the marginals:

Read in the incoming messages

Process the messages

Send updated outgoing messages

Repeat for all variables until convergence
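A minimal sketch of loopy belief propagation on a 4-cycle (made-up potentials): the update is the same tree message rule applied iteratively, and the resulting beliefs are only approximate on loopy graphs:

```python
import numpy as np

nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]   # a cycle
K = 2
node_pot = {v: np.array([1.0, 2.0]) if v == "A" else np.ones(K) for v in nodes}
edge_pot = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}

def pot(u, v):
    return edge_pot[(u, v)] if (u, v) in edge_pot else edge_pot[(v, u)].T

def neighbors(v):
    return [u for e in edges for u in e if v in e and u != v]

# Initialize all directed messages to uniform, then update synchronously.
msgs = {(u, v): np.ones(K) / K for e in edges for (u, v) in (e, e[::-1])}
for _ in range(50):
    new = {}
    for (j, i) in msgs:
        prod = node_pot[j].copy()
        for s in neighbors(j):
            if s != i:
                prod *= msgs[(s, j)]
        m = pot(j, i).T @ prod             # sum out X_j
        new[(j, i)] = m / m.sum()
    if all(np.allclose(new[k], msgs[k]) for k in msgs):
        break                              # messages stopped changing
    msgs = new

for v in nodes:                            # approximate beliefs
    b = node_pot[v].copy()
    for u in neighbors(v):
        b *= msgs[(u, v)]
    print(v, b / b.sum())
```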

Message update schedule

Synchronous update:

$X_j$ can send a message only once the incoming messages from $N(j) \setminus i$ have arrived

Slow

Provably correct for trees; may converge for loopy graphs

Asynchronous update:

$X_j$ can send a message whenever there is a change in any of its incoming messages from $N(j) \setminus i$

Fast

Not easy to prove convergence, but empirically it often works