Transcript of Modeling Rich Structured Data via Kernel Distribution...
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012
Le Song
Lecture 19, Nov 1, 2012
Reading: Chap. 8 of C. Bishop's book
Inference in Graphical Models
Conditional Independence Assumptions
Global Markov Assumption: $A \perp B \mid C$ if $\mathrm{sep}_G(A, B; C)$
Local Markov Assumption: $X \perp \mathrm{Nondescendants}_X \mid \mathrm{Pa}_X$
[Figure: a graph in which $C$ separates $A$ from $B$; a node $X$ with its parents $\mathrm{Pa}_X$ and its nondescendants; converting BNs to undirected models by moralizing and then triangulating, yielding an undirected tree or an undirected chordal graph]
Distribution Factorization
Bayesian Networks (Directed Graphical Models), $I_{\ell}(G) \subseteq I(P)$:
$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})$
The factors $P(X_i \mid \mathrm{Pa}_{X_i})$ are Conditional Probability Tables (CPTs).
Markov Networks (Undirected Graphical Models), strictly positive $P$, $I(G) \subseteq I(P)$:
$P(X_1, \dots, X_n) = \frac{1}{Z} \prod_{i=1}^{m} \Psi_i(D_i)$
The $\Psi_i(D_i)$ are clique potentials over maximal cliques $D_i$, and
$Z = \sum_{x_1, x_2, \dots, x_n} \prod_{i=1}^{m} \Psi_i(D_i)$
is the normalization constant (partition function).
Inference in Graphical Models
Graphical models give compact representations of probabilistic distributions $P(X_1, \dots, X_n)$ (turning an $n$-way table into a product of much smaller tables)
How do we answer queries about $P$?
Compute likelihood
Compute conditionals
Compute maximum a posteriori assignment
We use inference as a name for the process of computing answers to such queries
Query Type 1: Likelihood
Most queries involve evidence.
Evidence $e$ is an assignment of values to a set $E$ of variables, i.e. observations on some of the variables.
Without loss of generality, $E = \{X_{k+1}, \dots, X_n\}$.
Simplest query: compute the probability of the evidence,
$P(e) = \sum_{x_1} \cdots \sum_{x_k} P(x_1, \dots, x_k, e)$
This is often referred to as computing the likelihood of $e$.
[Figure: the network with the evidence set $E$ marked; the remaining variables are summed over]
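To make the exponential blow-up concrete, here is a minimal brute-force sketch of the likelihood query in Python. The joint is a hypothetical normalized table over three binary variables; the variable names and evidence choice are made up for illustration:

```python
import itertools
import numpy as np

# Brute-force likelihood P(e): clamp the evidence variable and sum the joint
# over every assignment of the hidden ones (2^k terms for k hidden variables).
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()              # a valid joint distribution P(x1, x2, x3)

evidence = {2: 1}                 # observe X3 = 1; X1 and X2 are hidden
hidden = [i for i in range(3) if i not in evidence]

p_e = 0.0
for assign in itertools.product([0, 1], repeat=len(hidden)):
    idx = [0, 0, 0]
    for var, val in zip(hidden, assign):
        idx[var] = val
    for var, val in evidence.items():
        idx[var] = val
    p_e += joint[tuple(idx)]

print(p_e, joint[:, :, 1].sum())  # same number, computed two ways
```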
Query Type 2: Conditional Probability
Often we are interested in the conditional probability distribution of a variable given the evidence:
$P(X \mid e) = \frac{P(X, e)}{P(e)} = \frac{P(X, e)}{\sum_x P(X = x, e)}$
This is also called the a posteriori belief in $X$ given evidence $e$.
We usually query a subset $Y$ of all variables $\mathcal{X} = \{Y, Z, E\}$ and "don't care" about the remaining variables $Z$:
$P(Y \mid e) = \sum_z P(Y, Z = z \mid e)$
This takes all possible configurations of $Z$ into account; the process of summing out the unwanted variables $Z$ is called marginalization.
Query Type 2: Conditional Probability Example
[Figure: two networks; in each, the evidence set $E$ is summed over and we are interested in the conditionals of the query variables]

Application of a posteriori Belief
Prediction: what is the probability of an outcome given the starting condition?
The query node is a descendant of the evidence.
Diagnosis: what is the probability of a disease/fault given symptoms?
The query node is an ancestor of the evidence.
Learning under partial observations (fill in the unobserved).
Information can flow in either direction: inference can combine evidence from all parts of the network.
[Figure: chains $A \to B \to C$, with evidence entering at either end]
Query Type 3: Most Probable Assignment
We want to find the most probable joint assignment for some variables of interest.
Such reasoning is usually performed under some given evidence $e$, ignoring the values of the other variables $Z$:
$\mathrm{MAP}(Y \mid e) = \operatorname{argmax}_y P(y \mid e) = \operatorname{argmax}_y \sum_z P(y, Z = z \mid e)$
This is also called the maximum a posteriori (MAP) assignment for $Y$.
[Figure: the network with the evidence set $E$ summed over and the query variables whose most probable values we want]
Application of MAP assignment
Classification: find the most likely label, given the evidence.
Explanation: what is the most likely scenario, given the evidence?
Cautionary note: the MAP assignment of a variable depends on its context, i.e. the set of variables being jointly queried.
Example:
MAP of $(X, Y)$? $(0, 0)$
MAP of $X$? $1$

X  Y  P(X,Y)        X  P(X)
0  0  0.35          0  0.4
0  1  0.05          1  0.6
1  0  0.3
1  1  0.3
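The point is easy to check numerically. A short Python sketch using exactly the table above:

```python
import numpy as np

# The table from the slide: P_XY[x, y] = P(X = x, Y = y).
P_XY = np.array([[0.35, 0.05],
                 [0.30, 0.30]])

joint_map = np.unravel_index(P_XY.argmax(), P_XY.shape)
P_X = P_XY.sum(axis=1)            # marginal P(X) = [0.4, 0.6]

print(joint_map)                  # (0, 0): the most probable joint assignment
print(P_X.argmax())               # 1:      the most probable value of X alone
```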
Complexity of Inference
Computing the a posteriori belief $P(X \mid e)$ in a GM is NP-hard in general.
Hardness implies we cannot find a general procedure that works efficiently for arbitrary GMs.
For particular families of GMs, we have provably efficient procedures (e.g. trees).
For some families of GMs, we need to design efficient approximate inference algorithms (e.g. grids).
Approaches to inference
Exact inference algorithms
Variable elimination algorithm
Message-passing algorithm (sum-product, belief propagation algorithm)
The junction tree algorithm
Approximate inference algorithms
Sampling methods/Stochastic simulation
Variational algorithms
Marginalization and Elimination
A metabolic pathway: what is the likelihood that protein $E$ is produced?
Query: $P(E)$
$P(E) = \sum_d \sum_c \sum_b \sum_a P(a, b, c, d, E)$
Using the graphical model, we get
$P(E) = \sum_d \sum_c \sum_b \sum_a P(a)\,P(b|a)\,P(c|b)\,P(d|c)\,P(E|d)$
Naïve summation needs to enumerate over an exponential number of terms.
[Figure: chain $A \to B \to C \to D \to E$]
Elimination in Chains
Rearranging the terms and the summations:
$P(E) = \sum_d \sum_c \sum_b \sum_a P(a)\,P(b|a)\,P(c|b)\,P(d|c)\,P(E|d)$
$= \sum_d \sum_c \sum_b P(c|b)\,P(d|c)\,P(E|d) \sum_a P(a)\,P(b|a)$
[Figure: chain $A \to B \to C \to D \to E$]
Elimination in Chains (cont.)
Now we can perform the innermost summation efficiently:
$P(E) = \sum_d \sum_c \sum_b P(c|b)\,P(d|c)\,P(E|d) \sum_a P(a)\,P(b|a)$
$= \sum_d \sum_c \sum_b P(c|b)\,P(d|c)\,P(E|d)\; p(b)$
The innermost summation eliminates one variable from our summation at a local cost.
It is equivalent to a matrix-vector multiplication, with cost $|\mathrm{Val}(A)| \times |\mathrm{Val}(B)|$.
[Figure: chain $A \to B \to C \to D \to E$ with the new factor $p(b)$ attached at $B$]
Elimination in Chains (cont.)
Rearranging and then summing again, we get
$P(E) = \sum_d \sum_c \sum_b P(c|b)\,P(d|c)\,P(E|d)\; p(b)$
$= \sum_d \sum_c P(d|c)\,P(E|d) \sum_b P(c|b)\, p(b)$
$= \sum_d \sum_c P(d|c)\,P(E|d)\; p(c)$
Again equivalent to a matrix-vector multiplication, with cost $|\mathrm{Val}(B)| \times |\mathrm{Val}(C)|$. For example, with

P(C|B):      B=0    B=1          p(B):   B=0: 0.25
  C=0        0.15   0.35                 B=1: 0.75
  C=1        0.85   0.65

we get $p(C=0) = 0.15 \cdot 0.25 + 0.35 \cdot 0.75 = 0.3$ and $p(C=1) = 0.85 \cdot 0.25 + 0.65 \cdot 0.75 = 0.7$.
[Figure: chain with the factors $p(b)$, $p(c)$]
Elimination in Chains (cont.)
Eliminate nodes one by one all the way to the end:
$P(E) = \sum_d P(E|d)\; p(d)$
Computational complexity for a chain of length $n$:
Each step $\Psi(x_i) = \sum_{x_{i-1}} P(x_i \mid x_{i-1})\, p(x_{i-1})$ costs $O(|\mathrm{Val}(X_{i-1})| \times |\mathrm{Val}(X_i)|) = O(k^2)$ operations, so $O(n k^2)$ overall.
Compare to naïve summation $\sum_{x_{n-1}} \cdots \sum_{x_1} P(x_1, \dots, x_n)$: $O(k^n)$.
[Figure: chain with the factors $p(c)$, $p(d)$]
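As a concrete sketch, the whole chain elimination is just repeated matrix-vector products. The $p(B)$ and $P(C \mid B)$ values below are the ones from the table above; the CPTs $P(D \mid C)$ and $P(E \mid D)$ are hypothetical placeholders added only so the chain runs end to end:

```python
import numpy as np

# p(B) and the CPT P(C|B), with P_CgB[c, b] = P(C = c | B = b).
p_B   = np.array([0.25, 0.75])
P_CgB = np.array([[0.15, 0.35],
                  [0.85, 0.65]])

# Eliminating B is one matrix-vector product: p(c) = sum_b P(c|b) p(b)
p_C = P_CgB @ p_B
print(p_C)                        # [0.3 0.7]; cost O(|Val(B)| * |Val(C)|)

# Eliminate the remaining nodes the same way, one matvec per step:
P_DgC = np.array([[0.6, 0.2],     # hypothetical P(D|C)
                  [0.4, 0.8]])
P_EgD = np.array([[0.9, 0.3],     # hypothetical P(E|D)
                  [0.1, 0.7]])
p_E = P_EgD @ (P_DgC @ p_C)       # O(n k^2) in total, versus O(k^n) naively
print(p_E, p_E.sum())             # a valid distribution; sums to 1
```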
Undirected Chains
Rearrange terms and perform local summations:
$P(E) = \frac{1}{Z} \sum_d \sum_c \sum_b \sum_a \Psi(a, b)\,\Psi(b, c)\,\Psi(c, d)\,\Psi(E, d)$
$= \frac{1}{Z} \sum_d \sum_c \sum_b \Psi(b, c)\,\Psi(c, d)\,\Psi(E, d) \sum_a \Psi(a, b)$
$= \frac{1}{Z} \sum_d \sum_c \sum_b \Psi(b, c)\,\Psi(c, d)\,\Psi(E, d)\; \psi(b)$
[Figure: undirected chain $A - B - C - D - E$]
The Sum-Product Operation
During inference, we try to compute an expression of the sum-product form:
$\sum_{\mathbf{z}} \prod_{\Psi \in \mathcal{F}} \Psi$
$\mathbf{X} = \{X_1, \dots, X_n\}$: the set of variables
$\mathcal{F}$: a set of factors such that for each $\Psi \in \mathcal{F}$, $\mathrm{Scope}[\Psi] \subseteq \mathbf{X}$
$\mathbf{Y} \subseteq \mathbf{X}$: a set of query variables
$\mathbf{Z} = \mathbf{X} \setminus \mathbf{Y}$: the variables to eliminate
The result of eliminating the variables in $\mathbf{Z}$ is a factor
$\tau(\mathbf{Y}) = \sum_{\mathbf{z}} \prod_{\Psi \in \mathcal{F}} \Psi$
This factor does not necessarily correspond to any probability or conditional probability in the network; the conditional is obtained by renormalizing:
$P(\mathbf{Y} \mid e) = \frac{\tau(\mathbf{Y})}{\sum_{\mathbf{y}} \tau(\mathbf{y})}$
Inference via Variable Elimination
General idea:
Write the query in the form
$P(X_1, e) = \sum_{x_n} \cdots \sum_{x_3} \sum_{x_2} \prod_i P(x_i \mid \mathrm{pa}_i)$
The sum is ordered to suggest an elimination order. Then iteratively:
Move all irrelevant terms outside of the innermost sum
Perform the innermost sum, getting a new term
Insert the new term into the product
Finally, renormalize:
$P(X_1 \mid e) = \frac{P(X_1, e)}{\sum_{x_1} P(x_1, e)}$
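A minimal sketch of this procedure over table factors, assuming binary variables and a small hypothetical chain $A \to B \to C$ (the CPT numbers are made up). The trick used here is to give every variable a global array axis, so the factor product is plain NumPy broadcasting and elimination is a sum over one axis:

```python
import numpy as np

def make_factor(table, axes, n):
    """Embed `table` (axis k of `table` <-> global axis axes[k]) into n axes."""
    t = np.asarray(table, dtype=float)
    full = t.reshape(t.shape + (1,) * (n - t.ndim))
    order = list(axes) + [i for i in range(n) if i not in axes]
    return np.transpose(full, np.argsort(order))

def eliminate(factors, axis):
    """Multiply the factors whose scope contains `axis`, then sum it out.
    Assumes at least one factor touches the axis."""
    touch = [f for f in factors if f.shape[axis] > 1]
    rest = [f for f in factors if f.shape[axis] == 1]
    prod = touch[0]
    for f in touch[1:]:
        prod = prod * f
    return rest + [prod.sum(axis=axis, keepdims=True)]

# Hypothetical chain A -> B -> C (variables get global axes 0, 1, 2).
n = 3
factors = [
    make_factor([0.25, 0.75], [0], n),                     # p(A)
    make_factor([[0.9, 0.2], [0.1, 0.8]], [1, 0], n),      # P(B|A)[b, a]
    make_factor([[0.15, 0.35], [0.85, 0.65]], [2, 1], n),  # P(C|B)[c, b]
]
for ax in (0, 1):                 # eliminate A, then B
    factors = eliminate(factors, ax)

result = factors[0]
for f in factors[1:]:
    result = result * f
print(result.squeeze())           # marginal P(C); already normalized here
```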
A more complex network
A food web: what is the probability $P(A \mid H)$ that hawks are leaving given that the grass condition is poor?
[Figure: the food-web DAG over $A, B, C, D, E, F, G, H$]
Example: Variable Elimination
Query: $P(A \mid h)$; need to eliminate $B, C, D, E, F, G, H$.
Initial factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,P(g|e)\,P(h|e,f)$
Choose an elimination order: $H, G, F, E, D, C, B$
Step 1: Eliminate $H$ by conditioning (fix the evidence node on its observed value):
$m_h(e, f) = P(H = h \mid e, f)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,P(g|e)\,m_h(e, f)$
[Figure: the food-web graph; $H$ is removed after conditioning]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C, D, E, F, G$.
Current factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,P(g|e)\,m_h(e, f)$
Step 2: Eliminate $G$. Compute $m_g(e) = \sum_g P(g \mid e) = 1$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,m_g(e)\,m_h(e, f)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,m_h(e, f)$ (the constant factor $m_g(e) = 1$ drops out)
[Figure: $G$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C, D, E, F$.
Current factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,m_h(e, f)$
Step 3: Eliminate $F$. Compute $m_f(e, a) = \sum_f P(f \mid a)\, m_h(e, f)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,m_f(e, a)$
[Figure: $F$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C, D, E$.
Current factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,m_f(e, a)$
Step 4: Eliminate $E$. Compute $m_e(a, c, d) = \sum_e P(e \mid c, d)\, m_f(e, a)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,m_e(a, c, d)$
[Figure: $E$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C, D$.
Current factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,m_e(a, c, d)$
Step 5: Eliminate $D$. Compute $m_d(a, c) = \sum_d P(d \mid a)\, m_e(a, c, d)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,m_d(a, c)$
[Figure: $D$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C$.
Current factors: $P(a)\,P(b)\,P(c|b)\,m_d(a, c)$
Step 6: Eliminate $C$. Compute $m_c(a, b) = \sum_c P(c \mid b)\, m_d(a, c)$
$\Rightarrow P(a)\,P(b)\,m_c(a, b)$
[Figure: $C$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B$.
Current factors: $P(a)\,P(b)\,m_c(a, b)$
Step 7: Eliminate $B$. Compute $m_b(a) = \sum_b P(b)\, m_c(a, b)$
$\Rightarrow P(a)\, m_b(a)$
[Figure: only $A$ remains]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; finally, renormalize over $A$.
Current factors: $P(a)\, m_b(a)$
Final step: renormalize. $P(a, h) = P(a)\, m_b(a)$; compute $P(h) = \sum_a P(a)\, m_b(a)$
$\Rightarrow P(a \mid h) = \dfrac{P(a)\, m_b(a)}{\sum_a P(a)\, m_b(a)}$
Complexity of variable elimination
Suppose in one elimination step we compute
$m_x(y_1, \dots, y_k) = \sum_x m'_x(x, y_1, \dots, y_k)$
$m'_x(x, y_1, \dots, y_k) = \prod_{i=1}^{k} m_i(x, \mathbf{y}_{c_i})$
This requires
$k \cdot |\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(\mathbf{Y}_{c_i})|$ multiplications: for each value of $x, y_1, \dots, y_k$, we do $k$ multiplications
$|\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(\mathbf{Y}_{c_i})|$ additions: for each value of $y_1, \dots, y_k$, we do $|\mathrm{Val}(X)|$ additions
The complexity is exponential in the number of variables in the intermediate factor.
[Figure: star graph with $x$ connected to $y_1, \dots, y_k$]
Inference in Graphical Models
General form of the inference problem:
$P(X_1, \dots, X_n) \propto \prod_c \Psi(D_c)$
We want to query a set of variables $Y$ given evidence $e$, and "don't care" about a set of variables $Z$.
Compute $P(Y, e) = \sum_z \prod_c \Psi(D_c)$ using variable elimination.
Renormalize to obtain the conditional: $P(Y \mid e) = \frac{P(Y, e)}{\sum_y P(y, e)}$
Two examples that use the graph structure to order the computation:
Chain: $A - B - C - D - E$
DAG: the food web over $A, \dots, H$
From Variable Elimination to Message Passing
Recall that the dependency induced during marginalization is captured in elimination cliques:
summation ↔ elimination; intermediate terms ↔ elimination cliques.
Can this lead to a generic inference algorithm?
There is nice localization in the computation:
$P(E) = \sum_d \sum_c \sum_b \sum_a P(a)\,P(b|a)\,P(c|b)\,P(d|c)\,P(E|d)$
$P(E) = \sum_d P(E|d) \sum_c P(d|c) \sum_b P(c|b) \sum_a P(a)\,P(b|a)$
Chain: Query E
Messages are passed from $A$ toward the query node $E$:
$m_{AB}(b) = \sum_a P(a)\,P(b|a)$, $m_{BC}(c) = \sum_b P(c|b)\,m_{AB}(b)$, $m_{CD}(d) = \sum_c P(d|c)\,m_{BC}(c)$,
$P(E) = m_{DE}(E) = \sum_d P(E|d)\,m_{CD}(d)$
[Figure: chain $A \to B \to C \to D \to E$ with messages $m_{AB}, m_{BC}, m_{CD}, m_{DE}$ on the edges]
Chain: Query C
Start elimination away from the query variable:
$P(C) = \sum_a \sum_b \sum_d \sum_e P(a)\,P(b|a)\,P(C|b)\,P(d|C)\,P(e|d)$
$= \Big(\sum_b P(C|b) \sum_a P(a)\,P(b|a)\Big)\Big(\sum_d P(d|C) \sum_e P(e|d)\Big)$
Messages flow from both ends toward $C$: $m_{AB}(b)$, $m_{BC}(C)$, $m_{ED}(d)$, $m_{DC}(C)$, and
$P(C) = m_{BC}(C)\, m_{DC}(C)$
[Figure: chain with messages $m_{AB}, m_{BC}$ arriving from the left and $m_{ED}, m_{DC}$ from the right]
Chain: What if I want to query everybody?
Query $P(A), P(B), P(C), P(D), P(E)$, e.g.
$P(B) = \Big(\sum_a P(a)\,P(B|a)\Big)\Big(\sum_c P(c|B) \sum_d P(d|c) \sum_e P(e|d)\Big)$
Computational cost:
Each message costs $O(K^2)$
Chain length is $L$
Cost for each query is about $O(L K^2)$
For $L$ queries, the cost is about $O(L^2 K^2)$
[Figure: chain with messages $m_{AB}(B), m_{CB}(B), m_{DC}(c), m_{ED}(d)$]
What is shared in these queries?
$P(B) = \Big(\sum_a P(a)\,P(B|a)\Big)\Big(\sum_c P(c|B) \sum_d P(d|c) \sum_e P(e|d)\Big)$, using $m_{AB}(B), m_{CB}(B), m_{DC}(c), m_{ED}(d)$
$P(E) = \sum_d P(E|d) \sum_c P(d|c) \sum_b P(c|b) \sum_a P(a)\,P(b|a)$, using $m_{AB}(b), m_{BC}(c), m_{CD}(d), m_{DE}(E)$
$P(C) = \Big(\sum_b P(C|b) \sum_a P(a)\,P(b|a)\Big)\Big(\sum_d P(d|C) \sum_e P(e|d)\Big)$, using $m_{AB}(b), m_{BC}(C), m_{DC}(C), m_{ED}(d)$
Each edge carries one message in each direction: the number of unique messages is $2(L - 1)$.
Forward-backward algorithm
Compute and cache the $2(L - 1)$ unique messages.
Forward pass: $m_{AB}(b),\ m_{BC}(c),\ m_{CD}(d),\ m_{DE}(e)$
Backward pass: $m_{BA}(a),\ m_{CB}(b),\ m_{DC}(c),\ m_{ED}(d)$
At query time, just multiply together the messages from the neighbors, e.g. $P(D) = m_{CD}(D)\, m_{ED}(D)$
For all queries, the cost is $O(2 L K^2)$.
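A minimal NumPy sketch of the forward-backward scheme on an undirected chain with hypothetical random pairwise potentials: two sweeps compute all messages, after which every marginal is a cheap local product.

```python
import numpy as np

# Chain MRF with L binary nodes; Psi[i][x_i, x_{i+1}] is the (hypothetical)
# pairwise potential between node i and node i+1.
L = 5
rng = np.random.default_rng(0)
Psi = [rng.random((2, 2)) for _ in range(L - 1)]

# Forward messages: fwd[i] = message from node i-1 into node i
fwd = [np.ones(2) for _ in range(L)]
for i in range(1, L):
    fwd[i] = fwd[i - 1] @ Psi[i - 1]      # sum over x_{i-1}

# Backward messages: bwd[i] = message from node i+1 into node i
bwd = [np.ones(2) for _ in range(L)]
for i in range(L - 2, -1, -1):
    bwd[i] = Psi[i] @ bwd[i + 1]          # sum over x_{i+1}

# Any marginal is now one elementwise product plus normalization.
for i in range(L):
    belief = fwd[i] * bwd[i]
    print(i, belief / belief.sum())
```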
DAG: Variable elimination
Elimination order $H, G, F, E, B, C, D$ (with $h$ fixed to its observed value):
$P(A) = P(A) \sum_d P(d|A) \sum_c \Big(\sum_b P(b)\,P(c|b)\Big)\Big(\sum_e P(e|c,d)\,\big(\sum_g P(g|e)\big)\big(\sum_f P(f|A)\,P(h|e,f)\big)\Big)$
Intermediate messages: $m_h(e, f)$, $m_g(e)$, $m_f(A, e)$, $m_e(A, c, d)$, $m_b(c)$, $m_c(A, d)$, $m_d(A)$
4-way tables are created: eliminating $E$ involves the four variables $A, C, D, E$!
DAG: Cliques of size 4 are generated
[Figure: the food-web graph after each elimination step, showing the messages $m_h(e, f)$, $m_g(e)$, $m_f(A, e)$, $m_e(A, c, d)$, $m_b(c)$, $m_c(A, d)$, $m_d(A)$ in turn; a 4-way table is created when $E$ is eliminated]
DAG: A different elimination order
Elimination order $G, H, F, B, C, D, E$:
$P(A) = P(A) \sum_e \Big(\sum_g P(g|e)\Big)\Big(\sum_f P(f|A)\,P(h|e,f)\Big)\Big(\sum_d P(d|A) \sum_c P(e|c,d) \sum_b P(b)\,P(c|b)\Big)$
Intermediate messages: $m_g(e)$, $m_h(e, f)$, $m_f(A, e)$, $m_b(c)$, $m_c(d, e)$, $m_d(A, e)$, $m_e(A)$
NO 4-way tables are created!
DAG: No cliques of size 4
[Figure: the food-web graph after each elimination step, showing the messages $m_g(e)$, $m_h(e, f)$, $m_f(A, e)$, $m_b(c)$, $m_c(d, e)$, $m_d(A, e)$, $m_e(A)$ in turn; no intermediate table involves more than three variables]
Any thoughts?
Chains have nice properties:
The forward-backward algorithm works
Intermediate results (messages) live along the edges
Can we generalize to other graphs (trees, loopy graphs)?
How about undirected trees? Is there a forward-backward algorithm?
Loopy graphs are more complicated: different elimination orders result in different computational costs
Can we somehow make loopy graphs behave like trees?
Tree Graphical Models
Undirected tree: there is a unique path between any pair of nodes.
Directed tree: all nodes except the root have exactly one parent.

Equivalence of directed and undirected trees
Any undirected tree can be converted to a directed tree by choosing a root node and directing all edges away from it.
A directed tree and the corresponding undirected tree make the same conditional independence assertions.
The parameterizations are essentially the same:
Undirected tree: $P(x) = \frac{1}{Z} \prod_{i \in V} \Psi(x_i) \prod_{(i,j) \in E} \Psi(x_i, x_j)$
Directed tree: $P(x) = P(x_r) \prod_{(i,j) \in E} P(x_j \mid x_i)$
Equivalence: $\Psi(x_r) = P(x_r)$, $\Psi(x_i, x_j) = P(x_j \mid x_i)$, $Z = 1$, and $\Psi(x_i) = 1$ for $i \neq r$.
Message passing on trees
Messages are passed along the tree edges. For the tree with edges $(e,f), (e,g), (g,h), (g,j)$:
$P(x_e, x_f, x_g, x_h, x_j) \propto \Psi(x_e)\,\Psi(x_f)\,\Psi(x_g)\,\Psi(x_h)\,\Psi(x_j)\,\Psi(x_e, x_f)\,\Psi(x_e, x_g)\,\Psi(x_g, x_h)\,\Psi(x_g, x_j)$
$P(x_f) \propto \Psi(x_f) \sum_{x_e}\Big(\Psi(x_e)\,\Psi(x_e, x_f) \sum_{x_g}\Big(\Psi(x_g)\,\Psi(x_g, x_e)\,\big(\sum_{x_h}\Psi(x_h)\,\Psi(x_h, x_g)\big)\big(\sum_{x_j}\Psi(x_j)\,\Psi(x_j, x_g)\big)\Big)\Big)$
[Figure: the tree with messages $m_{hg}(x_g)$, $m_{jg}(x_g)$, $m_{ge}(x_e)$, $m_{ef}(x_f)$ flowing toward the query node $f$]
Sharing messages on trees
Query $f$: uses messages $m_{hg}(x_g)$, $m_{jg}(x_g)$, $m_{ge}(x_e)$, $m_{ef}(x_f)$
Query $j$: uses messages $m_{hg}(x_g)$, $m_{fe}(x_e)$, $m_{eg}(x_g)$, $m_{gj}(x_j)$
[Figure: the same tree twice, with message arrows pointing toward $f$ in one panel and toward $j$ in the other]
Computational cost for all queries
Query $P(x_e), P(x_f), P(x_g), P(x_h), P(x_j)$
Doing things separately:
Each message costs $O(K^2)$
The number of edges is $L$
Cost for each query is about $O(L K^2)$
For $L$ queries, the cost is about $O(L^2 K^2)$
[Figure: the tree with messages toward a single query node]
Forward-backward algorithm in trees
Forward: pick one leaf as the root, compute all messages toward it, and cache them.
Backward: pick another root, compute all messages, and cache them; messages computed in the forward pass are reused.
E.g., query $j$: the message $m_{hg}(x_g)$ computed earlier is reused.
[Figure: the tree swept twice, with previously computed messages marked for reuse]
Computational saving for trees
Compute the forward and backward messages for each edge and save them:
Each message costs $O(K^2)$
The number of edges is $L$
There are $2L$ unique messages
The cost for all queries is about $O(2 L K^2)$
[Figure: the tree with both message directions on every edge]
Message passing algorithm
$m_{ji}(x_i) = \sum_{x_j} \Psi(x_i, x_j)\,\Psi(x_j) \prod_{k \in N(j) \setminus i} m_{kj}(x_j)$
Take the product of the incoming messages from $N(j) \setminus i$, multiply by the local potentials, and sum out $x_j$. Node $x_j$ can send its message to $x_i$ once the incoming messages from all of $N(j) \setminus i$ have arrived.
[Figure: node $j$ with neighborhood $N(j) \setminus i$ and outgoing message $m_{ji}(x_i)$]
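A minimal sketch of this update on the 5-node tree used above (edges $e\!-\!f$, $e\!-\!g$, $g\!-\!h$, $g\!-\!j$), with binary nodes and hypothetical random potentials. On a tree, firing each directed edge once its prerequisite messages are ready computes all $2L$ messages:

```python
import numpy as np

nodes = ["e", "f", "g", "h", "j"]
edges = [("e", "f"), ("e", "g"), ("g", "h"), ("g", "j")]
rng = np.random.default_rng(1)
node = {v: rng.random(2) for v in nodes}          # local potentials Psi(x_v)
pair = {e: rng.random((2, 2)) for e in edges}     # edge potentials Psi(x_a, x_b)

nbrs = {v: [] for v in nodes}
for a, b in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)

def psi(a, b):
    """Edge potential as a matrix indexed [x_a, x_b]."""
    return pair[(a, b)] if (a, b) in pair else pair[(b, a)].T

def message(j, i, msgs):
    """m_{j->i}(x_i) = sum_{x_j} Psi(x_i,x_j) Psi(x_j) prod_{k in N(j)\\i} m_{k->j}(x_j)."""
    prod = node[j].copy()
    for k in nbrs[j]:
        if k != i:
            prod *= msgs[(k, j)]
    return psi(j, i).T @ prod

# Fire each directed edge as soon as its incoming messages exist;
# on a tree this terminates with all 2L messages computed.
msgs = {}
while len(msgs) < 2 * len(edges):
    for a, b in edges:
        for j, i in ((a, b), (b, a)):
            if (j, i) not in msgs and all((k, j) in msgs for k in nbrs[j] if k != i):
                msgs[(j, i)] = message(j, i, msgs)

for v in nodes:  # belief at v: local potential times all incoming messages
    belief = node[v].copy()
    for k in nbrs[v]:
        belief *= msgs[(k, v)]
    print(v, belief / belief.sum())
```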
From Variable Elimination to Message Passing
Recall the Variable Elimination Algorithm:
Choose an ordering in which the query node $f$ is the final node
Eliminate node $i$ by removing all potentials containing $i$, taking the sum/product over $x_i$
Place the resultant factor back
For a tree graphical model:
Choose the query node $f$ as the root of the tree
View the tree as a directed tree with edges pointing towards $f$
Elimination of each node can be considered as message passing directly along the tree branches, rather than on some transformed graph
Thus, we can use the tree itself as a data structure for inference.
How about general graphs?
Trees are nice:
We can just compute two messages for each edge
Computation is ordered along the graph
Intermediate results are associated with edges
General graphs are not so clear:
Different elimination orders generate different cliques and factor sizes
Computation and intermediate results are not associated with edges
The local-computation view is not so clear
Can we make them tree-like, or treat them as trees?
[Figures: the tree with its messages; the food-web DAG]
Message passing for loopy graphs
Local message passing for trees guarantees the consistency of the local marginals:
the computed $P(x_i)$ is the correct one
the computed $P(x_i, x_j)$ is the correct one
...
For loopy graphs, there are no consistency guarantees for local message passing.
[Figure: the tree with its messages, for contrast]
Loopy belief propagation
Inference for loopy graphical models is NP-hard in general.
Treat loopy graphs locally as if they were trees, and iteratively estimate the marginals:
Read in messages
Process messages
Send updated outgoing messages
Repeat for all variables until convergence
[Figure: a loopy graph around node A]
Message update schedule
Synchronous update: $x_i$ can send its message once the incoming messages from all of $N(i) \setminus j$ have arrived.
Slow; provably correct for trees, and may converge for loopy graphs.
Asynchronous update: $x_i$ can send its message whenever any incoming message from $N(i) \setminus j$ changes.
Fast; convergence is not easy to prove, but empirically it often works.
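A minimal sketch of the synchronous schedule on a 3-node cycle with hypothetical random potentials: all messages are updated in lock-step from the previous iteration's messages until they stop changing. Convergence is not guaranteed on loopy graphs, hence the iteration cap.

```python
import numpy as np

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]                  # a loop
rng = np.random.default_rng(2)
node = {v: rng.random(2) for v in nodes}
pair = {e: rng.random((2, 2)) for e in edges}

nbrs = {v: [u for e in edges for u in e if v in e and u != v] for v in nodes}

def psi(a, b):
    return pair[(a, b)] if (a, b) in pair else pair[(b, a)].T

# One message per directed edge, initialized uniformly.
msgs = {(j, i): np.ones(2)
        for a, b in edges for (j, i) in ((a, b), (b, a))}

for it in range(100):                             # synchronous sweeps
    new = {}
    for (j, i) in msgs:
        prod = node[j].copy()
        for k in nbrs[j]:
            if k != i:
                prod *= msgs[(k, j)]
        m = psi(j, i).T @ prod
        new[(j, i)] = m / m.sum()                 # normalize for stability
    delta = max(np.abs(new[d] - msgs[d]).max() for d in msgs)
    msgs = new
    if delta < 1e-8:
        break

for v in nodes:                                   # approximate (pseudo-)marginals
    belief = node[v].copy()
    for k in nbrs[v]:
        belief *= msgs[(k, v)]
    print(v, belief / belief.sum())
```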