Transcript of Modeling Rich Structured Data via Kernel Distribution...
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012
Le Song
Lecture 19, Nov 1, 2012
Reading: Chap. 8 of C. Bishop's book
Inference in Graphical Models
Conditional Independence Assumptions
Global Markov Assumption: $A \perp B \mid C$ if $\mathrm{sep}_G(A, B; C)$
Local Markov Assumption: $X \perp \mathrm{Nondescendants}_X \mid \mathrm{Pa}_X$
[Figure: a graph in which $C$ separates $A$ from $B$; a node $X$ with its parents $\mathrm{Pa}_X$ and its nondescendants; converting BNs to undirected models by moralizing and then triangulating, yielding an undirected tree or an undirected chordal graph]
Distribution Factorization
Bayesian Networks (Directed Graphical Models), $I_{\ell}(G) \subseteq I(P)$:
$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})$
The factors $P(X_i \mid \mathrm{Pa}_{X_i})$ are Conditional Probability Tables (CPTs).
Markov Networks (Undirected Graphical Models), strictly positive $P$, $I(G) \subseteq I(P)$:
$P(X_1, \dots, X_n) = \frac{1}{Z} \prod_{i=1}^{m} \Psi_i(D_i)$
The $\Psi_i(D_i)$ are clique potentials over maximal cliques $D_i$, and
$Z = \sum_{x_1, x_2, \dots, x_n} \prod_{i=1}^{m} \Psi_i(D_i)$
is the normalization constant (partition function).
Inference in Graphical Models
Graphical models give compact representations of probabilistic distributions $P(X_1, \dots, X_n)$ (turning an $n$-way table into a product of much smaller tables)
How do we answer queries about $P$?
Compute likelihood
Compute conditionals
Compute maximum a posteriori assignment
We use inference as a name for the process of computing answers to such queries
Query Type 1: Likelihood
Most queries involve evidence.
Evidence $e$ is an assignment of values to a set $E$ of variables, i.e. observations on some of the variables.
Without loss of generality, $E = \{X_{k+1}, \dots, X_n\}$.
Simplest query: compute the probability of the evidence,
$P(e) = \sum_{x_1} \cdots \sum_{x_k} P(x_1, \dots, x_k, e)$
This is often referred to as computing the likelihood of $e$.
[Figure: the network with the evidence set $E$ marked; the remaining variables are summed over]
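To make the exponential blow-up concrete, here is a minimal brute-force sketch of the likelihood query in Python. The joint is a hypothetical normalized table over three binary variables; the variable names and evidence choice are made up for illustration:

```python
import itertools
import numpy as np

# Brute-force likelihood P(e): clamp the evidence variable and sum the joint
# over every assignment of the hidden ones (2^k terms for k hidden variables).
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()              # a valid joint distribution P(x1, x2, x3)

evidence = {2: 1}                 # observe X3 = 1; X1 and X2 are hidden
hidden = [i for i in range(3) if i not in evidence]

p_e = 0.0
for assign in itertools.product([0, 1], repeat=len(hidden)):
    idx = [0, 0, 0]
    for var, val in zip(hidden, assign):
        idx[var] = val
    for var, val in evidence.items():
        idx[var] = val
    p_e += joint[tuple(idx)]

print(p_e, joint[:, :, 1].sum())  # same number, computed two ways
```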
Query Type 2: Conditional Probability
Often we are interested in the conditional probability distribution of a variable given the evidence:
$P(X \mid e) = \frac{P(X, e)}{P(e)} = \frac{P(X, e)}{\sum_x P(X = x, e)}$
This is also called the a posteriori belief in $X$ given evidence $e$.
We usually query a subset $Y$ of all variables $\mathcal{X} = \{Y, Z, E\}$ and "don't care" about the remaining variables $Z$:
$P(Y \mid e) = \sum_z P(Y, Z = z \mid e)$
This takes all possible configurations of $Z$ into account; the process of summing out the unwanted variables $Z$ is called marginalization.
Query Type 2: Conditional Probability Example
[Figure: two networks; in each, the evidence set $E$ is summed over and we are interested in the conditionals of the query variables]

Application of a posteriori Belief
Prediction: what is the probability of an outcome given the starting condition?
The query node is a descendant of the evidence.
Diagnosis: what is the probability of a disease/fault given symptoms?
The query node is an ancestor of the evidence.
Learning under partial observations (fill in the unobserved).
Information can flow in either direction: inference can combine evidence from all parts of the network.
[Figure: chains $A \to B \to C$, with evidence entering at either end]
Query Type 3: Most Probable Assignment
We want to find the most probable joint assignment for some variables of interest.
Such reasoning is usually performed under some given evidence $e$, ignoring the values of the other variables $Z$:
$\mathrm{MAP}(Y \mid e) = \operatorname{argmax}_y P(y \mid e) = \operatorname{argmax}_y \sum_z P(y, Z = z \mid e)$
This is also called the maximum a posteriori (MAP) assignment for $Y$.
[Figure: the network with the evidence set $E$ summed over and the query variables whose most probable values we want]
Application of MAP assignment
Classification: find the most likely label, given the evidence.
Explanation: what is the most likely scenario, given the evidence?
Cautionary note: the MAP assignment of a variable depends on its context, i.e. the set of variables being jointly queried.
Example:
MAP of $(X, Y)$? $(0, 0)$
MAP of $X$? $1$

X  Y  P(X,Y)        X  P(X)
0  0  0.35          0  0.4
0  1  0.05          1  0.6
1  0  0.3
1  1  0.3
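The point is easy to check numerically. A short Python sketch using exactly the table above:

```python
import numpy as np

# The table from the slide: P_XY[x, y] = P(X = x, Y = y).
P_XY = np.array([[0.35, 0.05],
                 [0.30, 0.30]])

joint_map = np.unravel_index(P_XY.argmax(), P_XY.shape)
P_X = P_XY.sum(axis=1)            # marginal P(X) = [0.4, 0.6]

print(joint_map)                  # (0, 0): the most probable joint assignment
print(P_X.argmax())               # 1:      the most probable value of X alone
```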
Complexity of Inference
Computing the a posteriori belief $P(X \mid e)$ in a GM is NP-hard in general.
Hardness implies we cannot find a general procedure that works efficiently for arbitrary GMs.
For particular families of GMs, we have provably efficient procedures (e.g. trees).
For some families of GMs, we need to design efficient approximate inference algorithms (e.g. grids).
Approaches to inference
Exact inference algorithms
Variable elimination algorithm
Message-passing algorithm (sum-product, belief propagation algorithm)
The junction tree algorithm
Approximate inference algorithms
Sampling methods/Stochastic simulation
Variational algorithms
Marginalization and Elimination
A metabolic pathway: what is the likelihood that protein $E$ is produced?
Query: $P(E)$
$P(E) = \sum_d \sum_c \sum_b \sum_a P(a, b, c, d, E)$
Using the graphical model, we get
$P(E) = \sum_d \sum_c \sum_b \sum_a P(a)\,P(b|a)\,P(c|b)\,P(d|c)\,P(E|d)$
Naïve summation needs to enumerate over an exponential number of terms.
[Figure: chain $A \to B \to C \to D \to E$]
Elimination in Chains
Rearranging the terms and the summations:
$P(E) = \sum_d \sum_c \sum_b \sum_a P(a)\,P(b|a)\,P(c|b)\,P(d|c)\,P(E|d)$
$= \sum_d \sum_c \sum_b P(c|b)\,P(d|c)\,P(E|d) \sum_a P(a)\,P(b|a)$
[Figure: chain $A \to B \to C \to D \to E$]
Elimination in Chains (cont.)
Now we can perform the innermost summation efficiently:
$P(E) = \sum_d \sum_c \sum_b P(c|b)\,P(d|c)\,P(E|d) \sum_a P(a)\,P(b|a)$
$= \sum_d \sum_c \sum_b P(c|b)\,P(d|c)\,P(E|d)\; p(b)$
The innermost summation eliminates one variable from our summation at a local cost.
It is equivalent to a matrix-vector multiplication, with cost $|\mathrm{Val}(A)| \times |\mathrm{Val}(B)|$.
[Figure: chain $A \to B \to C \to D \to E$ with the new factor $p(b)$ attached at $B$]
Elimination in Chains (cont.)
Rearranging and then summing again, we get
$P(E) = \sum_d \sum_c \sum_b P(c|b)\,P(d|c)\,P(E|d)\; p(b)$
$= \sum_d \sum_c P(d|c)\,P(E|d) \sum_b P(c|b)\, p(b)$
$= \sum_d \sum_c P(d|c)\,P(E|d)\; p(c)$
Again equivalent to a matrix-vector multiplication, with cost $|\mathrm{Val}(B)| \times |\mathrm{Val}(C)|$. For example, with

P(C|B):      B=0    B=1          p(B):   B=0: 0.25
  C=0        0.15   0.35                 B=1: 0.75
  C=1        0.85   0.65

we get $p(C=0) = 0.15 \cdot 0.25 + 0.35 \cdot 0.75 = 0.3$ and $p(C=1) = 0.85 \cdot 0.25 + 0.65 \cdot 0.75 = 0.7$.
[Figure: chain with the factors $p(b)$, $p(c)$]
Elimination in Chains (cont.)
Eliminate nodes one by one all the way to the end:
$P(E) = \sum_d P(E|d)\; p(d)$
Computational complexity for a chain of length $n$:
Each step $\Psi(x_i) = \sum_{x_{i-1}} P(x_i \mid x_{i-1})\, p(x_{i-1})$ costs $O(|\mathrm{Val}(X_{i-1})| \times |\mathrm{Val}(X_i)|) = O(k^2)$ operations, so $O(n k^2)$ overall.
Compare to naïve summation $\sum_{x_{n-1}} \cdots \sum_{x_1} P(x_1, \dots, x_n)$: $O(k^n)$.
[Figure: chain with the factors $p(c)$, $p(d)$]
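As a concrete sketch, the whole chain elimination is just repeated matrix-vector products. The $p(B)$ and $P(C \mid B)$ values below are the ones from the table above; the CPTs $P(D \mid C)$ and $P(E \mid D)$ are hypothetical placeholders added only so the chain runs end to end:

```python
import numpy as np

# p(B) and the CPT P(C|B), with P_CgB[c, b] = P(C = c | B = b).
p_B   = np.array([0.25, 0.75])
P_CgB = np.array([[0.15, 0.35],
                  [0.85, 0.65]])

# Eliminating B is one matrix-vector product: p(c) = sum_b P(c|b) p(b)
p_C = P_CgB @ p_B
print(p_C)                        # [0.3 0.7]; cost O(|Val(B)| * |Val(C)|)

# Eliminate the remaining nodes the same way, one matvec per step:
P_DgC = np.array([[0.6, 0.2],     # hypothetical P(D|C)
                  [0.4, 0.8]])
P_EgD = np.array([[0.9, 0.3],     # hypothetical P(E|D)
                  [0.1, 0.7]])
p_E = P_EgD @ (P_DgC @ p_C)       # O(n k^2) in total, versus O(k^n) naively
print(p_E, p_E.sum())             # a valid distribution; sums to 1
```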
Undirected Chains
Rearrange terms and perform local summations:
$P(E) = \frac{1}{Z} \sum_d \sum_c \sum_b \sum_a \Psi(a, b)\,\Psi(b, c)\,\Psi(c, d)\,\Psi(E, d)$
$= \frac{1}{Z} \sum_d \sum_c \sum_b \Psi(b, c)\,\Psi(c, d)\,\Psi(E, d) \sum_a \Psi(a, b)$
$= \frac{1}{Z} \sum_d \sum_c \sum_b \Psi(b, c)\,\Psi(c, d)\,\Psi(E, d)\; \psi(b)$
[Figure: undirected chain $A - B - C - D - E$]
The Sum-Product Operation
During inference, we try to compute an expression of the sum-product form:
$\sum_{\mathbf{z}} \prod_{\Psi \in \mathcal{F}} \Psi$
$\mathbf{X} = \{X_1, \dots, X_n\}$: the set of variables
$\mathcal{F}$: a set of factors such that for each $\Psi \in \mathcal{F}$, $\mathrm{Scope}[\Psi] \subseteq \mathbf{X}$
$\mathbf{Y} \subseteq \mathbf{X}$: a set of query variables
$\mathbf{Z} = \mathbf{X} \setminus \mathbf{Y}$: the variables to eliminate
The result of eliminating the variables in $\mathbf{Z}$ is a factor
$\tau(\mathbf{Y}) = \sum_{\mathbf{z}} \prod_{\Psi \in \mathcal{F}} \Psi$
This factor does not necessarily correspond to any probability or conditional probability in the network; the conditional is obtained by renormalizing:
$P(\mathbf{Y} \mid e) = \frac{\tau(\mathbf{Y})}{\sum_{\mathbf{y}} \tau(\mathbf{y})}$
Inference via Variable Elimination
General idea:
Write the query in the form
$P(X_1, e) = \sum_{x_n} \cdots \sum_{x_3} \sum_{x_2} \prod_i P(x_i \mid \mathrm{pa}_i)$
The sum is ordered to suggest an elimination order. Then iteratively:
Move all irrelevant terms outside of the innermost sum
Perform the innermost sum, getting a new term
Insert the new term into the product
Finally, renormalize:
$P(X_1 \mid e) = \frac{P(X_1, e)}{\sum_{x_1} P(x_1, e)}$
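A minimal sketch of this procedure over table factors, assuming binary variables and a small hypothetical chain $A \to B \to C$ (the CPT numbers are made up). The trick used here is to give every variable a global array axis, so the factor product is plain NumPy broadcasting and elimination is a sum over one axis:

```python
import numpy as np

def make_factor(table, axes, n):
    """Embed `table` (axis k of `table` <-> global axis axes[k]) into n axes."""
    t = np.asarray(table, dtype=float)
    full = t.reshape(t.shape + (1,) * (n - t.ndim))
    order = list(axes) + [i for i in range(n) if i not in axes]
    return np.transpose(full, np.argsort(order))

def eliminate(factors, axis):
    """Multiply the factors whose scope contains `axis`, then sum it out.
    Assumes at least one factor touches the axis."""
    touch = [f for f in factors if f.shape[axis] > 1]
    rest = [f for f in factors if f.shape[axis] == 1]
    prod = touch[0]
    for f in touch[1:]:
        prod = prod * f
    return rest + [prod.sum(axis=axis, keepdims=True)]

# Hypothetical chain A -> B -> C (variables get global axes 0, 1, 2).
n = 3
factors = [
    make_factor([0.25, 0.75], [0], n),                     # p(A)
    make_factor([[0.9, 0.2], [0.1, 0.8]], [1, 0], n),      # P(B|A)[b, a]
    make_factor([[0.15, 0.35], [0.85, 0.65]], [2, 1], n),  # P(C|B)[c, b]
]
for ax in (0, 1):                 # eliminate A, then B
    factors = eliminate(factors, ax)

result = factors[0]
for f in factors[1:]:
    result = result * f
print(result.squeeze())           # marginal P(C); already normalized here
```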
A more complex network
A food web: what is the probability $P(A \mid H)$ that hawks are leaving given that the grass condition is poor?
[Figure: the food-web DAG over $A, B, C, D, E, F, G, H$]
Example: Variable Elimination
Query: $P(A \mid h)$; need to eliminate $B, C, D, E, F, G, H$.
Initial factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,P(g|e)\,P(h|e,f)$
Choose an elimination order: $H, G, F, E, D, C, B$
Step 1: Eliminate $H$ by conditioning (fix the evidence node on its observed value):
$m_h(e, f) = P(H = h \mid e, f)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,P(g|e)\,m_h(e, f)$
[Figure: the food-web graph; $H$ is removed after conditioning]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C, D, E, F, G$.
Current factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,P(g|e)\,m_h(e, f)$
Step 2: Eliminate $G$. Compute $m_g(e) = \sum_g P(g \mid e) = 1$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,m_g(e)\,m_h(e, f)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,m_h(e, f)$ (the constant factor $m_g(e) = 1$ drops out)
[Figure: $G$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C, D, E, F$.
Current factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,P(f|a)\,m_h(e, f)$
Step 3: Eliminate $F$. Compute $m_f(e, a) = \sum_f P(f \mid a)\, m_h(e, f)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,m_f(e, a)$
[Figure: $F$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C, D, E$.
Current factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,P(e|c,d)\,m_f(e, a)$
Step 4: Eliminate $E$. Compute $m_e(a, c, d) = \sum_e P(e \mid c, d)\, m_f(e, a)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,P(d|a)\,m_e(a, c, d)$
[Figure: $E$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C, D$.
Current factors: $P(a)\,P(b)\,P(c|b)\,P(d|a)\,m_e(a, c, d)$
Step 5: Eliminate $D$. Compute $m_d(a, c) = \sum_d P(d \mid a)\, m_e(a, c, d)$
$\Rightarrow P(a)\,P(b)\,P(c|b)\,m_d(a, c)$
[Figure: $D$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B, C$.
Current factors: $P(a)\,P(b)\,P(c|b)\,m_d(a, c)$
Step 6: Eliminate $C$. Compute $m_c(a, b) = \sum_c P(c \mid b)\, m_d(a, c)$
$\Rightarrow P(a)\,P(b)\,m_c(a, b)$
[Figure: $C$ is removed from the graph]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; still need to eliminate $B$.
Current factors: $P(a)\,P(b)\,m_c(a, b)$
Step 7: Eliminate $B$. Compute $m_b(a) = \sum_b P(b)\, m_c(a, b)$
$\Rightarrow P(a)\, m_b(a)$
[Figure: only $A$ remains]
Example: Variable Elimination (cont.)
Query: $P(A \mid h)$; finally, renormalize over $A$.
Current factors: $P(a)\, m_b(a)$
Final step: renormalize. $P(a, h) = P(a)\, m_b(a)$; compute $P(h) = \sum_a P(a)\, m_b(a)$
$\Rightarrow P(a \mid h) = \dfrac{P(a)\, m_b(a)}{\sum_a P(a)\, m_b(a)}$
Complexity of variable elimination
Suppose in one elimination step we compute
$m_x(y_1, \dots, y_k) = \sum_x m'_x(x, y_1, \dots, y_k)$
$m'_x(x, y_1, \dots, y_k) = \prod_{i=1}^{k} m_i(x, \mathbf{y}_{c_i})$
This requires
$k \cdot |\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(\mathbf{Y}_{c_i})|$ multiplications: for each value of $x, y_1, \dots, y_k$, we do $k$ multiplications
$|\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(\mathbf{Y}_{c_i})|$ additions: for each value of $y_1, \dots, y_k$, we do $|\mathrm{Val}(X)|$ additions
The complexity is exponential in the number of variables in the intermediate factor.
[Figure: star graph with $x$ connected to $y_1, \dots, y_k$]
Inference in Graphical Models
General form of the inference problem:
$P(X_1, \dots, X_n) \propto \prod_c \Psi(D_c)$
We want to query a set of variables $Y$ given evidence $e$, and "don't care" about a set of variables $Z$.
Compute $P(Y, e) = \sum_z \prod_c \Psi(D_c)$ using variable elimination.
Renormalize to obtain the conditional: $P(Y \mid e) = \frac{P(Y, e)}{\sum_y P(y, e)}$
Two examples that use the graph structure to order the computation:
Chain: $A - B - C - D - E$
DAG: the food web over $A, \dots, H$
From Variable Elimination to Message Passing
Recall that the dependency induced during marginalization is captured in elimination cliques:
summation ↔ elimination; intermediate terms ↔ elimination cliques.
Can this lead to a generic inference algorithm?
There is nice localization in the computation:
$P(E) = \sum_d \sum_c \sum_b \sum_a P(a)\,P(b|a)\,P(c|b)\,P(d|c)\,P(E|d)$
$P(E) = \sum_d P(E|d) \sum_c P(d|c) \sum_b P(c|b) \sum_a P(a)\,P(b|a)$
Chain: Query E
Messages are passed from $A$ toward the query node $E$:
$m_{AB}(b) = \sum_a P(a)\,P(b|a)$, $m_{BC}(c) = \sum_b P(c|b)\,m_{AB}(b)$, $m_{CD}(d) = \sum_c P(d|c)\,m_{BC}(c)$,
$P(E) = m_{DE}(E) = \sum_d P(E|d)\,m_{CD}(d)$
[Figure: chain $A \to B \to C \to D \to E$ with messages $m_{AB}, m_{BC}, m_{CD}, m_{DE}$ on the edges]
Chain: Query C
Start elimination away from the query variable:
$P(C) = \sum_a \sum_b \sum_d \sum_e P(a)\,P(b|a)\,P(C|b)\,P(d|C)\,P(e|d)$
$= \Big(\sum_b P(C|b) \sum_a P(a)\,P(b|a)\Big)\Big(\sum_d P(d|C) \sum_e P(e|d)\Big)$
Messages flow from both ends toward $C$: $m_{AB}(b)$, $m_{BC}(C)$, $m_{ED}(d)$, $m_{DC}(C)$, and
$P(C) = m_{BC}(C)\, m_{DC}(C)$
[Figure: chain with messages $m_{AB}, m_{BC}$ arriving from the left and $m_{ED}, m_{DC}$ from the right]
Chain: What if I want to query everybody?
Query $P(A), P(B), P(C), P(D), P(E)$, e.g.
$P(B) = \Big(\sum_a P(a)\,P(B|a)\Big)\Big(\sum_c P(c|B) \sum_d P(d|c) \sum_e P(e|d)\Big)$
Computational cost:
Each message costs $O(K^2)$
Chain length is $L$
Cost for each query is about $O(L K^2)$
For $L$ queries, the cost is about $O(L^2 K^2)$
[Figure: chain with messages $m_{AB}(B), m_{CB}(B), m_{DC}(c), m_{ED}(d)$]
What is shared in these queries?
$P(B) = \Big(\sum_a P(a)\,P(B|a)\Big)\Big(\sum_c P(c|B) \sum_d P(d|c) \sum_e P(e|d)\Big)$, using $m_{AB}(B), m_{CB}(B), m_{DC}(c), m_{ED}(d)$
$P(E) = \sum_d P(E|d) \sum_c P(d|c) \sum_b P(c|b) \sum_a P(a)\,P(b|a)$, using $m_{AB}(b), m_{BC}(c), m_{CD}(d), m_{DE}(E)$
$P(C) = \Big(\sum_b P(C|b) \sum_a P(a)\,P(b|a)\Big)\Big(\sum_d P(d|C) \sum_e P(e|d)\Big)$, using $m_{AB}(b), m_{BC}(C), m_{DC}(C), m_{ED}(d)$
Each edge carries one message in each direction: the number of unique messages is $2(L - 1)$.
Forward-backward algorithm
Compute and cache the $2(L - 1)$ unique messages.
Forward pass: $m_{AB}(b),\ m_{BC}(c),\ m_{CD}(d),\ m_{DE}(e)$
Backward pass: $m_{BA}(a),\ m_{CB}(b),\ m_{DC}(c),\ m_{ED}(d)$
At query time, just multiply together the messages from the neighbors, e.g. $P(D) = m_{CD}(D)\, m_{ED}(D)$
For all queries, the cost is $O(2 L K^2)$.
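A minimal NumPy sketch of the forward-backward scheme on an undirected chain with hypothetical random pairwise potentials: two sweeps compute all messages, after which every marginal is a cheap local product.

```python
import numpy as np

# Chain MRF with L binary nodes; Psi[i][x_i, x_{i+1}] is the (hypothetical)
# pairwise potential between node i and node i+1.
L = 5
rng = np.random.default_rng(0)
Psi = [rng.random((2, 2)) for _ in range(L - 1)]

# Forward messages: fwd[i] = message from node i-1 into node i
fwd = [np.ones(2) for _ in range(L)]
for i in range(1, L):
    fwd[i] = fwd[i - 1] @ Psi[i - 1]      # sum over x_{i-1}

# Backward messages: bwd[i] = message from node i+1 into node i
bwd = [np.ones(2) for _ in range(L)]
for i in range(L - 2, -1, -1):
    bwd[i] = Psi[i] @ bwd[i + 1]          # sum over x_{i+1}

# Any marginal is now one elementwise product plus normalization.
for i in range(L):
    belief = fwd[i] * bwd[i]
    print(i, belief / belief.sum())
```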
DAG: Variable elimination
Elimination order $H, G, F, E, B, C, D$ (with $h$ fixed to its observed value):
$P(A) = P(A) \sum_d P(d|A) \sum_c \Big(\sum_b P(b)\,P(c|b)\Big)\Big(\sum_e P(e|c,d)\,\big(\sum_g P(g|e)\big)\big(\sum_f P(f|A)\,P(h|e,f)\big)\Big)$
Intermediate messages: $m_h(e, f)$, $m_g(e)$, $m_f(A, e)$, $m_e(A, c, d)$, $m_b(c)$, $m_c(A, d)$, $m_d(A)$
4-way tables are created: eliminating $E$ involves the four variables $A, C, D, E$!
DAG: Cliques of size 4 are generated
[Figure: the food-web graph after each elimination step, showing the messages $m_h(e, f)$, $m_g(e)$, $m_f(A, e)$, $m_e(A, c, d)$, $m_b(c)$, $m_c(A, d)$, $m_d(A)$ in turn; a 4-way table is created when $E$ is eliminated]
DAG: A different elimination order
Elimination order $G, H, F, B, C, D, E$:
$P(A) = P(A) \sum_e \Big(\sum_g P(g|e)\Big)\Big(\sum_f P(f|A)\,P(h|e,f)\Big)\Big(\sum_d P(d|A) \sum_c P(e|c,d) \sum_b P(b)\,P(c|b)\Big)$
Intermediate messages: $m_g(e)$, $m_h(e, f)$, $m_f(A, e)$, $m_b(c)$, $m_c(d, e)$, $m_d(A, e)$, $m_e(A)$
NO 4-way tables are created!
DAG: No cliques of size 4
[Figure: the food-web graph after each elimination step, showing the messages $m_g(e)$, $m_h(e, f)$, $m_f(A, e)$, $m_b(c)$, $m_c(d, e)$, $m_d(A, e)$, $m_e(A)$ in turn; no intermediate table involves more than three variables]
Any thoughts?
Chains have nice properties:
The forward-backward algorithm works
Intermediate results (messages) live along the edges
Can we generalize to other graphs (trees, loopy graphs)?
How about undirected trees? Is there a forward-backward algorithm?
Loopy graphs are more complicated: different elimination orders result in different computational costs
Can we somehow make loopy graphs behave like trees?
Tree Graphical Models
Undirected tree: there is a unique path between any pair of nodes.
Directed tree: all nodes except the root have exactly one parent.

Equivalence of directed and undirected trees
Any undirected tree can be converted to a directed tree by choosing a root node and directing all edges away from it.
A directed tree and the corresponding undirected tree make the same conditional independence assertions.
The parameterizations are essentially the same:
Undirected tree: $P(x) = \frac{1}{Z} \prod_{i \in V} \Psi(x_i) \prod_{(i,j) \in E} \Psi(x_i, x_j)$
Directed tree: $P(x) = P(x_r) \prod_{(i,j) \in E} P(x_j \mid x_i)$
Equivalence: $\Psi(x_r) = P(x_r)$, $\Psi(x_i, x_j) = P(x_j \mid x_i)$, $Z = 1$, and $\Psi(x_i) = 1$ for $i \neq r$.
Message passing on trees
Messages are passed along the tree edges. For the tree with edges $(e,f), (e,g), (g,h), (g,j)$:
$P(x_e, x_f, x_g, x_h, x_j) \propto \Psi(x_e)\,\Psi(x_f)\,\Psi(x_g)\,\Psi(x_h)\,\Psi(x_j)\,\Psi(x_e, x_f)\,\Psi(x_e, x_g)\,\Psi(x_g, x_h)\,\Psi(x_g, x_j)$
$P(x_f) \propto \Psi(x_f) \sum_{x_e}\Big(\Psi(x_e)\,\Psi(x_e, x_f) \sum_{x_g}\Big(\Psi(x_g)\,\Psi(x_g, x_e)\,\big(\sum_{x_h}\Psi(x_h)\,\Psi(x_h, x_g)\big)\big(\sum_{x_j}\Psi(x_j)\,\Psi(x_j, x_g)\big)\Big)\Big)$
[Figure: the tree with messages $m_{hg}(x_g)$, $m_{jg}(x_g)$, $m_{ge}(x_e)$, $m_{ef}(x_f)$ flowing toward the query node $f$]
Sharing messages on trees
Query $f$: uses messages $m_{hg}(x_g)$, $m_{jg}(x_g)$, $m_{ge}(x_e)$, $m_{ef}(x_f)$
Query $j$: uses messages $m_{hg}(x_g)$, $m_{fe}(x_e)$, $m_{eg}(x_g)$, $m_{gj}(x_j)$
[Figure: the same tree twice, with message arrows pointing toward $f$ in one panel and toward $j$ in the other]
Computational cost for all queries
Query $P(x_e), P(x_f), P(x_g), P(x_h), P(x_j)$
Doing things separately:
Each message costs $O(K^2)$
The number of edges is $L$
Cost for each query is about $O(L K^2)$
For $L$ queries, the cost is about $O(L^2 K^2)$
[Figure: the tree with messages toward a single query node]
Forward-backward algorithm in trees
Forward: pick one leaf as the root, compute all messages toward it, and cache them.
Backward: pick another root, compute all messages, and cache them; messages computed in the forward pass are reused.
E.g., query $j$: the message $m_{hg}(x_g)$ computed earlier is reused.
[Figure: the tree swept twice, with previously computed messages marked for reuse]
Computational saving for trees
Compute the forward and backward messages for each edge and save them:
Each message costs $O(K^2)$
The number of edges is $L$
There are $2L$ unique messages
The cost for all queries is about $O(2 L K^2)$
[Figure: the tree with both message directions on every edge]
Message passing algorithm
$m_{ji}(x_i) = \sum_{x_j} \Psi(x_i, x_j)\,\Psi(x_j) \prod_{k \in N(j) \setminus i} m_{kj}(x_j)$
Take the product of the incoming messages from $N(j) \setminus i$, multiply by the local potentials, and sum out $x_j$. Node $x_j$ can send its message to $x_i$ once the incoming messages from all of $N(j) \setminus i$ have arrived.
[Figure: node $j$ with neighborhood $N(j) \setminus i$ and outgoing message $m_{ji}(x_i)$]
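A minimal sketch of this update on the 5-node tree used above (edges $e\!-\!f$, $e\!-\!g$, $g\!-\!h$, $g\!-\!j$), with binary nodes and hypothetical random potentials. On a tree, firing each directed edge once its prerequisite messages are ready computes all $2L$ messages:

```python
import numpy as np

nodes = ["e", "f", "g", "h", "j"]
edges = [("e", "f"), ("e", "g"), ("g", "h"), ("g", "j")]
rng = np.random.default_rng(1)
node = {v: rng.random(2) for v in nodes}          # local potentials Psi(x_v)
pair = {e: rng.random((2, 2)) for e in edges}     # edge potentials Psi(x_a, x_b)

nbrs = {v: [] for v in nodes}
for a, b in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)

def psi(a, b):
    """Edge potential as a matrix indexed [x_a, x_b]."""
    return pair[(a, b)] if (a, b) in pair else pair[(b, a)].T

def message(j, i, msgs):
    """m_{j->i}(x_i) = sum_{x_j} Psi(x_i,x_j) Psi(x_j) prod_{k in N(j)\\i} m_{k->j}(x_j)."""
    prod = node[j].copy()
    for k in nbrs[j]:
        if k != i:
            prod *= msgs[(k, j)]
    return psi(j, i).T @ prod

# Fire each directed edge as soon as its incoming messages exist;
# on a tree this terminates with all 2L messages computed.
msgs = {}
while len(msgs) < 2 * len(edges):
    for a, b in edges:
        for j, i in ((a, b), (b, a)):
            if (j, i) not in msgs and all((k, j) in msgs for k in nbrs[j] if k != i):
                msgs[(j, i)] = message(j, i, msgs)

for v in nodes:  # belief at v: local potential times all incoming messages
    belief = node[v].copy()
    for k in nbrs[v]:
        belief *= msgs[(k, v)]
    print(v, belief / belief.sum())
```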
From Variable Elimination to Message Passing
Recall the Variable Elimination Algorithm:
Choose an ordering in which the query node $f$ is the final node
Eliminate node $i$ by removing all potentials containing $i$, taking the sum/product over $x_i$
Place the resultant factor back
For a tree graphical model:
Choose the query node $f$ as the root of the tree
View the tree as a directed tree with edges pointing towards $f$
Elimination of each node can be considered as message passing directly along the tree branches, rather than on some transformed graph
Thus, we can use the tree itself as a data structure for inference.
How about general graphs?
Trees are nice:
We can just compute two messages for each edge
Computation is ordered along the graph
Intermediate results are associated with edges
General graphs are not so clear:
Different elimination orders generate different cliques and factor sizes
Computation and intermediate results are not associated with edges
The local-computation view is not so clear
Can we make them tree-like, or treat them as trees?
[Figures: the tree with its messages; the food-web DAG]
Message passing for loopy graphs
Local message passing for trees guarantees the consistency of the local marginals:
the computed $P(x_i)$ is the correct one
the computed $P(x_i, x_j)$ is the correct one
...
For loopy graphs, there are no consistency guarantees for local message passing.
[Figure: the tree with its messages, for contrast]
Loopy belief propagation
Inference for loopy graphical models is NP-hard in general.
Treat loopy graphs locally as if they were trees, and iteratively estimate the marginals:
Read in messages
Process messages
Send updated outgoing messages
Repeat for all variables until convergence
[Figure: a loopy graph around node A]
Message update schedule
Synchronous update: $x_i$ can send its message once the incoming messages from all of $N(i) \setminus j$ have arrived.
Slow; provably correct for trees, and may converge for loopy graphs.
Asynchronous update: $x_i$ can send its message whenever any incoming message from $N(i) \setminus j$ changes.
Fast; convergence is not easy to prove, but empirically it often works.
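A minimal sketch of the synchronous schedule on a 3-node cycle with hypothetical random potentials: all messages are updated in lock-step from the previous iteration's messages until they stop changing. Convergence is not guaranteed on loopy graphs, hence the iteration cap.

```python
import numpy as np

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]                  # a loop
rng = np.random.default_rng(2)
node = {v: rng.random(2) for v in nodes}
pair = {e: rng.random((2, 2)) for e in edges}

nbrs = {v: [u for e in edges for u in e if v in e and u != v] for v in nodes}

def psi(a, b):
    return pair[(a, b)] if (a, b) in pair else pair[(b, a)].T

# One message per directed edge, initialized uniformly.
msgs = {(j, i): np.ones(2)
        for a, b in edges for (j, i) in ((a, b), (b, a))}

for it in range(100):                             # synchronous sweeps
    new = {}
    for (j, i) in msgs:
        prod = node[j].copy()
        for k in nbrs[j]:
            if k != i:
                prod *= msgs[(k, j)]
        m = psi(j, i).T @ prod
        new[(j, i)] = m / m.sum()                 # normalize for stability
    delta = max(np.abs(new[d] - msgs[d]).max() for d in msgs)
    msgs = new
    if delta < 1e-8:
        break

for v in nodes:                                   # approximate (pseudo-)marginals
    belief = node[v].copy()
    for k in nbrs[v]:
        belief *= msgs[(k, v)]
    print(v, belief / belief.sum())
```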