Variational Methods for Graphical Models

Transcript of "Variational Methods for Graphical Models"
Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, Lawrence K. Saul
Presented by: Afsaneh Shirazi
Outline
• Motivation
• Inference in graphical models
• Exact inference is intractable
• Variational methodology
  – Sequential approach
  – Block approach
• Conclusions
Motivation (Example: Medical Diagnosis)
[Figure: bipartite network with diseases on top and symptoms below.]
What is the most probable disease?
Motivation
• We want to answer queries about our data
• A graphical model is a way to model data
• Inference in some graphical models is intractable (NP-hard)
• Variational methods simplify inference in graphical models by using approximations
Graphical Models
• Directed (Bayesian network)
• Undirected
[Figure: a directed graph over S1, ..., S5 with factors P(S1), P(S2), P(S3|S1,S2), P(S4|S3), P(S5|S3,S4), and an undirected graph with cliques C1, C2, C3.]
Inference in Graphical Models
Inference: given a graphical model, the process of computing answers to queries.
• How computationally hard is this decision problem?
• Theorem: computing P(X = x) in a Bayesian network is NP-hard.
Why Is Exact Inference Intractable?
[Figure: bipartite QMR-style network, diseases on top and symptoms below.]
Diagnose the most probable disease.
Why Is Exact Inference Intractable?
[Figure: diseases d and symptoms f; shaded nodes are observed symptoms.]

P(f, d) = P(f | d) P(d)
Why Is Exact Inference Intractable?
Noisy-OR model for P(f_i | d).
[Figure: finding f_i with disease parents taking values d = (1, 0, 1).]
Why Is Exact Inference Intractable?
Noisy-OR model: P(f_i = 0 | d = (1, 0, 1)).
[Figure: the same network with the parent configuration d = (1, 0, 1).]
Why Is Exact Inference Intractable?

P(f_i = 0 | d) = (1 - q_{i0}) \prod_{j \in pa(i)} (1 - q_{ij})^{d_j} = e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}

P(f_i = 1 | d) = 1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}

where \theta_{ij} = -\ln(1 - q_{ij}).
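A minimal sketch of the noisy-OR computation above. The helper names and the parameter values are illustrative, not from the slides; the test simply checks that the exponential form agrees with the product form (1 - q_{i0}) \prod_j (1 - q_{ij})^{d_j}.

```python
import math

def noisy_or_p_f0(d, theta0, theta):
    """P(f_i = 0 | d) in exponential form: exp(-theta0 - sum_j theta_j * d_j)."""
    return math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, d)))

def noisy_or_p_f1(d, theta0, theta):
    """P(f_i = 1 | d) = 1 - P(f_i = 0 | d)."""
    return 1.0 - noisy_or_p_f0(d, theta0, theta)

# Three disease parents with configuration d = (1, 0, 1); theta_j = -ln(1 - q_j)
# for inhibition parameters q_j (illustrative values), plus a leak term q_0.
q = [0.4, 0.7, 0.2]
theta = [-math.log(1 - qj) for qj in q]
theta0 = -math.log(1 - 0.05)          # leak term
p0 = noisy_or_p_f0((1, 0, 1), theta0, theta)
# Equivalent product form: (1 - q_0) * prod_j (1 - q_j)^{d_j}
p0_prod = (1 - 0.05) * (1 - 0.4) * (1 - 0.2)
```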
Why Is Exact Inference Intractable?
Observed symptoms (findings) f:

P(f, d) = P(f | d) P(d) = [\prod_i P(f_i | d)] [\prod_j P(d_j)]

For negative findings the exponentials factorize over the diseases:

e^{-\theta_{i0} - \sum_j \theta_{ij} d_j} = e^{-\theta_{i0}} \prod_j (e^{-\theta_{ij}})^{d_j}
Why Is Exact Inference Intractable?
Each positive finding contributes a factor (1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}), and the product of such factors,

(1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}) \cdots (1 - e^{-\theta_{k0} - \sum_j \theta_{kj} d_j}),

does not factorize: expanding it couples the diseases, so the cost of exact inference grows exponentially in the number of positive findings.
Reducing the Computational Complexity: Variational Methods
• Approximate the probability distribution
• Use the role of convexity
• Obtain a simple graph to which exact methods apply
Express a Function Variationally
• \ln(x) is a concave function:

\ln(x) = \min_{\lambda} \{\lambda x - H(\lambda)\}, where H(\lambda) = \min_{x} \{\lambda x - \ln(x)\}
Express a Function Variationally
• For \ln(x) the conjugate works out to H(\lambda) = \ln(\lambda) + 1, giving

\ln(x) = \min_{\lambda} \{\lambda x - \ln(\lambda) - 1\}
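A numeric sanity check of the variational representation above (a sketch, not from the slides): for any fixed \lambda the expression \lambda x - \ln(\lambda) - 1 upper-bounds \ln(x), and minimizing over \lambda recovers \ln(x) at \lambda = 1/x.

```python
import math

def ln_variational(x, lambdas):
    """Minimize (lambda * x - ln(lambda) - 1) over a grid of lambda values."""
    return min(lam * x - math.log(lam) - 1 for lam in lambdas)

x = 2.5
grid = [i / 10000 for i in range(1, 100001)]   # lambda in (0, 10]
approx = ln_variational(x, grid)               # grid contains the optimum 1/x = 0.4
exact = math.log(x)
# Any fixed lambda gives an upper bound on ln(x):
bound_at = 0.7 * x - math.log(0.7) - 1
```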
Express a Function Variationally
• If the function is neither convex nor concave, transform it to a desired form.
• Example: the logistic function

f(x) = 1 / (1 + e^{-x})

Transformation: g(x) = \ln(f(x)) is concave, so g(x) = \min_{\lambda} \{\lambda x - H(\lambda)\}.
Transforming back (approximation):

f(x) = \min_{\lambda} e^{\lambda x - H(\lambda)}

where H(\lambda) = -\lambda \ln \lambda - (1 - \lambda) \ln(1 - \lambda) is the binary entropy.
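As a numeric check of the bound above (a sketch, not from the slides): e^{\lambda x - H(\lambda)} dominates the logistic function for every \lambda in (0, 1), with equality at \lambda = \sigma(-x) = 1 - \sigma(x).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_entropy(lam):
    return -lam * math.log(lam) - (1 - lam) * math.log(1 - lam)

def logistic_upper_bound(x, lam):
    """Variational upper bound e^{lam*x - H(lam)} on the logistic function."""
    return math.exp(lam * x - binary_entropy(lam))

x = -1.3
for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    assert logistic_upper_bound(x, lam) >= sigmoid(x)
# The bound is tight at lam = sigmoid(-x):
tight = logistic_upper_bound(x, sigmoid(-x))
```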
Approaches to Variational Methods
• Sequential approach (on-line): nodes are transformed one at a time, in an order determined during the inference process.
• Block approach (off-line): used when the model has obvious substructures; the approximating family is designed in advance.
Sequential Approach (Two Methods)
1. Start from the untransformed graph and transform one node at a time, until the graph is simple enough for exact methods.
2. Start from the completely transformed graph and reintroduce one node at a time, stopping while the graph is still simple enough for exact methods.
Sequential Approach (Example)

P(f_i = 1 | d) = 1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j}

This function is log concave.
Sequential Approach (Example)
Since \ln f is concave for f(x) = 1 - e^{-x}, the conjugate bound gives f(x) \le e^{\lambda x - f^*(\lambda)}, so

P(f_i = 1 | d) \le e^{\lambda_i(\theta_{i0} + \sum_j \theta_{ij} d_j) - f_i^*(\lambda_i)} = e^{\lambda_i \theta_{i0} - f_i^*(\lambda_i)} \prod_j [e^{\lambda_i \theta_{ij}}]^{d_j}

The bound factorizes over the diseases d_j.
Sequential Approach (Example)
Transforming a finding node f_i with the bound

P(f_i = 1 | d) \le e^{\lambda_i \theta_{i0} - f_i^*(\lambda_i)} \prod_j [e^{\lambda_i \theta_{ij}}]^{d_j}

delinks f_i from its disease parents: each factor e^{\lambda_i \theta_{ij}} is absorbed into the corresponding disease node (e.g. it multiplies P(d_3 = 1), while P(d_3 = 0) is unchanged).
Sequential Approach (Example)
[Figure: the same transformation applied to a second finding node, using the same factorized upper bound.]
Sequential Approach (Example)
[Figure: after enough finding nodes are transformed, the remaining graph is simple enough for exact inference.]
Sequential Approach (Upper Bound and Lower Bound)
• We need both a lower bound and an upper bound:

P(d_j | f) = P(f, d_j) / (P(f, d_j) + P(f, \bar{d}_j))

P(d_j | f) \le UB(P(f, d_j)) / (UB(P(f, d_j)) + LB(P(f, \bar{d}_j)))
How to Compute a Lower Bound for a Concave Function?
• Lower bound for concave functions (Jensen's inequality):

f(\sum_j a_j z_j) = f(\sum_j q_j (a_j z_j / q_j)) \ge \sum_j q_j f(a_j z_j / q_j)

The variational parameter q is a probability distribution (q_j \ge 0, \sum_j q_j = 1).
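The inequality above can be checked numerically (a sketch with made-up values, not from the slides), here with the concave function f = ln; the bound is tight when q_j is proportional to a_j z_j.

```python
import math

def jensen_lower_bound(f, terms, q):
    """sum_j q_j * f(a_j z_j / q_j), a lower bound on f(sum_j a_j z_j) for concave f."""
    return sum(qj * f(t / qj) for t, qj in zip(terms, q))

terms = [0.5, 1.2, 0.3]        # values of a_j * z_j
q = [0.2, 0.5, 0.3]            # variational distribution, sums to 1
exact = math.log(sum(terms))
bound = jensen_lower_bound(math.log, terms, q)

# The bound is tight when q_j is proportional to the terms:
q_star = [t / sum(terms) for t in terms]
tight = jensen_lower_bound(math.log, terms, q_star)
```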
Block Approach (Overview)
• Off-line application of the sequential approach:
  – Identify a substructure amenable to exact inference
  – Define a family of probability distributions via the introduction of variational parameters
  – Choose the best approximation based on the evidence
Block Approach (Details)
• KL divergence:

D(Q || P) = \sum_{\{S\}} Q(S) \ln (Q(S) / P(S))

• From the family Q(H | E, \mu), choose Q(H | E, \mu^*) that minimizes the KL divergence to P(H | E).
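A minimal sketch of the KL divergence over a discrete state space (function name and distributions are illustrative): non-negative, zero only when Q = P, and asymmetric in its arguments.

```python
import math

def kl_divergence(q, p):
    """D(Q || P) = sum_s Q(s) * ln(Q(s) / P(s)) over a discrete state space."""
    return sum(qs * math.log(qs / ps) for qs, ps in zip(q, p) if qs > 0)

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
d_qp = kl_divergence(q, p)   # > 0 since Q != P
d_pp = kl_divergence(p, p)   # = 0
```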
Block Approach (Example: Boltzmann Machine)

P(S | \theta) = e^{\sum_{i<j} \theta_{ij} S_i S_j + \sum_i \theta_{i0} S_i} / Z

[Figure: nodes S_i and S_j connected by weight \theta_{ij}.]
Block Approach (Example: Boltzmann Machine)
Conditioning on evidence nodes (e.g. S_j = 1, \theta_{ij} \cdot 1) folds their contribution into the biases of the remaining nodes:

\theta_{i0}^c = \theta_{i0} + \sum_{j \in E} \theta_{ij} S_j
Block Approach (Example: Boltzmann Machine)

P(H | E, \theta) = e^{\sum_{i<j} \theta_{ij} S_i S_j + \sum_i \theta_{i0}^c S_i} / Z_c

Mean-field approximating family (fully factorized):

Q(H | E, \mu) = \prod_{i \in H} \mu_i^{S_i} (1 - \mu_i)^{1 - S_i}
Block Approach (Example: Boltzmann Machine)
Minimizing the KL divergence yields

\mu_i = \sigma(\sum_j \theta_{ij} \mu_j + \theta_{i0}), where \sigma(x) = 1 / (1 + e^{-x})
Block Approach (Example: Boltzmann Machine)
Mean field equations: solve for a fixed point of

\mu_i = \sigma(\sum_j \theta_{ij} \mu_j + \theta_{i0})

by iterating the updates until convergence.
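The fixed-point iteration above can be sketched for a tiny Boltzmann machine (the weights and biases below are illustrative, and coordinate-wise iteration to convergence is one simple choice of update schedule):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_field(theta, theta0, iters=200):
    """Iterate mu_i = sigmoid(sum_j theta[i][j] * mu_j + theta0[i]) to a fixed point."""
    n = len(theta0)
    mu = [0.5] * n
    for _ in range(iters):
        for i in range(n):
            mu[i] = sigmoid(sum(theta[i][j] * mu[j] for j in range(n)) + theta0[i])
    return mu

# Three hidden units; symmetric weights with zero self-connections.
theta = [[0.0, 0.8, -0.5],
         [0.8, 0.0, 0.3],
         [-0.5, 0.3, 0.0]]
theta0 = [0.1, -0.2, 0.4]
mu = mean_field(theta, theta0)
```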
Conclusions
• The time or space complexity of exact calculation can be unacceptable
• Complex graphs can be probabilistically simple
• Inference in simplified models provides bounds on probabilities in the original model
Extra Slides
Concerns
• How accurate is the approximation?
• Can strong dependencies be identified in advance?
• The block approach is not based on a convexity transformation
• There is no assurance that the framework will transfer to other examples
• It is not straightforward to develop a variational approximation for new architectures
Justification for KL Divergence
• Q gives the best lower bound on the probability of the evidence P(E):

\ln P(E) = \ln \sum_{\{H\}} P(H, E)
         = \ln \sum_{\{H\}} Q(H | E) \frac{P(H, E)}{Q(H | E)}
         \ge \sum_{\{H\}} Q(H | E) \ln \frac{P(H, E)}{Q(H | E)}

Maximizing this bound over Q is equivalent to minimizing D(Q(H | E) || P(H | E)).
EM
• Maximum likelihood parameter estimation: maximize P(E | \theta).
• The following function is a lower bound on the log likelihood:

L(Q, \theta) = \sum_{\{H\}} Q(H | E) \ln P(H, E | \theta) - \sum_{\{H\}} Q(H | E) \ln Q(H | E)

\ln P(E | \theta) = L(Q, \theta) + D(Q(H | E) || P(H | E, \theta))
EM
1. (E step) Maximize the bound with respect to Q:  Q^{(k+1)} = \arg\max_Q L(Q, \theta^{(k)})
2. (M step) Fix Q, maximize with respect to \theta:  \theta^{(k+1)} = \arg\max_\theta L(Q^{(k+1)}, \theta)

In traditional EM the E step sets Q = P(H | E, \theta^{(k)}); restricting Q to a tractable family gives an approximation to the EM algorithm.
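The E/M alternation above can be sketched on the classic two-coins problem (the session counts are made-up data, and fixing the mixing weights at 1/2 is a simplification for brevity): the hidden variable is which coin produced each session, and the E step computes its posterior under the current biases.

```python
def em_two_coins(counts, theta=(0.6, 0.5), iters=50):
    """EM for two coins with a hidden per-session coin choice.
    counts: list of (heads, tails) per session; mixing weights fixed at 1/2."""
    a, b = theta
    for _ in range(iters):
        # E step: responsibility of coin A for each session, and expected counts.
        exp_a = [0.0, 0.0]
        exp_b = [0.0, 0.0]
        for h, t in counts:
            la = (a ** h) * ((1 - a) ** t)
            lb = (b ** h) * ((1 - b) ** t)
            ra = la / (la + lb)
            exp_a[0] += ra * h
            exp_a[1] += ra * t
            exp_b[0] += (1 - ra) * h
            exp_b[1] += (1 - ra) * t
        # M step: maximize the expected complete-data log likelihood.
        a = exp_a[0] / (exp_a[0] + exp_a[1])
        b = exp_b[0] / (exp_b[0] + exp_b[1])
    return a, b

counts = [(9, 1), (8, 2), (4, 6), (9, 1), (3, 7)]
a, b = em_two_coins(counts)   # coin A absorbs the heads-heavy sessions
```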
Principle of Inference
DAG → Junction Tree → (Initialization) → Inconsistent Junction Tree → (Propagation) → Consistent Junction Tree → (Marginalization) → P(V = v | E = e)
Example: Create Join Tree
HMM with 2 time steps: X1 → X2, with emissions X1 → Y1 and X2 → Y2.
Junction tree:  (X1,Y1) —[X1]— (X1,X2) —[X2]— (X2,Y2)
Example: Initialization

Variable   Associated cluster   Potential function
X1         (X1,Y1)              ψ(X1,Y1) = P(X1)
Y1         (X1,Y1)              ψ(X1,Y1) = P(X1) P(Y1 | X1)
X2         (X1,X2)              ψ(X1,X2) = P(X2 | X1)
Y2         (X2,Y2)              ψ(X2,Y2) = P(Y2 | X2)
Example: Collect Evidence
• Choose an arbitrary root clique, e.g. (X1,X2), into which all potential functions will be collected.
• Recursively ask neighboring cliques for messages:
• 1. Call (X1,Y1):
  – Projection:  φ(X1) = Σ_{Y1} ψ(X1,Y1) = P(X1)
  – Absorption:  ψ(X1,X2) ← ψ(X1,X2) · φ(X1) / φ_old(X1) = P(X2 | X1) P(X1) = P(X1,X2)
Example: Collect Evidence (cont.)
• 2. Call (X2,Y2):
  – Projection:  φ(X2) = Σ_{Y2} ψ(X2,Y2) = Σ_{Y2} P(Y2 | X2) = 1
  – Absorption:  ψ(X1,X2) ← ψ(X1,X2) · φ(X2) / φ_old(X2) = P(X1,X2)
Example: Distribute Evidence
• Pass messages recursively from the root to neighboring cliques.
• Pass message from (X1,X2) to (X1,Y1):
  – Projection:  φ(X1) = Σ_{X2} ψ(X1,X2) = P(X1)
  – Absorption:  ψ(X1,Y1) ← ψ(X1,Y1) · φ(X1) / φ_old(X1) = P(X1,Y1) · P(X1) / P(X1) = P(X1,Y1)
Example: Distribute Evidence (cont.)
• Pass message from (X1,X2) to (X2,Y2):
  – Projection:  φ(X2) = Σ_{X1} ψ(X1,X2) = P(X2)
  – Absorption:  ψ(X2,Y2) ← ψ(X2,Y2) · φ(X2) / φ_old(X2) = P(Y2 | X2) · P(X2) / 1 = P(Y2,X2)
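The collect/distribute steps above can be traced concretely for the 2-step HMM junction tree (the CPT values below are illustrative): after distribution, the leaf clique holds the joint P(X2, Y2).

```python
# CPTs over binary variables (illustrative values).
p_x1 = [0.6, 0.4]
p_x2_given_x1 = [[0.7, 0.3], [0.2, 0.8]]   # p_x2_given_x1[x1][x2]
p_y_given_x = [[0.9, 0.1], [0.25, 0.75]]   # p_y_given_x[x][y]

# Initialization of the clique potentials.
psi_x1y1 = [[p_x1[x1] * p_y_given_x[x1][y1] for y1 in (0, 1)] for x1 in (0, 1)]
psi_x1x2 = [[p_x2_given_x1[x1][x2] for x2 in (0, 1)] for x1 in (0, 1)]
psi_x2y2 = [[p_y_given_x[x2][y2] for y2 in (0, 1)] for x2 in (0, 1)]

# Collect into (X1,X2): project each sepset, absorb into the root clique.
phi_x1 = [sum(psi_x1y1[x1]) for x1 in (0, 1)]                  # = P(X1)
psi_x1x2 = [[psi_x1x2[x1][x2] * phi_x1[x1] for x2 in (0, 1)] for x1 in (0, 1)]
phi_x2 = [sum(psi_x2y2[x2]) for x2 in (0, 1)]                  # = 1
psi_x1x2 = [[psi_x1x2[x1][x2] * phi_x2[x2] for x2 in (0, 1)] for x1 in (0, 1)]

# Distribute to (X2,Y2): after absorption the leaf clique holds P(X2, Y2).
phi_x2_new = [sum(psi_x1x2[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]  # = P(X2)
psi_x2y2 = [[psi_x2y2[x2][y2] * phi_x2_new[x2] / phi_x2[x2] for y2 in (0, 1)]
            for x2 in (0, 1)]
```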
Example: Inference with Evidence
• Assume we want to compute P(X2 | Y1=0, Y2=1) (state estimation).
• Assign likelihoods to the potential functions during initialization:
  ψ(X1,Y1) = 0 if Y1 = 1;  P(X1, Y1=0) if Y1 = 0
  ψ(X2,Y2) = 0 if Y2 = 0;  P(Y2=1 | X2) if Y2 = 1
Example: Inference with Evidence (cont.)
• Repeating the same steps as in the previous case, we obtain:
  ψ(X1,Y1) = 0 if Y1 = 1;  P(X1, Y1=0, Y2=1) if Y1 = 0
  φ(X1) = P(X1, Y1=0, Y2=1)
  ψ(X1,X2) = P(X1, Y1=0, X2, Y2=1)
  φ(X2) = P(Y1=0, X2, Y2=1)
  ψ(X2,Y2) = 0 if Y2 = 0;  P(Y1=0, X2, Y2=1) if Y2 = 1
Variable Elimination
General idea:
• Write the query in the form

P(X_n, e) = \sum_{x \setminus \{x_n\}} \prod_i P(x_i | pa_i, e)

• Iteratively:
  – Move all irrelevant terms outside of the innermost sum
  – Perform the innermost sum, getting a new term
  – Insert the new term into the product
Complexity of Variable Elimination
• Suppose in one elimination step we compute

f'_x(y_1, ..., y_k) = \sum_x f_x(x, y_1, ..., y_k)

f_x(x, y_1, ..., y_k) = \prod_{i=1}^{m} f_i(x, y_{i,1}, ..., y_{i,l_i})

This requires:
• m \cdot |Val(X)| \cdot \prod_i |Val(Y_i)| multiplications
• |Val(X)| \cdot \prod_i |Val(Y_i)| additions

Complexity is exponential in the number of variables in the intermediate factor.
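The elimination steps can be made concrete on a tiny chain A → B → C (a sketch with illustrative CPTs, not from the slides): summing out A produces the intermediate factor f1(B), which is then used to sum out B.

```python
# Variable elimination on the chain A -> B -> C (binary variables), computing P(C).
p_a = [0.3, 0.7]
p_b_given_a = [[0.8, 0.2], [0.4, 0.6]]   # p_b_given_a[a][b]
p_c_given_b = [[0.5, 0.5], [0.1, 0.9]]   # p_c_given_b[b][c]

# Eliminate A: P(C|B) is irrelevant to the sum over A, so it moves outside;
# summing out A yields a new factor f1(B) = P(B).
f1 = [sum(p_a[a] * p_b_given_a[a][b] for a in (0, 1)) for b in (0, 1)]

# Eliminate B: sum out B against the new factor, yielding P(C).
p_c = [sum(f1[b] * p_c_given_b[b][c] for b in (0, 1)) for c in (0, 1)]
```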
Chordal Graphs
• An elimination ordering induces an undirected chordal graph:
  – Maximal cliques are factors in elimination
  – Factors in elimination are cliques in the graph
  – Complexity is exponential in the size of the largest clique in the graph

[Figure: a DAG over V, S, T, L, A, B, X, D and the chordal undirected graph induced by elimination.]
Induced Width
• The size of the largest clique in the induced graph is thus an indicator of the complexity of variable elimination
• This quantity is called the induced width of the graph according to the specified ordering
• Finding a good ordering for a graph is equivalent to finding an ordering that achieves the minimal induced width of the graph
Properties of Junction Trees
• In every (consistent) junction tree:
  – For each cluster (or sepset) V:  φ(V) = P(V)
  – The probability distribution of any variable X can be computed from any cluster (or sepset) V that contains X:

P(X) = \sum_{V \setminus \{X\}} φ(V)
Exact Inference Using Junction Trees
• A junction tree is an undirected tree in which each node is a cluster of variables
• Running intersection property: given two clusters X and Y, all clusters on the path between X and Y contain X ∩ Y
• Separator sets (sepsets): the intersection of adjacent clusters

[Figure: clusters ABD, ADE, DEF with sepsets AD and DE; e.g. cluster ABD, sepset DE.]
Constructing Junction Trees: Marrying Parents
[Figure: DAG over X1, ..., X6; parents with a common child are connected ("married").]
Moral Graph
[Figure: the resulting moral graph over X1, ..., X6 with edge directions dropped.]
Triangulation
[Figure: the moral graph triangulated by adding chords.]
Identify Cliques
[Figure: cliques of the triangulated graph: X1X2X3, X2X3X5, X2X5X6, X2X4.]
Junction Tree
• A junction tree is a subgraph of the clique graph satisfying the running intersection property:

X1X2X3 —[X2X3]— X2X3X5 —[X2X5]— X2X5X6, with X2X4 attached via sepset X2
Constructing Junction Trees
DAG → Moral Graph → (Triangulation) → Triangulated Graph → (Identify Cliques) → Junction Tree
Sequential Approach (Example)
• Lower bound for the medical diagnosis example, using

f(\sum_j a_j z_j) \ge \sum_j q_j f(a_j z_j / q_j)

with the concave function f(x) = \ln(1 - e^{-x}):

P(f_i = 1 | d) = 1 - e^{-\theta_{i0} - \sum_j \theta_{ij} d_j} \ge \prod_j [1 - e^{-\theta_{i0} - \theta_{ij} d_j / q_{j|i}}]^{q_{j|i}}

where q_{j|i} is a probability distribution over the parents j of f_i.
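The lower bound above can be checked numerically (a sketch with illustrative parameters, not from the slides): for any distribution q over the parents, the product form stays below the exact probability.

```python
import math

def p_f1(d, theta0, theta):
    """Exact noisy-OR probability P(f_i = 1 | d)."""
    return 1.0 - math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, d)))

def lower_bound(d, theta0, theta, q):
    """prod_j [1 - exp(-theta0 - theta_j * d_j / q_j)]^{q_j}, with sum_j q_j = 1."""
    out = 1.0
    for t, dj, qj in zip(theta, d, q):
        out *= (1.0 - math.exp(-theta0 - t * dj / qj)) ** qj
    return out

d = (1, 0, 1)
theta0 = 0.05
theta = [0.5, 1.2, 0.3]
q = [0.5, 0.2, 0.3]          # variational distribution over the parents
exact = p_f1(d, theta0, theta)
bound = lower_bound(d, theta0, theta, q)
```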