Inference I: Introduction, Hardness, and Variable Elimination
Slides by Nir Friedman
In previous lessons we introduced compact representations of probability distributions:
Bayesian networks and Markov networks
A network describes a unique probability distribution P
How do we answer queries about P?
We use inference as a name for the process of computing answers to such queries
Queries: Likelihood
There are many types of queries we might ask. Most of these involve evidence.
Evidence e is an assignment of values to a set E of variables in the domain.
Without loss of generality, E = { Xk+1, …, Xn }.
The simplest query is to compute the probability of the evidence. This is often referred to as computing the likelihood of the evidence:
P(e) = Σ_{x1} … Σ_{xk} P(x1, …, xk, e)
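As a concrete illustration, here is a brute-force version of this sum in Python: a minimal sketch, assuming the full joint is small enough to enumerate. The joint table and variable indices are hypothetical.

```python
# Hypothetical joint distribution P(X1, X2, X3) over binary variables,
# stored as a table mapping full assignments (x1, x2, x3) to probabilities.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.25,
}

def likelihood(joint, evidence):
    """P(e): sum the joint over all assignments consistent with the evidence.

    `evidence` maps variable indices to observed values, e.g. {2: 1} for X3 = 1.
    """
    return sum(p for assignment, p in joint.items()
               if all(assignment[i] == v for i, v in evidence.items()))

print(likelihood(joint, {2: 1}))  # P(X3 = 1) = 0.15 + 0.20 + 0.10 + 0.25 = 0.70
```

Of course, this enumeration is exactly what the rest of the lecture shows how to avoid: it is exponential in the number of unobserved variables.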
Queries: A posteriori belief
Often we are interested in the conditional
probability of a variable given the evidence
This is the a posteriori belief in X, given evidence e
A related task is computing the term P(X = x, e), i.e., the likelihood of e and X = x for each value x of X. We can recover the a posteriori belief by normalizing:
P(X = x | e) = P(X = x, e) / P(e) = P(X = x, e) / Σ_{x'} P(X = x', e)
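Continuing the sketch above (this reuses `joint` and `likelihood`), the posterior is obtained by normalizing the per-value likelihoods; note that the normalizing constant is exactly P(e).

```python
def posterior(joint, query_var, evidence):
    """P(X | e) for one query variable, by normalizing P(X = x, e) over x."""
    values = sorted({assignment[query_var] for assignment in joint})
    unnormalized = {x: likelihood(joint, {**evidence, query_var: x}) for x in values}
    total = sum(unnormalized.values())  # this is exactly P(e)
    return {x: p / total for x, p in unnormalized.items()}

print(posterior(joint, 0, {2: 1}))  # P(X1 | X3 = 1)
```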
A posteriori belief
This query is useful in many cases:
Prediction: what is the probability of an outcome given the starting condition? Here the target is a descendant of the evidence.
Diagnosis: what is the probability of a disease/fault given symptoms? Here the target is an ancestor of the evidence.
As we shall see, the direction of the edges between variables does not restrict the direction of the queries.
Probabilistic inference can combine evidence from all parts of the network.
Queries: A posteriori joint
In this query, we are interested in the conditional probability of several variables, given the evidence
P(X, Y, … | e )
Note that the size of the answer to this query is exponential in the number of variables in the joint.
Queries: MAP
In this query we want to find the maximum a posteriori assignment for some variable of interest (say X1,…,Xl )
That is, find x1,…,xl that maximize the probability
P(x1,…,xl | e)
Note that this is equivalent to maximizing
P(x1,…,xl, e)
Queries: MAP
We can use MAP for:
Classification: find the most likely label, given the evidence.
Explanation: what is the most likely scenario, given the evidence?
Queries: MAP
Cautionary note: the MAP depends on the set of variables.
Example: the MAP of X is 1, but the MAP of (X, Y) is (0, 0):
x y P(x,y)
0 0 0.35
0 1 0.05
1 0 0.3
1 1 0.3
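A quick check of this table in Python (the numbers are from the slide's table; the code itself is just illustrative):

```python
# P(x, y) from the table above.
P = {(0, 0): 0.35, (0, 1): 0.05, (1, 0): 0.30, (1, 1): 0.30}

# MAP of X alone: marginalize Y out, then maximize.
P_x = {x: sum(p for (xx, _), p in P.items() if xx == x) for x in (0, 1)}
print(max(P_x, key=P_x.get))  # 1, since P(X=1) = 0.6 > P(X=0) = 0.4

# MAP of (X, Y) jointly: maximize over the full table.
print(max(P, key=P.get))      # (0, 0), since 0.35 is the largest single entry
```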
Complexity of Inference
Thm:
Computing P(X = x) in a Bayesian network is NP-hard
Not surprising, since we can simulate Boolean gates.
Proof
We reduce 3-SAT to Bayesian network computation
Assume we are given a 3-SAT problem: let q1,…,qn be propositions and φ1,…,φk be clauses, such that φi = li1 ∨ li2 ∨ li3, where each lij is a literal over q1,…,qn. Let φ = φ1 ∧ … ∧ φk.
We will construct a network s.t. P(X = t) > 0 iff φ is satisfiable.
In the network:
P(Qi = true) = 0.5
P(Φi = true | Qi, Qj, Ql) = 1 iff the values of Qi, Qj, Ql satisfy the clause Φi
A1, A2, … are simple binary AND gates
[Figure: the reduction network. Propositions Q1,…,Qn feed the clause nodes Φ1,…,Φk, which are combined by a chain of AND gates A1, A2, … whose final output is X.]
It is easy to check:
Polynomial number of variables
Each CPD can be described by a small table (at most 8 parameters)
P(X = true) > 0 if and only if there exists a satisfying assignment to Q1,…,Qn
Conclusion: a polynomial reduction of 3-SAT.
Note: this construction also shows that computing P(X = t) is harder than NP: 2^n · P(X = t) is the number of satisfying assignments to φ. Thus, the problem is #P-hard (in fact it is #P-complete).
Hardness - Notes
We used deterministic relations in our construction. The same construction works if we use (1−ε, ε) instead of (1, 0) in each gate, for any ε < 0.5.
Hardness does not mean we cannot solve inference. It implies that we cannot find a general procedure that works efficiently for all networks. For particular families of networks, we can have provably efficient procedures. We will characterize such families in the next two classes.
Inference in Simple Chains
Consider a simple chain X1 → X2 → … → Xn.
How do we compute P(X2)?
P(x2) = Σ_{x1} P(x1, x2) = Σ_{x1} P(x1) P(x2 | x1)
How do we compute P(X3)? We already know how to compute P(X2), so
P(x3) = Σ_{x2} P(x2, x3) = Σ_{x2} P(x2) P(x3 | x2)
Inference in Simple Chains (cont.)
How do we compute P(Xn)?
Compute P(X1), P(X2), P(X3), … in order; each term is computed from the previous one:
P(xi+1) = Σ_{xi} P(xi) P(xi+1 | xi)
Complexity: each step costs O(|Val(Xi)| · |Val(Xi+1)|) operations. Compare to naïve evaluation, which requires summing over the joint values of n−1 variables.
[Figure: the chain X1 → X2 → X3 → … → Xn]
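A minimal sketch of this forward pass in Python; the prior and transition tables are hypothetical, just to make the recursion concrete.

```python
import numpy as np

prior = np.array([0.6, 0.4])                  # P(X1), a made-up prior
cpts = [np.array([[0.7, 0.3],                 # P(Xi+1 | Xi); rows indexed by xi
                  [0.2, 0.8]]) for _ in range(4)]

def chain_marginal(prior, cpts):
    """P(Xn) via the recursion P(xi+1) = sum_{xi} P(xi) P(xi+1 | xi)."""
    p = prior
    for cpt in cpts:
        p = p @ cpt          # one O(|Val(Xi)| * |Val(Xi+1)|) step
    return p

print(chain_marginal(prior, cpts))  # P(X5) for this 5-variable chain
```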
Inference in Simple Chains (cont.)
Suppose that we observe the value X2 = x2.
How do we compute P(X1 | x2)? Recall that it suffices to compute P(X1, x2):
P(x1, x2) = P(x2 | x1) P(x1)
Inference in Simple Chains (cont.)
Suppose that we observe the value X3 = x3.
How do we compute P(X1, x3)?
P(x1, x3) = P(x1) P(x3 | x1)
How do we compute P(x3 | x1)?
P(x3 | x1) = Σ_{x2} P(x2, x3 | x1) = Σ_{x2} P(x2 | x1) P(x3 | x2, x1) = Σ_{x2} P(x2 | x1) P(x3 | x2)
(The last step uses the chain's Markov property: X3 is independent of X1 given X2.)
Inference in Simple Chains (cont.)
Suppose that we observe the value Xn = xn.
How do we compute P(X1, xn)?
P(x1, xn) = P(x1) P(xn | x1)
We compute P(xn | xn−1), P(xn | xn−2), … iteratively:
P(xn | xi) = Σ_{xi+1} P(xi+1, xn | xi) = Σ_{xi+1} P(xi+1 | xi) P(xn | xi+1)
[Figure: the chain X1 → X2 → X3 → … → Xn]
Inference in Simple Chains (cont.)
Suppose that we observe the value Xn = xn.
We want to find P(Xk | xn). How do we compute P(Xk, xn)?
We compute P(Xk) by forward iterations and P(xn | Xk) by backward iterations; then
P(xk, xn) = P(xk) P(xn | xk)
[Figure: the chain X1 → … → Xk → … → Xn]
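Combining the two passes in code, again with the hypothetical chain CPTs from the forward-pass sketch above:

```python
import numpy as np

prior = np.array([0.6, 0.4])
cpts = [np.array([[0.7, 0.3], [0.2, 0.8]]) for _ in range(4)]

def chain_query(prior, cpts, k, xn):
    """P(Xk, Xn = xn) = P(Xk) * P(xn | Xk); k is the 0-based position of Xk."""
    forward = prior                       # forward pass: P(Xk)
    for cpt in cpts[:k]:
        forward = forward @ cpt
    backward = np.zeros(cpts[-1].shape[1])
    backward[xn] = 1.0                    # indicator on the observed Xn = xn
    for cpt in reversed(cpts[k:]):        # backward pass: P(xn | Xk)
        backward = cpt @ backward         # P(xn|xi) = sum_{xi+1} P(xi+1|xi) P(xn|xi+1)
    return forward * backward

joint = chain_query(prior, cpts, k=2, xn=1)  # P(X3, X5 = 1)
print(joint / joint.sum())                   # normalize to get P(X3 | X5 = 1)
```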
Elimination in Chains
We now try to understand the simple chain example from first principles.
Using the definition of probability, we have
P(e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
[Figure: the chain A → B → C → D → E]
Elimination in Chains
By chain decomposition, we get
P(e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
     = Σ_d Σ_c Σ_b Σ_a P(a) P(b | a) P(c | b) P(d | c) P(e | d)
Elimination in Chains
Rearranging terms ...
P(e) = Σ_d Σ_c Σ_b Σ_a P(a) P(b | a) P(c | b) P(d | c) P(e | d)
     = Σ_d Σ_c Σ_b P(c | b) P(d | c) P(e | d) Σ_a P(a) P(b | a)
Elimination in Chains
Now we can perform the innermost summation:
P(e) = Σ_d Σ_c Σ_b P(c | b) P(d | c) P(e | d) Σ_a P(a) P(b | a)
     = Σ_d Σ_c Σ_b P(c | b) P(d | c) P(e | d) P(b)
This summation is exactly the first step in the forward iteration we described before.
[Figure: the chain with A eliminated]
Elimination in Chains
Rearranging and then summing again, we get
P(e) = Σ_d Σ_c Σ_b P(c | b) P(d | c) P(e | d) P(b)
     = Σ_d Σ_c P(d | c) P(e | d) Σ_b P(c | b) P(b)
     = Σ_d Σ_c P(d | c) P(e | d) P(c)
[Figure: the chain with A and B eliminated]
Elimination in Chains with Evidence
Similarly, we can understand the backward pass.
We write the query in explicit form:
P(a, e) = Σ_b Σ_c Σ_d P(a, b, c, d, e)
        = Σ_b Σ_c Σ_d P(a) P(b | a) P(c | b) P(d | c) P(e | d)
Elimination in Chains with Evidence
Eliminating d, we get
P(a, e) = Σ_b Σ_c Σ_d P(a) P(b | a) P(c | b) P(d | c) P(e | d)
        = Σ_b Σ_c P(a) P(b | a) P(c | b) Σ_d P(d | c) P(e | d)
        = Σ_b Σ_c P(a) P(b | a) P(c | b) P(e | c)
[Figure: the chain with D eliminated]
Elimination in Chains with Evidence
Eliminating c, we get
P(a, e) = Σ_b Σ_c P(a) P(b | a) P(c | b) P(e | c)
        = Σ_b P(a) P(b | a) Σ_c P(c | b) P(e | c)
        = Σ_b P(a) P(b | a) P(e | b)
[Figure: the chain with C and D eliminated]
Elimination in Chains with Evidence
Finally, we eliminate b:
P(a, e) = Σ_b P(a) P(b | a) P(e | b)
        = P(a) Σ_b P(b | a) P(e | b)
        = P(a) P(e | a)
[Figure: the chain with B, C, and D eliminated]
Variable Elimination
General idea:
Write the query in the form
P(X1, e) = Σ_{xk} ⋯ Σ_{x3} Σ_{x2} Π_i P(xi | Pai, e)
Then iteratively:
Move all irrelevant terms outside of the innermost sum
Perform the innermost sum, getting a new term
Insert the new term into the product
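Here is a minimal, self-contained sketch of this procedure over discrete factors, assuming for simplicity that every variable is binary. The factor representation and function names are illustrative, not from the slides.

```python
from itertools import product

VALUES = (0, 1)  # every variable is binary in this sketch

def multiply(f, g):
    """Pointwise product of two factors over the union of their scopes.

    A factor is (scope, table): a tuple of variable names plus a dict mapping
    each full assignment (a tuple of values, in scope order) to a number.
    """
    fs, ft = f
    gs, gt = g
    scope = fs + tuple(v for v in gs if v not in fs)
    table = {}
    for assignment in product(VALUES, repeat=len(scope)):
        env = dict(zip(scope, assignment))
        table[assignment] = (ft[tuple(env[v] for v in fs)] *
                             gt[tuple(env[v] for v in gs)])
    return scope, table

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    fs, ft = f
    i = fs.index(var)
    scope = fs[:i] + fs[i + 1:]
    table = {}
    for assignment, p in ft.items():
        reduced = assignment[:i] + assignment[i + 1:]
        table[reduced] = table.get(reduced, 0.0) + p
    return scope, table

def eliminate(factors, order):
    """Variable elimination: for each variable in `order` (each must appear in
    some factor), multiply the factors that mention it, sum it out, and put
    the new factor back into the pool."""
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = involved[0]
        for f in involved[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Tiny usage check on the chain A -> B with P(A) and P(B | A):
pa = (('A',), {(0,): 0.6, (1,): 0.4})
pba = (('A', 'B'), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
print(eliminate([pa, pba], ['A']))  # (('B',), {(0,): 0.5, (1,): 0.5})
```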
A More Complex Example
The "Asia" network:
[Figure: nodes V (Visit to Asia), S (Smoking), T (Tuberculosis), L (Lung Cancer), A (Abnormality in Chest), B (Bronchitis), X (X-Ray), D (Dyspnea); edges V → T, S → L, S → B, T → A, L → A, A → X, A → D, B → D]
P(v, s, t, l, a, b, x, d) = P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
We want to compute P(d). Need to eliminate: v, s, x, t, l, a, b.
Initial factors:
P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Eliminate v. Compute f_v(t) = Σ_v P(v) P(t | v), leaving
f_v(t) P(s) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Note: f_v(t) = P(t). In general, the result of elimination is not necessarily a probability term.
Eliminate s. Compute f_s(b, l) = Σ_s P(s) P(b | s) P(l | s), leaving
f_v(t) f_s(b, l) P(a | t, l) P(x | a) P(d | a, b)
Note: summing on s results in a factor with two arguments, f_s(b, l). In general, the result of elimination may be a function of several variables.
Eliminate x. Compute f_x(a) = Σ_x P(x | a), leaving
f_v(t) f_s(b, l) f_x(a) P(a | t, l) P(d | a, b)
Note: f_x(a) = 1 for all values of a!
Eliminate t. Compute f_t(a, l) = Σ_t f_v(t) P(a | t, l), leaving
f_s(b, l) f_x(a) f_t(a, l) P(d | a, b)
Eliminate l. Compute f_l(a, b) = Σ_l f_s(b, l) f_t(a, l), leaving
f_x(a) f_l(a, b) P(d | a, b)
Eliminate a, then b. Compute f_a(b, d) = Σ_a f_x(a) f_l(a, b) P(d | a, b), then f_b(d) = Σ_b f_a(b, d), leaving
f_b(d) = P(d)
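To make the walkthrough concrete, the same elimination order can be run through the `eliminate` sketch from above. The CPT numbers here are made up purely for illustration; only the network structure matches the slides.

```python
def cpt(scope, p_child_1):
    """Binary CPT as a factor: p_child_1 maps each assignment of the parents
    (the scope minus its last variable) to P(child = 1 | parents)."""
    table = {}
    for pa, p1 in p_child_1.items():
        table[pa + (0,)] = 1.0 - p1
        table[pa + (1,)] = p1
    return scope, table

factors = [
    cpt(('V',), {(): 0.01}),                              # P(v)  (made-up numbers)
    cpt(('S',), {(): 0.50}),                              # P(s)
    cpt(('V', 'T'), {(0,): 0.01, (1,): 0.05}),            # P(t | v)
    cpt(('S', 'L'), {(0,): 0.01, (1,): 0.10}),            # P(l | s)
    cpt(('S', 'B'), {(0,): 0.30, (1,): 0.60}),            # P(b | s)
    cpt(('T', 'L', 'A'), {(0, 0): 0.0, (0, 1): 1.0,
                          (1, 0): 1.0, (1, 1): 1.0}),     # P(a | t, l)
    cpt(('A', 'X'), {(0,): 0.05, (1,): 0.98}),            # P(x | a)
    cpt(('A', 'B', 'D'), {(0, 0): 0.10, (0, 1): 0.80,
                          (1, 0): 0.70, (1, 1): 0.90}),   # P(d | a, b)
]

# Eliminate in the order from the slides; the remaining factor is P(D).
print(eliminate(factors, ['V', 'S', 'X', 'T', 'L', 'A', 'B']))
```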
Variable Elimination
We now understand variable elimination as a sequence of rewriting operations
The actual computation is done in the elimination steps.
Exactly the same computation procedure applies to Markov networks.
The computation depends on the order of elimination; we will return to this issue in detail.
Dealing with evidence
How do we deal with evidence?
Suppose we get evidence V = t, S = f, D = t. We want to compute P(L, V = t, S = f, D = t).
We start by writing the factors:
P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Since we know that V = t, we don't need to eliminate V. Instead, we can replace the factors P(V) and P(T | V) with
f_P(V) = P(V = t)    f_P(T|V)(T) = P(T | V = t)
These "select" the appropriate parts of the original factors given the evidence.
Note that f_P(V) is a constant, and thus does not appear in the elimination of other variables.
Dealing with Evidence
Given evidence V = t, S = f, D = t, we compute P(L, V = t, S = f, D = t).
Initial factors, after setting evidence:
f_P(v) f_P(s) f_P(t|v)(t) f_P(l|s)(l) f_P(b|s)(b) P(a | t, l) P(x | a) f_P(d|a,b)(a, b)
Eliminating x, we get
f_P(v) f_P(s) f_P(t|v)(t) f_P(l|s)(l) f_P(b|s)(b) P(a | t, l) f_x(a) f_P(d|a,b)(a, b)
Eliminating t, we get
f_P(v) f_P(s) f_P(l|s)(l) f_P(b|s)(b) f_t(a, l) f_x(a) f_P(d|a,b)(a, b)
Eliminating a, we get
f_P(v) f_P(s) f_P(l|s)(l) f_P(b|s)(b) f_a(b, l)
Eliminating b, we get
f_P(v) f_P(s) f_P(l|s)(l) f_b(l)
Complexity of variable elimination
Suppose in one elimination step we compute
f_x(y1, …, yk) = Σ_x f'_x(x, y1, …, yk)
where
f'_x(x, y1, …, yk) = Π_{i=1..m} f_i(x, y_{i,1}, …, y_{i,li})
This requires:
m · |Val(X)| · Π_i |Val(Yi)| multiplications: for each value of x, y1, …, yk, we do m multiplications
|Val(X)| · Π_i |Val(Yi)| additions: for each value of y1, …, yk, we do |Val(X)| additions
Complexity is exponential in the number of variables in the intermediate factor.
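As a sanity check on these counts, a tiny calculator (the counts follow the slide's formulas; the example sizes are arbitrary):

```python
from math import prod

def elimination_step_cost(m, val_x, val_ys):
    """(multiplications, additions) for one step, per the counts above:
    m factors multiplied into f', then x summed out."""
    mults = m * val_x * prod(val_ys)
    adds = val_x * prod(val_ys)
    return mults, adds

# e.g. m = 3 factors, binary x, two ternary y's:
print(elimination_step_cost(3, 2, [3, 3]))  # (54, 18)
```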
Understanding Variable Elimination
We want to select “good” elimination orderings that reduce complexity
We start by attempting to understand variable elimination via the graph we are working with
This will reduce the problem of finding a good ordering to a well-understood graph-theoretic operation.
Undirected graph representation
At each stage of the procedure, we have an algebraic term that we need to evaluate
In general this term is of the form
P(x1, …, xk) = Σ_{y1} ⋯ Σ_{yn} Π_i f_i(Zi)
where the Zi are sets of variables.
We now plot a graph where there is an undirected edge X–Y if X and Y are arguments of some factor, that is, if X and Y are in some Zi.
Note: this is the Markov network that describes the probability distribution on the variables we have not yet eliminated.
Undirected Graph Representation
Consider the "Asia" example. The initial factors are
P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
thus, the undirected graph connects the arguments of each factor: for example, T and L (the parents of A) become linked, as do A and B (the parents of D).
In the first step, this graph is just the moralized graph of the network.
[Figure: the Asia network and its moralized, undirected counterpart]
Undirected Graph Representation
Now we eliminate t, getting
P(v) P(s) P(l | s) P(b | s) f_t(v, a, l) P(x | a) P(d | a, b)
The corresponding change in the graph: T is removed, and its neighbors V, A, and L become mutually connected, since they are the arguments of the new factor f_t(v, a, l).
[Figure: the undirected graph before and after eliminating T]
Example
Want to compute P(L, V = t, S = f, D = t)
Moralizing; setting evidence.
[Figure: the moralized Asia graph with the evidence nodes V, S, D marked]
Eliminating x: new factor f_x(A).
Eliminating a: new factor f_a(b, t, l).
Eliminating b: new factor f_b(t, l).
Eliminating t: new factor f_t(l).
[Figures: the graph after each elimination step]
Elimination in Undirected Graphs
Generalizing, we see that we can eliminate a variable X by:
1. For all Y, Z such that Y–X and Z–X, add an edge Y–Z
2. Remove X and all edges adjacent to it
This procedure creates a clique that contains all the neighbors of X: after step 1 we have a clique that corresponds to the intermediate factor (before marginalization). The cost of the step is exponential in the size of this clique.
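A small sketch of this two-step procedure in code, with the graph represented as a dict from node to neighbor set (the node names are illustrative):

```python
def eliminate_node(graph, x):
    """Eliminate x: connect all of x's neighbors into a clique, remove x.
    Returns the clique corresponding to the intermediate factor."""
    neighbors = graph.pop(x)
    for y in neighbors:
        graph[y].discard(x)
        graph[y] |= neighbors - {y}   # step 1: add edges between neighbors
    return neighbors | {x}            # the step's cost is exponential in this size

# Moralized chain A - B - C: eliminating B links A and C.
g = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}}
print(eliminate_node(g, 'B'))  # {'A', 'B', 'C'}
print(g)                       # {'A': {'C'}, 'C': {'A'}}
```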
Undirected Graphs
The process of eliminating nodes from an undirected graph gives us a clue to the complexity of inference
To see this, we will examine the graph that contains all of the edges we added during the elimination. The resulting graph is always chordal.
Example
Want to compute P(D)
Moralizing.
[Figure: the moralized Asia graph]
Eliminating v: multiply to get f'_v(v, t); result f_v(t).
Eliminating x: multiply to get f'_x(a, x); result f_x(a).
Eliminating s: multiply to get f'_s(l, b, s); result f_s(l, b).
Eliminating t: multiply to get f'_t(a, l, t); result f_t(a, l).
Eliminating l: multiply to get f'_l(a, b, l); result f_l(a, b).
Eliminating a, b: multiply to get f'_a(a, b, d); result f(d).
[Figures: the graph after each elimination step, with the added edges shown]
The resulting graph is the induced graph (for this particular ordering).
Main property:
Every maximal clique in the induced graph corresponds to an intermediate factor in the computation.
Every factor stored during the process is a subset of some maximal clique in the graph.
These facts are true for any variable elimination ordering on any network.
Expanded Graphs
[Figure: the induced ("expanded") Asia graph for this ordering]
Induced Width (Treewidth)
The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination
This quantity (minus one) is called the induced width (or treewidth) of a graph according to the specified ordering
Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph
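In code, the induced width of a given ordering can be measured by replaying the elimination and tracking the largest clique created; a sketch under the same dict-of-sets representation as above:

```python
def induced_width(graph, order):
    """Largest clique size created by eliminating in this order, minus one."""
    g = {v: set(ns) for v, ns in graph.items()}  # work on a copy
    width = 0
    for x in order:
        neighbors = g.pop(x)
        width = max(width, len(neighbors))       # clique size minus one
        for y in neighbors:
            g[y].discard(x)
            g[y] |= neighbors - {y}
    return width

# A chain A - B - C - D eliminated end-to-end has induced width 1:
g = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'C'}}
print(induced_width(g, ['A', 'B', 'C', 'D']))  # 1
```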
Consequence: Elimination on Trees
Suppose we have a tree, i.e., a network where each variable has at most one parent. Then all the factors involve at most two variables, and thus the moralized graph is also a tree.
[Figure: a tree network over A, B, …, G and its moralized graph, which is the same tree]
Elimination on Trees
We can maintain the tree structure by eliminating extreme (leaf) variables of the tree.
[Figure: eliminating leaves of the tree keeps the remaining graph a tree]
Elimination on Trees
Formally, for any tree, there is an elimination ordering with treewidth = 1.
Thm: Inference on trees is linear in the number of variables.
PolyTrees
A polytree is a network where there is at most one path between any two variables.
Thm: Inference in a polytree is linear in the representation size of the network. This assumes a tabular CPT representation.
Can you see how the argument would work?
[Figure: a polytree over A, B, …, H]
General Networks
What do we do when the network is not a polytree? If the network has a cycle, the treewidth for any ordering is greater than 1.
Example
Eliminating A, B, C, D, E, …. The resulting graph is chordal with treewidth 2.
[Figures: a network over A, …, H containing cycles, and the graph after each elimination step]
Example
Eliminating H, G, E, C, F, D, B, A.
[Figures: the same network eliminated in a different order, showing the edges added at each step]
General Networks
From graph theory:
Thm: Finding an ordering that minimizes the treewidth is NP-hard.
However:
There are reasonable heuristics for finding "relatively" good orderings (one common greedy heuristic is sketched below).
There are provable approximations to the best treewidth.
If the graph has a small treewidth, there are algorithms that find it in polynomial time.
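As an illustration of such a heuristic (one common choice, not one specified in the slides), here is a sketch of greedy min-degree ordering: repeatedly eliminate a node of smallest current degree, simulating the fill-in as we go.

```python
def min_degree_order(graph):
    """Greedy min-degree elimination ordering for a dict-of-sets graph."""
    g = {v: set(ns) for v, ns in graph.items()}
    order = []
    while g:
        x = min(g, key=lambda v: len(g[v]))  # node of smallest current degree
        neighbors = g.pop(x)
        for y in neighbors:
            g[y].discard(x)
            g[y] |= neighbors - {y}          # simulate the elimination fill-in
        order.append(x)
    return order

# A 4-cycle A - B - C - D - A; any ordering here has induced width 2:
g = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'A', 'C'}}
print(min_degree_order(g))
```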