Inference I: Introduction, Hardness, and Variable Elimination
Slides by Nir Friedman
In previous lessons we introduced compact representations of probability distributions:
Bayesian networks and Markov networks
A network describes a unique probability distribution P
How do we answer queries about P?
We use inference as a name for the process of computing answers to such queries
Queries: Likelihood
There are many types of queries we might ask. Most of these involve evidence.
Evidence e is an assignment of values to a set E of variables in the domain.
Without loss of generality, E = { Xk+1, …, Xn }.
The simplest query is to compute the probability of the evidence. This is often referred to as computing the likelihood of the evidence:
P(e) = Σ_{x1} … Σ_{xk} P(x1, …, xk, e)
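As a concrete illustration, here is a brute-force version of this sum in Python: a minimal sketch, assuming the full joint is small enough to enumerate. The joint table and variable indices are hypothetical.

```python
# Hypothetical joint distribution P(X1, X2, X3) over binary variables,
# stored as a table mapping full assignments (x1, x2, x3) to probabilities.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.25,
}

def likelihood(joint, evidence):
    """P(e): sum the joint over all assignments consistent with the evidence.

    `evidence` maps variable indices to observed values, e.g. {2: 1} for X3 = 1.
    """
    return sum(p for assignment, p in joint.items()
               if all(assignment[i] == v for i, v in evidence.items()))

print(likelihood(joint, {2: 1}))  # P(X3 = 1) = 0.15 + 0.20 + 0.10 + 0.25 = 0.70
```

Of course, this enumeration is exactly what the rest of the lecture shows how to avoid: it is exponential in the number of unobserved variables.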
Queries: A posteriori belief
Often we are interested in the conditional
probability of a variable given the evidence
This is the a posteriori belief in X, given evidence e
A related task is computing the term P(X = x, e), i.e., the likelihood of e and X = x for each value x of X. We can recover the a posteriori belief by normalizing:
P(X = x | e) = P(X = x, e) / P(e) = P(X = x, e) / Σ_{x'} P(X = x', e)
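Continuing the sketch above (this reuses `joint` and `likelihood`), the posterior is obtained by normalizing the per-value likelihoods; note that the normalizing constant is exactly P(e).

```python
def posterior(joint, query_var, evidence):
    """P(X | e) for one query variable, by normalizing P(X = x, e) over x."""
    values = sorted({assignment[query_var] for assignment in joint})
    unnormalized = {x: likelihood(joint, {**evidence, query_var: x}) for x in values}
    total = sum(unnormalized.values())  # this is exactly P(e)
    return {x: p / total for x, p in unnormalized.items()}

print(posterior(joint, 0, {2: 1}))  # P(X1 | X3 = 1)
```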
A posteriori belief
This query is useful in many cases:
Prediction: what is the probability of an outcome given the starting condition? Here the target is a descendant of the evidence.
Diagnosis: what is the probability of a disease/fault given symptoms? Here the target is an ancestor of the evidence.
As we shall see, the direction of the edges between variables does not restrict the direction of the queries.
Probabilistic inference can combine evidence from all parts of the network.
Queries: A posteriori joint
In this query, we are interested in the conditional probability of several variables, given the evidence
P(X, Y, … | e )
Note that the size of the answer to this query is exponential in the number of variables in the joint.
Queries: MAP
In this query we want to find the maximum a posteriori assignment for some variable of interest (say X1,…,Xl )
That is, find x1,…,xl that maximize the probability
P(x1,…,xl | e)
Note that this is equivalent to maximizing
P(x1,…,xl, e)
Queries: MAP
We can use MAP for:
Classification: find the most likely label, given the evidence.
Explanation: what is the most likely scenario, given the evidence?
Queries: MAP
Cautionary note: the MAP depends on the set of variables.
Example: the MAP of X is 1, but the MAP of (X, Y) is (0, 0):
x y P(x,y)
0 0 0.35
0 1 0.05
1 0 0.3
1 1 0.3
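A quick check of this table in Python (the numbers are from the slide's table; the code itself is just illustrative):

```python
# P(x, y) from the table above.
P = {(0, 0): 0.35, (0, 1): 0.05, (1, 0): 0.30, (1, 1): 0.30}

# MAP of X alone: marginalize Y out, then maximize.
P_x = {x: sum(p for (xx, _), p in P.items() if xx == x) for x in (0, 1)}
print(max(P_x, key=P_x.get))  # 1, since P(X=1) = 0.6 > P(X=0) = 0.4

# MAP of (X, Y) jointly: maximize over the full table.
print(max(P, key=P.get))      # (0, 0), since 0.35 is the largest single entry
```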
Complexity of Inference
Thm:
Computing P(X = x) in a Bayesian network is NP-hard
Not surprising, since we can simulate Boolean gates.
Proof
We reduce 3-SAT to Bayesian network computation
Assume we are given a 3-SAT problem: let q1,…,qn be propositions and φ1,…,φk be clauses, such that φi = li1 ∨ li2 ∨ li3, where each lij is a literal over q1,…,qn. Let φ = φ1 ∧ … ∧ φk.
We will construct a network s.t. P(X = t) > 0 iff φ is satisfiable.
In the network:
P(Qi = true) = 0.5
P(Φi = true | Qi, Qj, Ql) = 1 iff the values of Qi, Qj, Ql satisfy the clause Φi
A1, A2, … are simple binary AND gates
[Figure: the reduction network. Propositions Q1,…,Qn feed the clause nodes Φ1,…,Φk, which are combined by a chain of AND gates A1, A2, … whose final output is X.]
It is easy to check:
Polynomial number of variables
Each CPD can be described by a small table (at most 8 parameters)
P(X = true) > 0 if and only if there exists a satisfying assignment to Q1,…,Qn
Conclusion: a polynomial reduction of 3-SAT.
Note: this construction also shows that computing P(X = t) is harder than NP: 2^n · P(X = t) is the number of satisfying assignments to φ. Thus, the problem is #P-hard (in fact it is #P-complete).
Hardness - Notes
We used deterministic relations in our construction. The same construction works if we use (1−ε, ε) instead of (1, 0) in each gate, for any ε < 0.5.
Hardness does not mean we cannot solve inference. It implies that we cannot find a general procedure that works efficiently for all networks. For particular families of networks, we can have provably efficient procedures. We will characterize such families in the next two classes.
Inference in Simple Chains
Consider a simple chain X1 → X2 → … → Xn.
How do we compute P(X2)?
P(x2) = Σ_{x1} P(x1, x2) = Σ_{x1} P(x1) P(x2 | x1)
How do we compute P(X3)? We already know how to compute P(X2), so
P(x3) = Σ_{x2} P(x2, x3) = Σ_{x2} P(x2) P(x3 | x2)
Inference in Simple Chains (cont.)
How do we compute P(Xn)?
Compute P(X1), P(X2), P(X3), … in order; each term is computed from the previous one:
P(xi+1) = Σ_{xi} P(xi) P(xi+1 | xi)
Complexity: each step costs O(|Val(Xi)| · |Val(Xi+1)|) operations. Compare to naïve evaluation, which requires summing over the joint values of n−1 variables.
[Figure: the chain X1 → X2 → X3 → … → Xn]
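A minimal sketch of this forward pass in Python; the prior and transition tables are hypothetical, just to make the recursion concrete.

```python
import numpy as np

prior = np.array([0.6, 0.4])                  # P(X1), a made-up prior
cpts = [np.array([[0.7, 0.3],                 # P(Xi+1 | Xi); rows indexed by xi
                  [0.2, 0.8]]) for _ in range(4)]

def chain_marginal(prior, cpts):
    """P(Xn) via the recursion P(xi+1) = sum_{xi} P(xi) P(xi+1 | xi)."""
    p = prior
    for cpt in cpts:
        p = p @ cpt          # one O(|Val(Xi)| * |Val(Xi+1)|) step
    return p

print(chain_marginal(prior, cpts))  # P(X5) for this 5-variable chain
```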
Inference in Simple Chains (cont.)
Suppose that we observe the value X2 = x2.
How do we compute P(X1 | x2)? Recall that it suffices to compute P(X1, x2):
P(x1, x2) = P(x2 | x1) P(x1)
Inference in Simple Chains (cont.)
Suppose that we observe the value X3 = x3.
How do we compute P(X1, x3)?
P(x1, x3) = P(x1) P(x3 | x1)
How do we compute P(x3 | x1)?
P(x3 | x1) = Σ_{x2} P(x2, x3 | x1) = Σ_{x2} P(x2 | x1) P(x3 | x2, x1) = Σ_{x2} P(x2 | x1) P(x3 | x2)
(The last step uses the chain's Markov property: X3 is independent of X1 given X2.)
Inference in Simple Chains (cont.)
Suppose that we observe the value Xn = xn.
How do we compute P(X1, xn)?
P(x1, xn) = P(x1) P(xn | x1)
We compute P(xn | xn−1), P(xn | xn−2), … iteratively:
P(xn | xi) = Σ_{xi+1} P(xi+1, xn | xi) = Σ_{xi+1} P(xi+1 | xi) P(xn | xi+1)
[Figure: the chain X1 → X2 → X3 → … → Xn]
Inference in Simple Chains (cont.)
Suppose that we observe the value Xn = xn.
We want to find P(Xk | xn). How do we compute P(Xk, xn)?
We compute P(Xk) by forward iterations and P(xn | Xk) by backward iterations; then
P(xk, xn) = P(xk) P(xn | xk)
[Figure: the chain X1 → … → Xk → … → Xn]
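Combining the two passes in code, again with the hypothetical chain CPTs from the forward-pass sketch above:

```python
import numpy as np

prior = np.array([0.6, 0.4])
cpts = [np.array([[0.7, 0.3], [0.2, 0.8]]) for _ in range(4)]

def chain_query(prior, cpts, k, xn):
    """P(Xk, Xn = xn) = P(Xk) * P(xn | Xk); k is the 0-based position of Xk."""
    forward = prior                       # forward pass: P(Xk)
    for cpt in cpts[:k]:
        forward = forward @ cpt
    backward = np.zeros(cpts[-1].shape[1])
    backward[xn] = 1.0                    # indicator on the observed Xn = xn
    for cpt in reversed(cpts[k:]):        # backward pass: P(xn | Xk)
        backward = cpt @ backward         # P(xn|xi) = sum_{xi+1} P(xi+1|xi) P(xn|xi+1)
    return forward * backward

joint = chain_query(prior, cpts, k=2, xn=1)  # P(X3, X5 = 1)
print(joint / joint.sum())                   # normalize to get P(X3 | X5 = 1)
```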
Elimination in Chains
We now try to understand the simple chain example from first principles.
Using the definition of probability, we have
P(e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
[Figure: the chain A → B → C → D → E]
Elimination in Chains
By chain decomposition, we get
P(e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
     = Σ_d Σ_c Σ_b Σ_a P(a) P(b | a) P(c | b) P(d | c) P(e | d)
Elimination in Chains
Rearranging terms ...
P(e) = Σ_d Σ_c Σ_b Σ_a P(a) P(b | a) P(c | b) P(d | c) P(e | d)
     = Σ_d Σ_c Σ_b P(c | b) P(d | c) P(e | d) Σ_a P(a) P(b | a)
Elimination in Chains
Now we can perform the innermost summation:
P(e) = Σ_d Σ_c Σ_b P(c | b) P(d | c) P(e | d) Σ_a P(a) P(b | a)
     = Σ_d Σ_c Σ_b P(c | b) P(d | c) P(e | d) P(b)
This summation is exactly the first step in the forward iteration we described before.
[Figure: the chain with A eliminated]
Elimination in Chains
Rearranging and then summing again, we get
P(e) = Σ_d Σ_c Σ_b P(c | b) P(d | c) P(e | d) P(b)
     = Σ_d Σ_c P(d | c) P(e | d) Σ_b P(c | b) P(b)
     = Σ_d Σ_c P(d | c) P(e | d) P(c)
[Figure: the chain with A and B eliminated]
Elimination in Chains with Evidence
Similarly, we can understand the backward pass.
We write the query in explicit form:
P(a, e) = Σ_b Σ_c Σ_d P(a, b, c, d, e)
        = Σ_b Σ_c Σ_d P(a) P(b | a) P(c | b) P(d | c) P(e | d)
Elimination in Chains with Evidence
Eliminating d, we get
P(a, e) = Σ_b Σ_c Σ_d P(a) P(b | a) P(c | b) P(d | c) P(e | d)
        = Σ_b Σ_c P(a) P(b | a) P(c | b) Σ_d P(d | c) P(e | d)
        = Σ_b Σ_c P(a) P(b | a) P(c | b) P(e | c)
[Figure: the chain with D eliminated]
Elimination in Chains with Evidence
Eliminating c, we get
P(a, e) = Σ_b Σ_c P(a) P(b | a) P(c | b) P(e | c)
        = Σ_b P(a) P(b | a) Σ_c P(c | b) P(e | c)
        = Σ_b P(a) P(b | a) P(e | b)
[Figure: the chain with C and D eliminated]
Elimination in Chains with Evidence
Finally, we eliminate b:
P(a, e) = Σ_b P(a) P(b | a) P(e | b)
        = P(a) Σ_b P(b | a) P(e | b)
        = P(a) P(e | a)
[Figure: the chain with B, C, and D eliminated]
Variable Elimination
General idea:
Write the query in the form
P(X1, e) = Σ_{xk} ⋯ Σ_{x3} Σ_{x2} Π_i P(xi | Pai, e)
Then iteratively:
Move all irrelevant terms outside of the innermost sum
Perform the innermost sum, getting a new term
Insert the new term into the product
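Here is a minimal, self-contained sketch of this procedure over discrete factors, assuming for simplicity that every variable is binary. The factor representation and function names are illustrative, not from the slides.

```python
from itertools import product

VALUES = (0, 1)  # every variable is binary in this sketch

def multiply(f, g):
    """Pointwise product of two factors over the union of their scopes.

    A factor is (scope, table): a tuple of variable names plus a dict mapping
    each full assignment (a tuple of values, in scope order) to a number.
    """
    fs, ft = f
    gs, gt = g
    scope = fs + tuple(v for v in gs if v not in fs)
    table = {}
    for assignment in product(VALUES, repeat=len(scope)):
        env = dict(zip(scope, assignment))
        table[assignment] = (ft[tuple(env[v] for v in fs)] *
                             gt[tuple(env[v] for v in gs)])
    return scope, table

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    fs, ft = f
    i = fs.index(var)
    scope = fs[:i] + fs[i + 1:]
    table = {}
    for assignment, p in ft.items():
        reduced = assignment[:i] + assignment[i + 1:]
        table[reduced] = table.get(reduced, 0.0) + p
    return scope, table

def eliminate(factors, order):
    """Variable elimination: for each variable in `order` (each must appear in
    some factor), multiply the factors that mention it, sum it out, and put
    the new factor back into the pool."""
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = involved[0]
        for f in involved[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Tiny usage check on the chain A -> B with P(A) and P(B | A):
pa = (('A',), {(0,): 0.6, (1,): 0.4})
pba = (('A', 'B'), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
print(eliminate([pa, pba], ['A']))  # (('B',), {(0,): 0.5, (1,): 0.5})
```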
A More Complex Example
The "Asia" network:
[Figure: nodes V (Visit to Asia), S (Smoking), T (Tuberculosis), L (Lung Cancer), A (Abnormality in Chest), B (Bronchitis), X (X-Ray), D (Dyspnea); edges V → T, S → L, S → B, T → A, L → A, A → X, A → D, B → D]
P(v, s, t, l, a, b, x, d) = P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
We want to compute P(d). Need to eliminate: v, s, x, t, l, a, b.
Initial factors:
P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Eliminate v. Compute f_v(t) = Σ_v P(v) P(t | v), leaving
f_v(t) P(s) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Note: f_v(t) = P(t). In general, the result of elimination is not necessarily a probability term.
Eliminate s. Compute f_s(b, l) = Σ_s P(s) P(b | s) P(l | s), leaving
f_v(t) f_s(b, l) P(a | t, l) P(x | a) P(d | a, b)
Note: summing on s results in a factor with two arguments, f_s(b, l). In general, the result of elimination may be a function of several variables.
Eliminate x. Compute f_x(a) = Σ_x P(x | a), leaving
f_v(t) f_s(b, l) f_x(a) P(a | t, l) P(d | a, b)
Note: f_x(a) = 1 for all values of a!
Eliminate t. Compute f_t(a, l) = Σ_t f_v(t) P(a | t, l), leaving
f_s(b, l) f_x(a) f_t(a, l) P(d | a, b)
Eliminate l. Compute f_l(a, b) = Σ_l f_s(b, l) f_t(a, l), leaving
f_x(a) f_l(a, b) P(d | a, b)
Eliminate a, then b. Compute f_a(b, d) = Σ_a f_x(a) f_l(a, b) P(d | a, b), then f_b(d) = Σ_b f_a(b, d), leaving
f_b(d) = P(d)
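To make the walkthrough concrete, the same elimination order can be run through the `eliminate` sketch from above. The CPT numbers here are made up purely for illustration; only the network structure matches the slides.

```python
def cpt(scope, p_child_1):
    """Binary CPT as a factor: p_child_1 maps each assignment of the parents
    (the scope minus its last variable) to P(child = 1 | parents)."""
    table = {}
    for pa, p1 in p_child_1.items():
        table[pa + (0,)] = 1.0 - p1
        table[pa + (1,)] = p1
    return scope, table

factors = [
    cpt(('V',), {(): 0.01}),                              # P(v)  (made-up numbers)
    cpt(('S',), {(): 0.50}),                              # P(s)
    cpt(('V', 'T'), {(0,): 0.01, (1,): 0.05}),            # P(t | v)
    cpt(('S', 'L'), {(0,): 0.01, (1,): 0.10}),            # P(l | s)
    cpt(('S', 'B'), {(0,): 0.30, (1,): 0.60}),            # P(b | s)
    cpt(('T', 'L', 'A'), {(0, 0): 0.0, (0, 1): 1.0,
                          (1, 0): 1.0, (1, 1): 1.0}),     # P(a | t, l)
    cpt(('A', 'X'), {(0,): 0.05, (1,): 0.98}),            # P(x | a)
    cpt(('A', 'B', 'D'), {(0, 0): 0.10, (0, 1): 0.80,
                          (1, 0): 0.70, (1, 1): 0.90}),   # P(d | a, b)
]

# Eliminate in the order from the slides; the remaining factor is P(D).
print(eliminate(factors, ['V', 'S', 'X', 'T', 'L', 'A', 'B']))
```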
Variable Elimination
We now understand variable elimination as a sequence of rewriting operations
The actual computation is done in the elimination steps.
Exactly the same computation procedure applies to Markov networks.
The computation depends on the order of elimination; we will return to this issue in detail.
Dealing with evidence
How do we deal with evidence?
Suppose we get evidence V = t, S = f, D = t. We want to compute P(L, V = t, S = f, D = t).
We start by writing the factors:
P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Since we know that V = t, we don't need to eliminate V. Instead, we can replace the factors P(V) and P(T | V) with
f_P(V) = P(V = t)    f_P(T|V)(T) = P(T | V = t)
These "select" the appropriate parts of the original factors given the evidence.
Note that f_P(V) is a constant, and thus does not appear in the elimination of other variables.
Dealing with Evidence
Given evidence V = t, S = f, D = t, we compute P(L, V = t, S = f, D = t).
Initial factors, after setting evidence:
f_P(v) f_P(s) f_P(t|v)(t) f_P(l|s)(l) f_P(b|s)(b) P(a | t, l) P(x | a) f_P(d|a,b)(a, b)
Eliminating x, we get
f_P(v) f_P(s) f_P(t|v)(t) f_P(l|s)(l) f_P(b|s)(b) P(a | t, l) f_x(a) f_P(d|a,b)(a, b)
Eliminating t, we get
f_P(v) f_P(s) f_P(l|s)(l) f_P(b|s)(b) f_t(a, l) f_x(a) f_P(d|a,b)(a, b)
Eliminating a, we get
f_P(v) f_P(s) f_P(l|s)(l) f_P(b|s)(b) f_a(b, l)
Eliminating b, we get
f_P(v) f_P(s) f_P(l|s)(l) f_b(l)
Complexity of variable elimination
Suppose in one elimination step we compute
f_x(y1, …, yk) = Σ_x f'_x(x, y1, …, yk)
where
f'_x(x, y1, …, yk) = Π_{i=1..m} f_i(x, y_{i,1}, …, y_{i,li})
This requires:
m · |Val(X)| · Π_i |Val(Yi)| multiplications: for each value of x, y1, …, yk, we do m multiplications
|Val(X)| · Π_i |Val(Yi)| additions: for each value of y1, …, yk, we do |Val(X)| additions
Complexity is exponential in the number of variables in the intermediate factor.
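As a sanity check on these counts, a tiny calculator (the counts follow the slide's formulas; the example sizes are arbitrary):

```python
from math import prod

def elimination_step_cost(m, val_x, val_ys):
    """(multiplications, additions) for one step, per the counts above:
    m factors multiplied into f', then x summed out."""
    mults = m * val_x * prod(val_ys)
    adds = val_x * prod(val_ys)
    return mults, adds

# e.g. m = 3 factors, binary x, two ternary y's:
print(elimination_step_cost(3, 2, [3, 3]))  # (54, 18)
```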
Understanding Variable Elimination
We want to select “good” elimination orderings that reduce complexity
We start by attempting to understand variable elimination via the graph we are working with
This will reduce the problem of finding a good ordering to a well-understood graph-theoretic operation.
Undirected graph representation
At each stage of the procedure, we have an algebraic term that we need to evaluate
In general this term is of the form
P(x1, …, xk) = Σ_{y1} ⋯ Σ_{yn} Π_i f_i(Zi)
where the Zi are sets of variables.
We now plot a graph where there is an undirected edge X–Y if X and Y are arguments of some factor, that is, if X and Y are in some Zi.
Note: this is the Markov network that describes the probability distribution on the variables we have not yet eliminated.
Undirected Graph Representation
Consider the "Asia" example. The initial factors are
P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
thus, the undirected graph connects the arguments of each factor: for example, T and L (the parents of A) become linked, as do A and B (the parents of D).
In the first step, this graph is just the moralized graph of the network.
[Figure: the Asia network and its moralized, undirected counterpart]
Undirected Graph Representation
Now we eliminate t, getting
P(v) P(s) P(l | s) P(b | s) f_t(v, a, l) P(x | a) P(d | a, b)
The corresponding change in the graph: T is removed, and its neighbors V, A, and L become mutually connected, since they are the arguments of the new factor f_t(v, a, l).
[Figure: the undirected graph before and after eliminating T]
Example
Want to compute P(L, V = t, S = f, D = t)
Moralizing; setting evidence.
[Figure: the moralized Asia graph with the evidence nodes V, S, D marked]
Eliminating x: new factor f_x(A).
Eliminating a: new factor f_a(b, t, l).
Eliminating b: new factor f_b(t, l).
Eliminating t: new factor f_t(l).
[Figures: the graph after each elimination step]
Elimination in Undirected Graphs
Generalizing, we see that we can eliminate a variable X by:
1. For all Y, Z such that Y–X and Z–X, add an edge Y–Z
2. Remove X and all edges adjacent to it
This procedure creates a clique that contains all the neighbors of X: after step 1 we have a clique that corresponds to the intermediate factor (before marginalization). The cost of the step is exponential in the size of this clique.
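A small sketch of this two-step procedure in code, with the graph represented as a dict from node to neighbor set (the node names are illustrative):

```python
def eliminate_node(graph, x):
    """Eliminate x: connect all of x's neighbors into a clique, remove x.
    Returns the clique corresponding to the intermediate factor."""
    neighbors = graph.pop(x)
    for y in neighbors:
        graph[y].discard(x)
        graph[y] |= neighbors - {y}   # step 1: add edges between neighbors
    return neighbors | {x}            # the step's cost is exponential in this size

# Moralized chain A - B - C: eliminating B links A and C.
g = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}}
print(eliminate_node(g, 'B'))  # {'A', 'B', 'C'}
print(g)                       # {'A': {'C'}, 'C': {'A'}}
```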
Undirected Graphs
The process of eliminating nodes from an undirected graph gives us a clue to the complexity of inference
To see this, we will examine the graph that contains all of the edges we added during the elimination. The resulting graph is always chordal.
Example
Want to compute P(D)
Moralizing.
[Figure: the moralized Asia graph]
Eliminating v: multiply to get f'_v(v, t); result f_v(t).
Eliminating x: multiply to get f'_x(a, x); result f_x(a).
Eliminating s: multiply to get f'_s(l, b, s); result f_s(l, b).
Eliminating t: multiply to get f'_t(a, l, t); result f_t(a, l).
Eliminating l: multiply to get f'_l(a, b, l); result f_l(a, b).
Eliminating a, b: multiply to get f'_a(a, b, d); result f(d).
[Figures: the graph after each elimination step, with the added edges shown]
The resulting graph is the induced graph (for this particular ordering).
Main property:
Every maximal clique in the induced graph corresponds to an intermediate factor in the computation.
Every factor stored during the process is a subset of some maximal clique in the graph.
These facts are true for any variable elimination ordering on any network.
Expanded Graphs
[Figure: the induced ("expanded") Asia graph for this ordering]
Induced Width (Treewidth)
The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination
This quantity (minus one) is called the induced width (or treewidth) of a graph according to the specified ordering
Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph
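In code, the induced width of a given ordering can be measured by replaying the elimination and tracking the largest clique created; a sketch under the same dict-of-sets representation as above:

```python
def induced_width(graph, order):
    """Largest clique size created by eliminating in this order, minus one."""
    g = {v: set(ns) for v, ns in graph.items()}  # work on a copy
    width = 0
    for x in order:
        neighbors = g.pop(x)
        width = max(width, len(neighbors))       # clique size minus one
        for y in neighbors:
            g[y].discard(x)
            g[y] |= neighbors - {y}
    return width

# A chain A - B - C - D eliminated end-to-end has induced width 1:
g = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'C'}}
print(induced_width(g, ['A', 'B', 'C', 'D']))  # 1
```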
Consequence: Elimination on Trees
Suppose we have a tree, i.e., a network where each variable has at most one parent. Then all the factors involve at most two variables, and thus the moralized graph is also a tree.
[Figure: a tree network over A, B, …, G and its moralized graph, which is the same tree]
Elimination on Trees
We can maintain the tree structure by eliminating extreme (leaf) variables of the tree.
[Figure: eliminating leaves of the tree keeps the remaining graph a tree]
Elimination on Trees
Formally, for any tree, there is an elimination ordering with treewidth = 1.
Thm: Inference on trees is linear in the number of variables.
PolyTrees
A polytree is a network where there is at most one path between any two variables.
Thm: Inference in a polytree is linear in the representation size of the network. This assumes a tabular CPT representation.
Can you see how the argument would work?
[Figure: a polytree over A, B, …, H]
General Networks
What do we do when the network is not a polytree? If the network has a cycle, the treewidth for any ordering is greater than 1.
Example
Eliminating A, B, C, D, E, …. The resulting graph is chordal with treewidth 2.
[Figures: a network over A, …, H containing cycles, and the graph after each elimination step]
Example
Eliminating H, G, E, C, F, D, B, A.
[Figures: the same network eliminated in a different order, showing the edges added at each step]
General Networks
From graph theory:
Thm: Finding an ordering that minimizes the treewidth is NP-hard.
However:
There are reasonable heuristics for finding "relatively" good orderings (one common greedy heuristic is sketched below).
There are provable approximations to the best treewidth.
If the graph has a small treewidth, there are algorithms that find it in polynomial time.
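As an illustration of such a heuristic (one common choice, not one specified in the slides), here is a sketch of greedy min-degree ordering: repeatedly eliminate a node of smallest current degree, simulating the fill-in as we go.

```python
def min_degree_order(graph):
    """Greedy min-degree elimination ordering for a dict-of-sets graph."""
    g = {v: set(ns) for v, ns in graph.items()}
    order = []
    while g:
        x = min(g, key=lambda v: len(g[v]))  # node of smallest current degree
        neighbors = g.pop(x)
        for y in neighbors:
            g[y].discard(x)
            g[y] |= neighbors - {y}          # simulate the elimination fill-in
        order.append(x)
    return order

# A 4-cycle A - B - C - D - A; any ordering here has induced width 2:
g = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'A', 'C'}}
print(min_degree_order(g))
```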