UIUC CS 598: Section EA Graphical Models Deepak Ramachandran Fall 2004 (Based on slides by Eyal Amir...
UIUC CS 598: Section EA
Graphical Models
Deepak Ramachandran
Fall 2004
(Based on slides by Eyal Amir, which were based on slides by Lise Getoor and Alvaro Cardenas (UMD), in turn based on slides by Nir Friedman (Hebrew U))
Today
1. Probabilistic graphical models
2. Inference
3. Junction Trees
Independent Random Variables
• Two variables X and Y are independent if
– P(X = x | Y = y) = P(X = x) for all values x, y
– That is, learning the value of Y does not change the prediction of X
• If X and Y are independent, then
– P(X,Y) = P(X|Y)P(Y) = P(X)P(Y)
• In general, if X1,…,Xp are independent, then
P(X1,…,Xp) = P(X1)…P(Xp)
– Requires only O(p) parameters
Bayes’ Theorem
Gives us a way to compute posterior probabilities:
P(X|Y)P(Y) = P(X,Y) = P(Y|X)P(X)
Therefore, P(X|Y) = P(Y|X)P(X) / P(Y)
P(X) – prior, P(Y) – evidence, P(X|Y) – posterior
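As a quick numeric sketch of the rule above (the prevalence and test numbers below are invented for illustration):

```python
# Hypothetical disease-test numbers: prior P(X=1) = 0.01,
# sensitivity P(Y=1 | X=1) = 0.9, false-positive rate P(Y=1 | X=0) = 0.05
p_x = 0.01
p_y_given_x = 0.9
p_y_given_not_x = 0.05

# Evidence term P(Y=1) by total probability
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Bayes' theorem: posterior P(X=1 | Y=1)
posterior = p_y_given_x * p_x / p_y
print(posterior)  # about 0.154: a positive test is still probably a false positive
```

Even a fairly accurate test yields a small posterior when the prior is small, which is exactly what the prior/evidence/posterior decomposition makes visible.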
Conditional Independence
• Unfortunately, most random variables of interest are not independent of each other
• A more suitable notion is that of conditional independence
• Two variables X and Y are conditionally independent given Z if
– P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all values x, y, z
– That is, learning the value of Y does not change the prediction of X once we know the value of Z
– Notation: I(X, Y | Z)
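A numeric sketch of the definition: build a joint that factorizes as P(z)P(x|z)P(y|z) (all numbers invented), and verify that conditioning on Y never changes the prediction of X once Z is known:

```python
from itertools import product

# Made-up conditional tables; the construction P(x,y,z) = P(z) P(x|z) P(y|z)
# guarantees I(X, Y | Z) by design.
p_z = {0: 0.3, 1: 0.7}
p_x_z = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}   # keyed (x, z)
p_y_z = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.5, (1, 1): 0.5}   # keyed (y, z)
joint = {(x, y, z): p_z[z] * p_x_z[(x, z)] * p_y_z[(y, z)]
         for x, y, z in product([0, 1], repeat=3)}

def p_x_given_yz(x, y, z):
    """P(X=x | Y=y, Z=z) computed from the joint table."""
    return joint[(x, y, z)] / sum(joint[(xx, y, z)] for xx in (0, 1))

def p_x_given_z(x, z):
    """P(X=x | Z=z) computed from the joint table."""
    num = sum(joint[(x, yy, z)] for yy in (0, 1))
    den = sum(joint[(xx, yy, z)] for xx in (0, 1) for yy in (0, 1))
    return num / den
```

Checking P(x | y, z) = P(x | z) for every assignment is exactly the definition on the slide, just done by brute-force table lookups.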
Example: Family trees
Noisy stochastic process:
Example: Pedigree
• A node represents an individual’s genotype
[Pedigree figure: Homer and Marge, parents of Bart, Lisa, and Maggie]
Modeling assumption: ancestors can affect descendants' genotype only by passing genetic materials through intermediate generations
Markov Assumption
• We now make this independence assumption more precise for directed acyclic graphs (DAGs)
• Each random variable X is independent of its non-descendants, given its parents Pa(X)
• Formally, I(X, NonDesc(X) | Pa(X))
[Figure: a node X with its parent, an ancestor, descendants Y1 and Y2, and non-descendants]
Markov Assumption Example
[Network: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call]
• In this example:
– I(E, B)
– I(B, {E, R})
– I(R, {A, B, C} | E)
– I(A, R | B, E)
– I(C, {B, E, R} | A)
I-Maps
• A DAG G is an I-Map of a distribution P if all Markov assumptions implied by G are satisfied by P
(Assuming G and P both use the same set of random variables)
Examples:
X Y
x y P(x,y)
0 0 0.25
0 1 0.25
1 0 0.25
1 1 0.25
X Y
x y P(x,y)
0 0 0.2
0 1 0.3
1 0 0.4
1 1 0.1
Factorization
• Given that G is an I-Map of P, can we simplify the representation of P?
• Example: X Y (two independent variables)
• Since I(X, Y), we have that P(X|Y) = P(X)
• Applying the chain rule: P(X,Y) = P(X|Y) P(Y) = P(X) P(Y)
• Thus, we have a simpler representation of P(X,Y)
Factorization Theorem
Thm: if G is an I-Map of P, then P(X1,…,Xp) = ∏i P(Xi | Pa(Xi))
Proof:
• By the chain rule: P(X1,…,Xp) = ∏i P(Xi | X1,…,Xi-1)
• wlog. X1,…,Xp is an ordering consistent with G
• From the assumption: Pa(Xi) ⊆ {X1,…,Xi-1} and {X1,…,Xi-1} \ Pa(Xi) ⊆ NonDesc(Xi)
• Since G is an I-Map, I(Xi, NonDesc(Xi) | Pa(Xi))
• Hence, I(Xi, {X1,…,Xi-1} \ Pa(Xi) | Pa(Xi))
• We conclude, P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi))
Factorization Example
[Network: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call]
By the chain rule:
P(C,A,R,E,B) = P(B)P(E|B)P(R|E,B)P(A|R,B,E)P(C|A,R,B,E)
versus the factorization:
P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
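The factorization can be checked numerically: plugging in CPTs for the network above (all numbers below are invented for illustration, not the classic textbook values), the product of local conditionals defines a proper joint distribution:

```python
from itertools import product

# Hypothetical CPTs for the Burglary/Earthquake network (numbers made up)
P_B = {1: 0.01, 0: 0.99}                       # P(b)
P_E = {1: 0.02, 0: 0.98}                       # P(e)
P_R = {(1, 1): 0.9, (0, 1): 0.1,               # P(r | e), keyed (r, e)
       (1, 0): 0.0, (0, 0): 1.0}
P_A1 = {(1, 1): 0.95, (1, 0): 0.94,            # P(A=1 | b, e), keyed (b, e)
        (0, 1): 0.29, (0, 0): 0.001}
P_C = {(1, 1): 0.7, (0, 1): 0.3,               # P(c | a), keyed (c, a)
       (1, 0): 0.01, (0, 0): 0.99}

def joint(c, a, r, e, b):
    """P(C,A,R,E,B) via the factorization P(B)P(E)P(R|E)P(A|B,E)P(C|A)."""
    p_a = P_A1[(b, e)] if a == 1 else 1 - P_A1[(b, e)]
    return P_B[b] * P_E[e] * P_R[(r, e)] * p_a * P_C[(c, a)]

# Because each local CPT is normalized, the product sums to 1 over all assignments
total = sum(joint(*v) for v in product([0, 1], repeat=5))
```

Note how the factorized form needs only 1 + 1 + 2 + 4 + 2 = 10 parameters, versus 31 for the full joint over five binary variables.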
Consequences
• We can write P in terms of “local” conditional probabilities
• If G is sparse, that is, |Pa(Xi)| < k, then each conditional probability can be specified compactly
– e.g. for binary variables, these require O(2^k) params
• The representation of P is then compact – linear in the number of variables
Summary
We defined the following concepts
• The Markov Independencies of a DAG G
– I(Xi, NonDesc(Xi) | Pai)
• G is an I-Map of a distribution P
– If P satisfies the Markov independencies implied by G
We proved the factorization theorem
• if G is an I-Map of P, then
P(X1,…,Xn) = ∏i P(Xi | Pai)
Conditional Independencies
• Let Markov(G) be the set of Markov Independencies implied by G
• The factorization theorem shows:
G is an I-Map of P ⇒ P(X1,…,Xn) = ∏i P(Xi | Pai)
• We can also show the opposite:
Thm: P(X1,…,Xn) = ∏i P(Xi | Pai) ⇒ G is an I-Map of P
Proof (Outline)
Example: X → Y, X → Z
P(X,Y,Z) = P(X) P(Y|X) P(Z|X)
P(Z | X,Y) = P(X,Y,Z) / P(X,Y) = P(X) P(Y|X) P(Z|X) / (P(X) P(Y|X)) = P(Z|X)
Implied Independencies
• Does a graph G imply additional independencies as a consequence of Markov(G)?
• We can define a logic of independence statements
• Some axioms:
– I(X; Y | Z) ⇒ I(Y; X | Z)
– I(X; Y1, Y2 | Z) ⇒ I(X; Y1 | Z)
d-separation
• A procedure d-sep(X; Y | Z, G) that, given a DAG G and sets X, Y, and Z, returns either yes or no
• Goal: d-sep(X; Y | Z, G) = yes iff I(X; Y | Z) follows from Markov(G)
Paths
• Intuition: dependency must “flow” along paths in the graph
• A path is a sequence of neighboring variables
Examples (in the Earthquake/Burglary network):
• R ← E → A ← B
• C ← A ← E → R
Paths
• We want to know when a path is
– active: creates dependency between the end nodes
– blocked: cannot create dependency between the end nodes
• We want to classify situations in which paths are active
[Figure: blocked vs. unblocked paths through E, R, A]
Path Blockage
Three cases:
– Common cause: e.g. R ← E → A. Blocked when the middle node (E) is in the evidence; active otherwise.
– Intermediate cause: e.g. E → A → C. Blocked when the middle node (A) is in the evidence; active otherwise.
– Common effect (v-structure): e.g. E → A ← B. Blocked when neither A nor any of its descendants (e.g. C) is in the evidence; active otherwise.
Path Blockage -- General Case
A path is active, given evidence Z, if
• Whenever we have the configuration A → B ← C, B or one of its descendants is in Z
• A node in any other configuration is not in Z
A path is blocked, given evidence Z, if it is not active.
d-Separation
• X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z
• Checking d-separation can be done efficiently (linear time in the number of edges)
– Bottom-up phase: mark all nodes whose descendants are in Z
– X-to-Y phase: traverse (BFS) all edges on paths from X to Y and check if they are blocked
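A brute-force sketch of the same test: enumerate every simple undirected path and check each consecutive triple against the blockage rules above. This is exponential in the number of paths (unlike the linear-time procedure just described), and the `{parent: [children]}` DAG encoding is just an illustrative convention:

```python
def descendants(dag, node):
    """All strict descendants of `node` in a DAG given as {parent: [children]}."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def simple_paths(dag, x, y):
    """All simple paths from x to y in the undirected skeleton."""
    nbrs = {}
    for u, children in dag.items():
        for c in children:
            nbrs.setdefault(u, set()).add(c)
            nbrs.setdefault(c, set()).add(u)
    out, stack = [], [[x]]
    while stack:
        path = stack.pop()
        if path[-1] == y:
            out.append(path)
            continue
        for n in nbrs.get(path[-1], ()):
            if n not in path:
                stack.append(path + [n])
    return out

def triple_blocked(dag, a, b, c, z):
    """Is the consecutive triple a-b-c blocked by evidence set z?"""
    collider = b in dag.get(a, []) and b in dag.get(c, [])  # a -> b <- c
    if collider:
        return b not in z and not (descendants(dag, b) & z)
    return b in z  # common cause or intermediate cause

def d_separated(dag, x, y, z):
    """Yes iff every path from x to y contains at least one blocked triple."""
    z = set(z)
    return all(any(triple_blocked(dag, p[i], p[i + 1], p[i + 2], z)
                   for i in range(len(p) - 2))
               for p in simple_paths(dag, x, y))

# The running example: E -> R, E -> A, B -> A, A -> C
dag = {'E': ['R', 'A'], 'B': ['A'], 'A': ['C']}
```

On this network, `d_separated(dag, 'R', 'B', set())` holds because the v-structure at A blocks the only path, while conditioning on A (or its descendant C) activates it.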
Example
[Network: E → R, E → A, B → A, A → C]
– d-sep(R,B)? = yes
– d-sep(R,B|A)? = no
– d-sep(R,B|E,A)? = yes
Soundness
Thm: If
– G is an I-Map of P
– d-sep(X; Y | Z, G) = yes
then
– P satisfies I(X; Y | Z)
Informally: any independence reported by d-separation is satisfied by the underlying distribution
Completeness
Thm: If d-sep(X; Y | Z, G) = no
then there is a distribution P such that
– G is an I-Map of P
– P does not satisfy I(X; Y | Z)
Informally: any independence not reported by d-separation might be violated by the underlying distribution
• We cannot determine this by examining the graph structure alone
Summary: Structure
• We explored DAGs as a representation of conditional independencies:
– Markov independencies of a DAG
– Tight correspondence between Markov(G) and the factorization defined by G
– d-separation, a sound & complete procedure for computing the consequences of the independencies
– Notion of minimal I-Map
– P-Maps
• This theory is the basis for defining Bayesian networks
Inference
• We now have compact representations of probability distributions:
– Bayesian Networks
– Markov Networks
• A network describes a unique probability distribution P
• How do we answer queries about P?
• We use inference as a name for the process of computing answers to such queries
Queries: Likelihood
• There are many types of queries we might ask
• Most of these involve evidence
– Evidence e is an assignment of values to a set E of variables in the domain
– Without loss of generality, E = {Xk+1, …, Xn}
• Simplest query: compute the probability of the evidence
P(e) = Σ_{x1} … Σ_{xk} P(x1, …, xk, e)
• This is often referred to as computing the likelihood of the evidence
Queries: A posteriori belief
• Often we are interested in the conditional probability of a variable given the evidence
P(X = x | e) = P(X = x, e) / P(e)
• This is the a posteriori belief in X, given evidence e
• A related task is computing the term P(X, e)
– i.e., the likelihood of e and X = x for each value x of X
– we can recover P(e) by summing: P(e) = Σ_x P(X = x, e)
A posteriori belief
This query is useful in many cases:
• Prediction: what is the probability of an outcome given the starting condition?
– The target is a descendant of the evidence
• Diagnosis: what is the probability of disease/fault given symptoms?
– The target is an ancestor of the evidence
Variable Elimination
General idea:
• Write the query in the form
P(X1, e) = Σ_{xk} … Σ_{x3} Σ_{x2} ∏i P(xi | pai)
Example
• The “Asia” network:
[Nodes: Visit to Asia (V), Smoking (S), Tuberculosis (T), Lung Cancer (L), Bronchitis (B), Abnormality in Chest (A), X-Ray (X), Dyspnea (D)]
P(v,s,t,l,a,b,x,d) = P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
• We want to compute P(d)
• Need to eliminate: v,s,x,t,l,a,b
“Brute force approach”
P(d) = Σ_v Σ_s Σ_t Σ_l Σ_a Σ_b Σ_x P(v,s,t,l,a,b,x,d)
Complexity is exponential in the number of variables: O(N^T), where T = number of variables and N = number of states for each variable
• Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
• We want to compute P(d)
• Need to eliminate: v,s,x,t,l,a,b
Eliminate: v
Compute: f_v(t) = Σ_v P(v) P(t|v)
Remaining factors:
f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Note: f_v(t) = P(t). In general, the result of elimination is not necessarily a probability term
• We want to compute P(d)
• Need to eliminate: s,x,t,l,a,b
• Current factors:
f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Eliminate: s
Compute: f_s(b,l) = Σ_s P(s) P(b|s) P(l|s)
Remaining factors:
f_v(t) f_s(b,l) P(a|t,l) P(x|a) P(d|a,b)
Note: summing on s results in a factor with two arguments, f_s(b,l). In general, the result of elimination may be a function of several variables
• We want to compute P(d)
• Need to eliminate: x,t,l,a,b
• Current factors:
f_v(t) f_s(b,l) P(a|t,l) P(x|a) P(d|a,b)
Eliminate: x
Compute: f_x(a) = Σ_x P(x|a)
Remaining factors:
f_v(t) f_s(b,l) f_x(a) P(a|t,l) P(d|a,b)
Note: f_x(a) = 1 for all values of a!
• We want to compute P(d)
• Need to eliminate: t,l,a,b
• Current factors:
f_v(t) f_s(b,l) f_x(a) P(a|t,l) P(d|a,b)
Eliminate: t
Compute: f_t(a,l) = Σ_t f_v(t) P(a|t,l)
Remaining factors:
f_s(b,l) f_x(a) f_t(a,l) P(d|a,b)
• We want to compute P(d)
• Need to eliminate: l,a,b
• Current factors:
f_s(b,l) f_x(a) f_t(a,l) P(d|a,b)
Eliminate: l
Compute: f_l(a,b) = Σ_l f_s(b,l) f_t(a,l)
Remaining factors:
f_x(a) f_l(a,b) P(d|a,b)
• We want to compute P(d)
• Need to eliminate: a,b
• Current factors:
f_x(a) f_l(a,b) P(d|a,b)
Eliminate: a, b
Compute: f_a(b,d) = Σ_a f_x(a) f_l(a,b) P(d|a,b), then f_b(d) = Σ_b f_a(b,d)
• Finally, P(d) = f_b(d)
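Each elimination step above is "multiply the factors mentioning the variable, then sum it out". A generic sketch over table factors is below; the three-node chain V → T → A and all its numbers are hypothetical stand-ins, not the actual Asia CPTs:

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors; a factor is (vars_tuple, table_dict)."""
    fv, ft = f
    gv, gt = g
    zv = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for assign in product([0, 1], repeat=len(zv)):
        a = dict(zip(zv, assign))
        table[assign] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return zv, table

def sum_out(f, var):
    """Eliminate `var` from a factor by summing it out."""
    fv, ft = f
    zv = tuple(v for v in fv if v != var)
    table = {}
    for assign, val in ft.items():
        key = tuple(x for v, x in zip(fv, assign) if v != var)
        table[key] = table.get(key, 0.0) + val
    return zv, table

# Hypothetical CPTs for a binary chain V -> T -> A
P_V = (('V',), {(0,): 0.99, (1,): 0.01})
P_T = (('T', 'V'), {(0, 0): 0.99, (1, 0): 0.01, (0, 1): 0.95, (1, 1): 0.05})
P_A = (('A', 'T'), {(0, 0): 0.98, (1, 0): 0.02, (0, 1): 0.40, (1, 1): 0.60})

f_v = sum_out(multiply(P_V, P_T), 'V')   # f_v(T) = sum_v P(v) P(T|v)
f_t = sum_out(multiply(f_v, P_A), 'T')   # f_t(A) = sum_t f_v(t) P(A|t); this is P(A)
```

Note that the cost of each step is driven by the size of the intermediate factor, not by the size of the whole network.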
• Different elimination ordering: a,b,x,t,v,s,l
• Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Intermediate factors:
g_a(l,t,d,b,x)
g_b(l,t,d,x,s)
g_x(l,t,d,s)
g_t(l,d,s,v)
g_v(l,d,s)
g_s(l,d)
g_l(d)
Complexity is exponential in the size of the factors!
Complexity of Inference
Thm:
Computing P(X = x) in a Bayesian network is NP-hard
Not surprising, since we can simulate Boolean gates.
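The reduction idea can be glimpsed in miniature: a deterministic CPT can encode a Boolean gate, so a marginal query amounts to counting satisfying assignments (a toy sketch, not the full NP-hardness reduction):

```python
from itertools import product

# Deterministic CPT encoding an AND gate: P(Z=1 | x, y) = 1 iff x and y.
# With uniform inputs, P(Z=1) is the fraction of satisfying assignments.
p_z1 = 0.0
for x, y in product([0, 1], repeat=2):
    p_x, p_y = 0.5, 0.5                 # uniform priors on the gate inputs
    p_z1 += p_x * p_y * (1 if (x and y) else 0)
print(p_z1)  # 0.25: exactly one of the four input assignments satisfies AND
```

Chaining such gate CPTs into a circuit and asking whether the output marginal is nonzero is essentially a satisfiability question, which is why exact inference is NP-hard in general.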
Variable Elimination
• We now understand variable elimination as a sequence of rewriting operations
• Actual computation is done in elimination step
• Computation depends on order of elimination
• Exactly the same computation procedure applies to Markov networks
Approaches to inference
• Exact inference
– Inference in simple chains
– Variable elimination
– Clustering / join tree algorithms
• Approximate inference
– Stochastic simulation / sampling methods
– Markov chain Monte Carlo methods
– Mean field theory (on Thursday)
Markov Networks (Undirected Graphical Models)
• A graph with hyper-edges (multi-vertex edges)
• Every hyper-edge e = (x1,…,xk) has a potential function fe(x1,…,xk)
• The probability distribution is
P(X1,…,Xn) = 1/Z ∏_{e∈E} fe(x1,…,xk)
where Z = Σ_{x1,…,xn} ∏_{e∈E} fe(x1,…,xk)
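A minimal numeric instance of the definition, with one pairwise potential whose values are arbitrary nonnegative numbers (made up here):

```python
from itertools import product

# A tiny Markov network: one edge X1 - X2 with a single pairwise potential
f = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}

# Partition function Z normalizes the product of potentials
Z = sum(f[assign] for assign in product([0, 1], repeat=2))
P = {assign: f[assign] / Z for assign in f}
```

Unlike CPTs, the potentials are not themselves probabilities; only after dividing by Z does the table become a distribution.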
Complexity of variable elimination
• Suppose in one elimination step we compute
f'_x(y1,…,yk) = Σ_x f(x, y1,…,yk)
where
f(x, y1,…,yk) = ∏_{i=1}^m f_i(x, y_{1,i},…,y_{li,i})
This requires
• m × |Val(X)| × ∏i |Val(Yi)| multiplications
– For each value of x, y1,…,yk, we do m multiplications
• |Val(X)| × ∏i |Val(Yi)| additions
– For each value of y1,…,yk, we do |Val(X)| additions
Complexity is exponential in the number of variables in the intermediate factor
Undirected graph representation
• At each stage of the procedure, we have an algebraic term that we need to evaluate
• In general this term is of the form:
P(x1,…,xk) = Σ_{y1} … Σ_{yn} ∏i fi(Zi)
where the Zi are sets of variables
• We now plot a graph where there is an undirected edge X-Y if X and Y are arguments of some factor
– that is, if X and Y are in some Zi
• Note: this is the Markov network that describes the probability distribution on the variables we have not yet eliminated
Constructing the Moral Graph
[Figure: a DAG with nodes A, B, C, D, E, F, G, H]
Constructing The Moral Graph
• Add undirected edges between all co-parents which are not currently joined
– Marrying parents
[Figure: the same DAG with co-parents joined]
Constructing The Moral Graph
• Add undirected edges between all co-parents which are not currently joined
– Marrying parents
• Drop the directions of the arcs
[Figure: the resulting undirected moral graph on A–H]
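The two moralization steps can be sketched directly; the small DAG at the end is a made-up example, not the A–H graph from the slide:

```python
from itertools import combinations

def moralize(dag):
    """Moral graph of a DAG given as {parent: [children]}:
    marry co-parents, then drop edge directions."""
    edges, parents = set(), {}
    for u, children in dag.items():
        for c in children:
            edges.add(frozenset((u, c)))           # keep the arc, undirected
            parents.setdefault(c, set()).add(u)
    for c, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            edges.add(frozenset((a, b)))           # marry co-parents of c
    return edges

# B and C are co-parents of D, so the moral graph gains the edge B-C
g = moralize({'A': ['B', 'C'], 'B': ['D'], 'C': ['D']})
```

Using `frozenset` pairs makes the edges direction-free by construction, which is exactly the "drop the directions" step.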
Chordal Graphs
• elimination ordering ⇒ undirected chordal graph
Graph:
• Maximal cliques are factors in elimination
• Factors in elimination are cliques in the graph
• Complexity is exponential in the size of the largest clique in the graph
[Figure: the Asia network and its chordal (triangulated) moral graph]
Induced Width
• The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination
• This quantity is called the induced width of a graph according to the specified ordering
• Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph
PolyTrees
• A polytree is a network where there is at most one path from one variable to another
Thm:
• Inference in a polytree is linear in the representation size of the network
– This assumes a tabular CPT representation
[Figure: a polytree on nodes A through H]
Junction Tree
• Why junction tree?
– More efficient for some tasks than variable elimination
– We can avoid cycles if we turn highly-interconnected subsets of the nodes into “supernodes” (clusters)
• Objective
– Compute P(V = v | E = e)
• v is a value of a variable V, and e is evidence for a set of variables E
Properties of Junction Tree
• An undirected tree
• Each node is a cluster (nonempty set) of variables
• Running intersection property:
– Given two clusters X and Y, all clusters on the path between X and Y contain X ∩ Y
• Separator sets (sepsets):
– The intersection of two adjacent clusters
Example: clusters ABD and ADE with sepset AD; clusters ADE and DEF with sepset DE
Properties of Junction Tree
• Belief potentials:
– Map each instantiation of a cluster or sepset into a real number
• Constraints:
– Consistency: for each cluster X and neighboring sepset S,
Σ_{X\S} φ_X = φ_S
– The joint distribution:
P(U) = ∏i φ_{Xi} / ∏j φ_{Sj}
Potentials
• Potentials:
– Denoted by φ_X : Ω_X → R≥0
• Marginalization
– φ_Y = Σ_{X\Y} φ_X, the marginalization of φ_X into Y ⊆ X
• Multiplication
– φ_Z = φ_X φ_Y, the multiplication of φ_X and φ_Y, where Z = X ∪ Y
Properties of Junction Tree
• If a junction tree satisfies the constraints, it follows that:
– For each cluster (or sepset) X, φ_X = P(X)
– The probability distribution of any variable V can be computed using any cluster (or sepset) X that contains V:
P(V) = Σ_{X\{V}} φ_X
Building Junction Trees
DAG → Moral Graph → Triangulated Graph → Identifying Cliques → Junction Tree
Triangulating
• An undirected graph is triangulated iff every cycle of length > 3 contains an edge that connects two nonadjacent nodes
[Figure: a triangulated version of the moral graph on A–H]
Identifying Cliques
• A clique is a subgraph of an undirected graph that is complete and maximal
[Figure: cliques of the triangulated graph: ABD, ADE, ACE, DEF, CEG, EGH]
Junction Tree
• A junction tree is a subgraph of the clique graph that
– is a tree
– contains all the cliques
– satisfies the running intersection property
Junction tree for the example (sepsets shown on the edges):
ABD -AD- ADE -AE- ACE -CE- CEG -EG- EGH
ADE -DE- DEF
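The running intersection property for this clique tree can be checked mechanically; the clusters and tree edges below are transcribed from the example, and the check itself is a straightforward sketch:

```python
from itertools import combinations

clusters = {name: set(name) for name in ['ABD', 'ADE', 'ACE', 'CEG', 'EGH', 'DEF']}
tree = [('ABD', 'ADE'), ('ADE', 'ACE'), ('ACE', 'CEG'),
        ('CEG', 'EGH'), ('ADE', 'DEF')]

def tree_path(tree, a, b):
    """The unique path between clusters a and b in the tree."""
    adj = {}
    for u, v in tree:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    stack = [[a]]
    while stack:
        p = stack.pop()
        if p[-1] == b:
            return p
        stack.extend(p + [n] for n in adj[p[-1]] if n not in p)

def running_intersection(tree, clusters):
    """Every pair's intersection must lie in every cluster on the path between them."""
    return all(clusters[a] & clusters[b] <= clusters[c]
               for a, b in combinations(clusters, 2)
               for c in tree_path(tree, a, b))
```

The sepsets on the edges are exactly these pairwise intersections for adjacent clusters, e.g. ABD ∩ ADE = AD.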
Principle of Inference
DAG
↓ (moralization, triangulation, clique identification)
Junction Tree
↓ Initialization
Inconsistent Junction Tree
↓ Propagation
Consistent Junction Tree
↓ Marginalization
P(V = v | E = e)
Example: Create Join Tree
HMM with 2 time steps:
X1 → X2, with emissions X1 → Y1 and X2 → Y2
Junction Tree:
(X1,Y1) -X1- (X1,X2) -X2- (X2,Y2)
Example: Initialization
Variable | Associated Cluster | Potential function
X1       | X1,Y1              | P(X1)
Y1       | X1,Y1              | P(X1) P(Y1|X1)
X2       | X1,X2              | P(X2|X1)
Y2       | X2,Y2              | P(Y2|X2)
Example: Collect Evidence
• Choose an arbitrary clique, e.g. X1,X2, where all potential functions will be collected
• Call recursively neighboring cliques for messages:
• 1. Call X1,Y1:
– 1. Projection: φ_{X1} = Σ_{{X1,Y1}\{X1}} φ_{X1,Y1} = Σ_{Y1} P(X1,Y1) = P(X1)
– 2. Absorption: φ_{X1,X2} = φ_{X1,X2} · φ_{X1} / φ_{X1}^old = P(X2|X1) P(X1) = P(X1,X2)
Example: Collect Evidence (cont.)
• 2. Call X2,Y2:
– 1. Projection: φ_{X2} = Σ_{{X2,Y2}\{X2}} φ_{X2,Y2} = Σ_{Y2} P(Y2|X2) = 1
– 2. Absorption: φ_{X1,X2} = φ_{X1,X2} · φ_{X2} / φ_{X2}^old = P(X1,X2) (unchanged, since the message is 1)
Example: Distribute Evidence
• Pass messages recursively to neighboring nodes
• Pass message from X1,X2 to X1,Y1:
– 1. Projection: φ_{X1} = Σ_{{X1,X2}\{X1}} φ_{X1,X2} = Σ_{X2} P(X1,X2) = P(X1)
– 2. Absorption: φ_{X1,Y1} = φ_{X1,Y1} · φ_{X1} / φ_{X1}^old = P(X1,Y1) · P(X1) / P(X1) = P(X1,Y1)
Example: Distribute Evidence (cont.)
• Pass message from X1,X2 to X2,Y2:
– 1. Projection: φ_{X2} = Σ_{{X1,X2}\{X2}} φ_{X1,X2} = Σ_{X1} P(X1,X2) = P(X2)
– 2. Absorption: φ_{X2,Y2} = φ_{X2,Y2} · φ_{X2} / φ_{X2}^old = P(Y2|X2) · P(X2) / 1 = P(Y2,X2)
Example: Inference with evidence
• Assume we want to compute: P(X2 | Y1=0, Y2=1) (state estimation)
• Assign likelihoods to the potential functions during initialization:
φ_{X1,Y1} = 0 if Y1 = 1; P(X1, Y1=0) if Y1 = 0
φ_{X2,Y2} = 0 if Y2 = 0; P(Y2=1 | X2) if Y2 = 1
Example: Inference with evidence (cont.)
• Repeating the same steps as in the previous case, we obtain:
φ_{X1,Y1} = 0 if Y1 = 1; P(X1, Y1=0, Y2=1) if Y1 = 0
φ_{X1} = P(X1, Y1=0, Y2=1)
φ_{X1,X2} = P(X1, Y1=0, X2, Y2=1)
φ_{X2} = P(Y1=0, X2, Y2=1)
φ_{X2,Y2} = 0 if Y2 = 0; P(Y1=0, X2, Y2=1) if Y2 = 1
• Normalizing φ_{X2} then gives P(X2 | Y1=0, Y2=1)
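A brute-force check of these final tables: with made-up HMM parameters, φ_{X2}(x2) = P(Y1=0, X2=x2, Y2=1) can be computed by summing over X1 directly, and normalizing it yields the posterior (all numbers below are hypothetical):

```python
# Hypothetical parameters for the 2-step binary HMM above
p_x1 = {0: 0.6, 1: 0.4}                              # P(X1)
p_trans = {(0, 0): 0.7, (0, 1): 0.3,                 # P(X2 | X1), keyed (x1, x2)
           (1, 0): 0.2, (1, 1): 0.8}
p_emit = {(0, 0): 0.9, (0, 1): 0.1,                  # P(Y | X), keyed (x, y)
          (1, 0): 0.2, (1, 1): 0.8}

# phi_X2(x2) = P(Y1=0, X2=x2, Y2=1), as in the final table
phi = {x2: sum(p_x1[x1] * p_emit[(x1, 0)] * p_trans[(x1, x2)] * p_emit[(x2, 1)]
               for x1 in (0, 1))
       for x2 in (0, 1)}

# Normalizing phi_X2 yields the a posteriori belief P(X2 | Y1=0, Y2=1)
posterior = {x2: phi[x2] / sum(phi.values()) for x2 in (0, 1)}
```

This reproduces by enumeration what the junction-tree propagation computes with messages; on larger chains the message passing avoids the exponential sum.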