Transcript of mgormley/courses/10701-f16/slides/lecture17-GM.pdf · Graphical Models and Exact Inference
Graphical Models and Exact Inference
Eric Xing
Lecture 17, November 7, 2016
Machine Learning 10-701, Fall 2016
Reading: Chap. 8, C.B. book
[Figure: a cellular signaling network over eight variables -- Receptor A (X1), Receptor B (X2), Kinase C (X3), Kinase D (X4), Kinase E (X5), TF F (X6), Gene G (X7), Gene H (X8)]

© Eric Xing @ CMU, 2006-2016 1
Recap of Basic Prob. Concepts

Representation: what is the joint probability distribution P(X1, X2, X3, X4, X5, X6, X7, X8) on multiple variables?
How many state configurations in total? --- 2^8
Do they all need to be represented? Do we get any scientific/medical insight?

Learning: where do we get all these probabilities? Maximum-likelihood estimation? But how much data do we need? Are there other estimation principles? Where do we put domain knowledge in terms of plausible relationships between variables, and plausible values of the probabilities?

Inference: if not all variables are observable, how do we compute the conditional distribution of latent variables given evidence? Computing P(H|A) would require summing over all 2^6 configurations of the unobserved variables.

[Figure: the eight-node network redrawn with letter labels A-H]

© Eric Xing @ CMU, 2006-2016 2
What is a Graphical Model? --- Multivariate Distribution in High-D Space

A possible world for cellular signal transduction:

[Figure: the signaling network over X1-X8]

© Eric Xing @ CMU, 2006-2016 3
GM: Structure Simplifies Representation

Dependencies among variables:

[Figure: the signaling network drawn across the cell membrane and cytosol, with edges capturing dependencies among X1-X8]

© Eric Xing @ CMU, 2006-2016 4
Probabilistic Graphical Models

If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

Why might we favor a PGM? Incorporation of domain knowledge and causal (logical) structures.

[Figure: the signaling network and its factored joint]

1+1+2+2+2+4+2+4 = 18, a 16-fold reduction from 2^8 in representation cost!

Stay tuned for what these independencies are!

© Eric Xing @ CMU, 2006-2016 5
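As a quick sanity check on the counting above, here is a minimal Python sketch; it assumes binary variables and reads the parent sets directly off the factorization on this slide:

```python
# Parent sets from the factorization above (binary variables assumed).
parents = {
    "X1": [], "X2": [], "X3": ["X1"], "X4": ["X2"], "X5": ["X2"],
    "X6": ["X3", "X4"], "X7": ["X6"], "X8": ["X5", "X6"],
}

# Each CPT over a binary child has one free parameter per joint setting
# of its parents: 2^|parents|.
factored_cost = sum(2 ** len(ps) for ps in parents.values())

# A full joint table over 8 binary variables needs 2^8 - 1 free parameters.
full_cost = 2 ** 8 - 1

print(factored_cost, full_cost)  # 18 255
```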
GM: Data Integration

[Figure: two instances of the signaling network over the same variables X1-X8, each tied to a different data source]
© Eric Xing @ CMU, 2006-2016 6
More Data Integration

Text + Image + Network → Holistic Social Media
Genome + Proteome + Transcriptome + Phenome + … → PanOmic Biology

© Eric Xing @ CMU, 2006-2016 7
Probabilistic Graphical Models

If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X2) P(X4|X2) P(X5|X2) P(X1) P(X3|X1) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

Why might we favor a PGM?
Incorporation of domain knowledge and causal (logical) structures
Modular combination of heterogeneous parts -- data fusion

[Figure: the network split into modules that are fused through shared variables]

2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 in representation cost!

© Eric Xing @ CMU, 2006-2016 8
Rational Statistical Inference

The Bayes Theorem (h a hypothesis, d the data):

p(h|d) = p(d|h) p(h) / Σ_{h'∈H} p(d|h') p(h')

posterior probability = likelihood × prior probability, normalized by the sum over the space of hypotheses

This allows us to capture uncertainty about the model in a principled way.

But how can we specify and represent a complicated model? Typically the number of genes to be modeled is on the order of thousands!

© Eric Xing @ CMU, 2006-2016 9
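The Bayes theorem above can be sketched numerically for a small discrete hypothesis space; the hypotheses and probability values below are made-up illustrations:

```python
# Bayes theorem over a discrete hypothesis space H:
# p(h|d) = p(d|h) p(h) / sum_h' p(d|h') p(h').
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}          # p(h), illustrative
likelihood = {"h1": 0.10, "h2": 0.40, "h3": 0.05}  # p(d | h), illustrative

# Normalizer: sum over the space of hypotheses.
evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

print(posterior)  # h2's strong likelihood outweighs its smaller prior
```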
GM: MLE and Bayesian Learning

Probabilistic statements about Θ are conditioned on the values of the observed variables A_obs and a prior p(Θ):

A = { (A,B,C,D,E,…) = (T,F,F,T,F,…),
      (A,B,C,D,E,…) = (T,F,T,T,F,…),
      ……
      (A,B,C,D,E,…) = (F,T,T,T,F,…) }

[Figure: copies of the A-H network, with nodes annotated by CPTs such as P(F|C,D)]

p(Θ | A) ∝ p(A | Θ) p(Θ)
posterior ∝ likelihood × prior

Θ_Bayes = ∫ Θ p(Θ | A) dΘ

© Eric Xing @ CMU, 2006-2016 10
Probabilistic Graphical Models

If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

Why might we favor a PGM?
Incorporation of domain knowledge and causal (logical) structures
Modular combination of heterogeneous parts -- data fusion
Bayesian philosophy: knowledge meets data

[Figure: the signaling network and its factored joint]

2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 in representation cost!

© Eric Xing @ CMU, 2006-2016 11
So What Is a PGM After All?

In a nutshell:
PGM = Multivariate Statistics + Structure
GM = Multivariate Obj. Func. + Structure

© Eric Xing @ CMU, 2006-2016 12
So What Is a PGM After All?

The informal blurb: it is a smart way to write/specify/compose/design exponentially-large probability distributions without paying an exponential cost, and at the same time endow the distributions with structured semantics.

A more formal description: it refers to a family of distributions on a set of random variables that are compatible with all the probabilistic independence propositions encoded by a graph that connects these variables.

[Figure: the A-H network]

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

© Eric Xing @ CMU, 2006-2016 13
Two types of GMs

Directed edges give causality relationships (Bayesian Network or Directed Graphical Model):

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

Undirected edges simply give correlations between variables (Markov Random Field or Undirected Graphical Model):

P(X1, X2, X3, X4, X5, X6, X7, X8)
= 1/Z exp{E(X1) + E(X2) + E(X3,X1) + E(X4,X2) + E(X5,X2) + E(X6,X3,X4) + E(X7,X6) + E(X8,X5,X6)}

[Figure: the signaling network drawn once with directed edges and once with undirected edges]

© Eric Xing @ CMU, 2006-2016 14
Towards structural specification of probability distribution

Separation properties in the graph imply independence properties about the associated variables.

For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents.

The Equivalence Theorem
For a graph G,
let D1 denote the family of all distributions that satisfy I(G),
let D2 denote the family of all distributions that factor according to G.
Then D1 ≡ D2.

© Eric Xing @ CMU, 2006-2016 15
Bayesian Networks

Structure: DAG

• Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket

• Local conditional distributions (CPDs) and the DAG completely determine the joint distribution

• Give causality relationships, and facilitate a generative process

[Figure: a DAG fragment around node X, with parent, child, ancestor, descendent, and children's co-parents (Y1, Y2) labeled]

© Eric Xing @ CMU, 2006-2016 16
Markov Random Fields

Structure: undirected graph

• Meaning: a node is conditionally independent of every other node in the network given its direct neighbors

• Local contingency functions (potentials) and the cliques in the graph completely determine the joint distribution

• Give correlations between variables, but no explicit way to generate samples

[Figure: an undirected graph fragment around node X with neighbors Y1, Y2]

© Eric Xing @ CMU, 2006-2016 17
GMs are your old friends

Density estimation: parametric and nonparametric methods
Regression: linear, conditional mixture, nonparametric
Classification: generative and discriminative approach
Clustering

[Figure: small GMs for each task, e.g. Q → X for density estimation and clustering, X → Y for regression, X with parameters (μ, σ)]

© Eric Xing @ CMU, 2006-2016 18
An (incomplete) genealogy of graphical models
(Picture by Zoubin Ghahramani and Sam Roweis)

© Eric Xing @ CMU, 2006-2016 19
Fancier GMs: machine translation

SMT: the HM-BiTAM model (B. Zhao and E.P. Xing, ACL 2006)

© Eric Xing @ CMU, 2006-2016 20
Fancier GMs: solid state physics

Ising/Potts model

© Eric Xing @ CMU, 2006-2016 21
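A minimal sketch of the Ising model as an undirected GM; the coupling strength, the 2x2 grid, and computing the partition function by brute-force enumeration are all illustrative choices (enumeration is only tractable because the grid is tiny):

```python
import itertools
import math

# Spins s_i in {-1, +1} on a 2x2 grid; P(s) = exp(J * sum_edges s_i s_j) / Z.
J = 0.5
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]  # 2x2 grid adjacency

def weight(s):
    """Unnormalized Boltzmann weight of a spin configuration."""
    return math.exp(J * sum(s[i] * s[j] for i, j in edges))

states = list(itertools.product([-1, 1], repeat=4))
Z = sum(weight(s) for s in states)     # partition function by enumeration
p_aligned = weight((1, 1, 1, 1)) / Z   # aligned states are favored when J > 0

print(round(p_aligned, 4))
```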
A Generative Scheme for model design
© Eric Xing @ CMU, 2006-2016 22
Why graphical models

A language for communication
A language for computation
A language for development

Origins: Wright 1920's. Independently developed by Spiegelhalter and Lauritzen in statistics and Pearl in computer science in the late 1980's.

© Eric Xing @ CMU, 2006-2016 23
Why graphical models

"Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms.

Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism."

--- M. Jordan

© Eric Xing @ CMU, 2006-2016 24
Bayesian Network: Factorization Theorem

Theorem: Given a DAG, the most general form of the probability distribution that is consistent with the (probabilistic independence properties encoded in the) graph factors according to "node given its parents":

P(X) = ∏_i P(X_i | X_{π_i})

where X_{π_i} is the set of parents of X_i, and d is the number of nodes (variables) in the graph.

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

[Figure: the signaling network X1-X8]

© Eric Xing @ CMU, 2006-2016 25
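The factorization theorem can be exercised on a toy three-node DAG; the graph and CPT numbers below are illustrative, and the point is that the factored product is automatically a normalized joint:

```python
import itertools

# Binary DAG A -> C <- B with illustrative CPTs; the joint is the product
# of "node given its parents" terms.
parents = {"A": [], "B": [], "C": ["A", "B"]}
cpt = {
    "A": {(): {0: 0.6, 1: 0.4}},
    "B": {(): {0: 0.7, 1: 0.3}},
    "C": {(a, b): {0: 1 - p, 1: p}
          for (a, b), p in {(0, 0): 0.1, (0, 1): 0.5,
                            (1, 0): 0.8, (1, 1): 0.9}.items()},
}

def joint(assign):
    p = 1.0
    for var, ps in parents.items():
        key = tuple(assign[q] for q in ps)   # parent values index the CPT
        p *= cpt[var][key][assign[var]]
    return p

total = sum(joint(dict(zip("ABC", vals)))
            for vals in itertools.product([0, 1], repeat=3))
print(total)  # the factored joint sums to 1 without any explicit normalizer
```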
Example: a pedigree of people (Genetic Pedigree)

[Figure: pedigree network with nodes A0, A1, Ag, B0, B1, Bg, M0, M1, F0, F1, Fg, C0, C1, Cg, Sg]

© Eric Xing @ CMU, 2006-2016 26
Specification of a BN

There are two components to any GM:
the qualitative specification
the quantitative specification

[Figure: the A-H graph (qualitative part) alongside a CPT P(F|C,D) over binary C and D (quantitative part)]

© Eric Xing @ CMU, 2006-2016 27
Qualitative Specification

Where does the qualitative specification come from?

Prior knowledge of causal relationships
Prior knowledge of modular relationships
Assessment from experts
Learning from data
We simply link a certain architecture (e.g. a layered graph) …

© Eric Xing @ CMU, 2006-2016 28
Bayesian Network: Factorization Theorem

Theorem: Given a DAG, the most general form of the probability distribution that is consistent with the (probabilistic independence properties encoded in the) graph factors according to "node given its parents":

P(X) = ∏_i P(X_i | X_{π_i})

where X_{π_i} is the set of parents of X_i, and d is the number of nodes (variables) in the graph.

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

[Figure: the signaling network X1-X8]

© Eric Xing @ CMU, 2006-2016 29
Local Structures & Independencies

Common parent (A ← B → C): fixing B decouples A and C.
"Given the level of gene B, the levels of A and C are independent."

Cascade (A → B → C): knowing B decouples A and C.
"Given the level of gene B, the level of gene A provides no extra prediction value for the level of gene C."

V-structure (A → C ← B): knowing C couples A and B, because A can "explain away" B w.r.t. C.
"If A correlates to C, then the chance for B to also correlate to C will decrease."

The language is compact, the concepts are rich!

© Eric Xing @ CMU, 2006-2016 30
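The "explaining away" effect in the v-structure can be checked numerically by brute-force enumeration; the priors and the CPT for C below are illustrative:

```python
import itertools

# V-structure A -> C <- B: marginally A and B are independent, but
# conditioning on C couples them.
pA, pB = 0.1, 0.1
pC = {(0, 0): 0.01, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99}  # P(C=1 | a, b)

def joint(a, b, c):
    p = (pA if a else 1 - pA) * (pB if b else 1 - pB)
    return p * (pC[a, b] if c else 1 - pC[a, b])

def p_a1(given):
    """P(A=1 | evidence), evidence being a dict over {'b', 'c'}."""
    states = [(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3)
              if all({"b": b, "c": c}[k] == v for k, v in given.items())]
    den = sum(joint(*s) for s in states)
    num = sum(joint(*s) for s in states if s[0] == 1)
    return num / den

# Learning that B=1 "explains away" the evidence C=1, lowering belief in A=1.
print(p_a1({"c": 1}), p_a1({"c": 1, "b": 1}))
```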
A simple justification

[Figure: a three-node chain over A, B, C]

© Eric Xing @ CMU, 2006-2016 31
Graph separation criterion

D-separation criterion for Bayesian networks (D for Directed edges):

Definition: variables x and y are D-separated (conditionally independent) given z if they are separated in the moralized ancestral graph.

Example: [figure]

© Eric Xing @ CMU, 2006-2016 32
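The moralized-ancestral-graph test just described can be sketched directly; `d_separated` and its graph encoding are hypothetical names for this illustration:

```python
# x and y are d-separated given z iff they are graph-separated after
# (1) restricting to x, y, z and their ancestors, (2) "marrying" co-parents
# and dropping edge directions (moralization), (3) deleting the given nodes.
def d_separated(dag, x, y, given):
    # dag maps each node to its list of parents.
    def ancestors(nodes):
        seen, stack = set(nodes), list(nodes)
        while stack:
            for p in dag[stack.pop()]:
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    keep = ancestors({x, y} | set(given))
    # Moralize: undirected parent-child edges plus parent-parent marriages.
    adj = {n: set() for n in keep}
    for n in keep:
        ps = [p for p in dag[n] if p in keep]
        for p in ps:
            adj[n].add(p)
            adj[p].add(n)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j])
                adj[ps[j]].add(ps[i])
    # Remove conditioning nodes, then test reachability from x to y.
    blocked = set(given)
    stack, seen = [x], {x}
    while stack:
        n = stack.pop()
        if n == y:
            return False
        for m in adj[n]:
            if m not in seen and m not in blocked:
                seen.add(m)
                stack.append(m)
    return True

# V-structure a -> c <- b: a and b are separated marginally, but not given c.
dag = {"a": [], "b": [], "c": ["a", "b"]}
print(d_separated(dag, "a", "b", []))     # True
print(d_separated(dag, "a", "b", ["c"]))  # False
```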
Local Markov properties of DAGs

Structure: DAG

• Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket

• Local conditional distributions (CPDs) and the DAG completely determine the joint distribution

• Give causality relationships, and facilitate a generative process

[Figure: a DAG fragment around node X, with parent, child, ancestor, descendent, and children's co-parents (Y1, Y2) labeled]

© Eric Xing @ CMU, 2006-2016 33
Global Markov properties of DAGs

X is d-separated (directed-separated) from Z given Y if we can't send a ball from any node in X to any node in Z using the "Bayes-ball" algorithm illustrated below (plus some boundary conditions).

• Defn: I(G) = all independence properties that correspond to d-separation:

I(G) = { X ⊥ Z | Y : dsep_G(X; Z | Y) }

• D-separation is sound and complete.

© Eric Xing @ CMU, 2006-2016 34
Example: complete the I(G) of this graph:

[Figure: a four-node DAG over x1, x2, x3, x4]

Essentially: a BN is a database of probabilistic independence statements among variables.

© Eric Xing @ CMU, 2006-2016 35
Towards quantitative specification of probability distribution

Separation properties in the graph imply independence properties about the associated variables.

For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents.

The Equivalence Theorem
For a graph G,
let D1 denote the family of all distributions that satisfy I(G),
let D2 denote the family of all distributions that factor according to G.
Then D1 ≡ D2.

© Eric Xing @ CMU, 2006-2016 36
Conditional probability tables (CPTs)

Graph: A → C ← B, C → D.    P(a,b,c,d) = P(a)P(b)P(c|a,b)P(d|c)

P(A):  a0 0.75,  a1 0.25
P(B):  b0 0.33,  b1 0.67

P(C | A,B):
       a0b0  a0b1  a1b0  a1b1
  c0   0.45  1     0.9   0.7
  c1   0.55  0     0.1   0.3

P(D | C):
       c0   c1
  d0   0.3  0.5
  d1   0.7  0.5

© Eric Xing @ CMU, 2006-2016 37
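These CPTs can be typed in directly to make the factored joint queryable; the table values are taken from this slide, with only the dictionary encoding being an illustrative choice:

```python
# CPTs for the network A -> C <- B, C -> D.
pA = {0: 0.75, 1: 0.25}
pB = {0: 0.33, 1: 0.67}
pC = {(0, 0): {0: 0.45, 1: 0.55}, (0, 1): {0: 1.0, 1: 0.0},
      (1, 0): {0: 0.9, 1: 0.1},   (1, 1): {0: 0.7, 1: 0.3}}  # P(c | a, b)
pD = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.5, 1: 0.5}}              # P(d | c)

def joint(a, b, c, d):
    """P(a, b, c, d) = P(a) P(b) P(c|a,b) P(d|c)."""
    return pA[a] * pB[b] * pC[a, b][c] * pD[c][d]

print(joint(0, 0, 1, 0))  # P(a0, b0, c1, d0) = 0.75 * 0.33 * 0.55 * 0.5
```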
Conditional probability density functions (CPDs)

Graph: A → C ← B, C → D.    P(a,b,c,d) = P(a)P(b)P(c|a,b)P(d|c)

A ~ N(μa, Σa)    B ~ N(μb, Σb)
C ~ N(A+B, Σc)
D ~ N(μa+C, Σa)

[Figure: P(D|C) plotted as a conditional Gaussian density over D and C]

© Eric Xing @ CMU, 2006-2016 38
Conditional Independencies

[Figure: label Y pointing to features X1, X2, …, Xn-1, Xn]

What is this model
1. When Y is observed?
2. When Y is unobserved?

© Eric Xing @ CMU, 2006-2016 39
Conditionally Independent Observations

[Figure: model parameters θ pointing to observations X1, X2, …, Xn-1, Xn; Data = {y1, …, yn}]

© Eric Xing @ CMU, 2006-2016 40
"Plate" Notation

[Figure: θ → Xi inside a plate, i = 1:n; Data = {x1, …, xn}; θ = model parameters]

Plate = rectangle in graphical model:
variables within a plate are replicated in a conditionally independent manner.

© Eric Xing @ CMU, 2006-2016 41
Example: Gaussian Model

[Figure: plate model, xi ← (μ, σ), i = 1:n]

Generative model:
p(x1, …, xn | μ, σ) = ∏_i p(xi | μ, σ)
= p(data | parameters)
= p(D | θ)
where θ = {μ, σ}

Likelihood = p(data | parameters) = p(D | θ) = L(θ)

Likelihood tells us how likely the observed data are conditioned on a particular setting of the parameters. Often easier to work with log L(θ).

© Eric Xing @ CMU, 2006-2016 42
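A minimal sketch of log L(θ) for this Gaussian model; the data values are illustrative:

```python
import math

# Log-likelihood of i.i.d. data under a univariate Gaussian N(mu, sigma^2).
def log_likelihood(data, mu, sigma):
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

data = [1.2, 0.8, 1.1, 0.9, 1.0]
# The MLE of mu is the sample mean, so it scores at least as high as any
# other choice of mu (for a fixed sigma).
mu_hat = sum(data) / len(data)
print(log_likelihood(data, mu_hat, 1.0), log_likelihood(data, 0.0, 1.0))
```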
Bayesian models

[Figure: plate model over xi, i = 1:n]

© Eric Xing @ CMU, 2006-2016 43
Summary

Represent dependency structure with a directed acyclic graph
  Node <-> random variable
  Edges encode dependencies; absence of edge -> conditional independence
  Plate representation
A GM is a database of probabilistic independence statements on variables
The factorization theorem of the joint probability
  Local specification -> globally consistent distribution
  Local representation for an exponentially complex state-space
It is a smart way to write/specify/compose/design exponentially-large probability distributions without paying an exponential cost, and at the same time endow the distributions with structured semantics
Supports efficient inference and learning

© Eric Xing @ CMU, 2006-2016 44
Inference and Learning

We now have compact representations of probability distributions: BN.

A BN M describes a unique probability distribution P.

Typical tasks:

Task 1: How do we answer queries about P?
We use inference as a name for the process of computing answers to such queries.

Task 2: How do we estimate a plausible model M from data D?
i. We use learning as a name for the process of obtaining a point estimate of M.
ii. But Bayesians seek p(M|D), which is actually an inference problem.
iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.

© Eric Xing @ CMU, 2006-2016 45
Inferential Query 1: Likelihood

Most of the queries one may ask involve evidence.

Evidence x_v is an assignment of values to a set X_v of nodes in the GM over variable set X = {X1, X2, …, Xn}.

Without loss of generality X_v = {X_{k+1}, …, X_n}.

Write X_H = X \ X_v for the set of hidden variables; X_H can be ∅ or X.

Simplest query: compute the probability of the evidence,

P(x_v) = Σ_{x_1} … Σ_{x_k} P(x_1, …, x_k, x_v)

This is often referred to as computing the likelihood of x_v.

© Eric Xing @ CMU, 2006-2016 46
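Computing the likelihood of evidence by brute-force summation over the hidden variables can be sketched on a toy chain; the graph and CPT numbers are illustrative:

```python
import itertools

# Toy binary chain A -> B -> C with illustrative CPTs.
pA = {0: 0.6, 1: 0.4}
pB = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # P(b | a)
pC = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}  # P(c | b)

def joint(a, b, c):
    return pA[a] * pB[a][b] * pC[b][c]

def likelihood(evidence):
    """P(x_v): sum the joint over all states consistent with the evidence."""
    return sum(joint(a, b, c)
               for a, b, c in itertools.product([0, 1], repeat=3)
               if all({"a": a, "b": b, "c": c}[k] == v
                      for k, v in evidence.items()))

print(likelihood({"c": 1}))  # sums over the hidden variables a and b
```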
Often we are interested in the conditional probability distribution of a variable given the evidence
this is the a posteriori belief in XH, given evidence xv
We usually query a subset Y of all hidden variables XH={Y,Z}and "don't care" about the remaining, Z:
the process of summing out the "don't care" variables z is called marginalization, and the resulting P(Y|xv) is called a marginal prob.
HxVHH
VH
V
VHVVH xxX
xXx
xXxXX),(
),()(
),()|(PP
PPP
z
VV xzZYxY )|,()|( PP
Inferential Query 2: Conditional Probability
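A small sketch of the a posteriori belief (my own illustration, with made-up CPTs) on a chain BN A → B → C: condition on C, marginalize out the don't-care variable B, and normalize.

```python
# Toy chain BN: A -> B -> C, binary variables; the CPT numbers are made up.
P_A = {0: 0.6, 1: 0.4}
P_B_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key (b, a): P(b|a)
P_C_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # key (c, b): P(c|b)

def posterior_A(c_obs):
    """P(A | C = c_obs): marginalize out the don't-care variable B, then normalize."""
    unnorm = {a: sum(P_A[a] * P_B_A[(b, a)] * P_C_B[(c_obs, b)] for b in (0, 1))
              for a in (0, 1)}
    z = sum(unnorm.values())           # the normalizer is the likelihood P(C = c_obs)
    return {a: p / z for a, p in unnorm.items()}

print(posterior_A(1))
```

Note that the denominator is exactly the likelihood from Query 1, so a conditional-probability query subsumes a likelihood query.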
Applications of a posteriori Belief

Prediction: what is the probability of an outcome given the starting condition?
 The query node is a descendant of the evidence.
Diagnosis: what is the probability of a disease/fault given symptoms?
 The query node is an ancestor of the evidence.
Learning under partial observation:
 Fill in the unobserved values under an "EM" setting (more later).

The directionality of information flow between variables is not restricted by the directionality of the edges in a GM: probabilistic inference can combine evidence from all parts of the network.
Inferential Query 3: Most Probable Assignment

In this query we want to find the most probable joint assignment (MPA) for some variables of interest.
Such reasoning is usually performed under some given evidence x_v, and ignoring (the values of) other variables Z:

 y* = argmax_y P(y | x_v) = argmax_y Σ_z P(y, Z = z | x_v)

This is the maximum a posteriori configuration of Y.
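A sketch of the MPA query on a toy chain BN A → B → C (made-up CPTs): the don't-care variable Z = {B} is summed out before maximizing, which in general gives a different answer than a joint MAP over (Y, Z).

```python
# Toy chain BN: A -> B -> C, binary; the CPT numbers are made up.
P_A = {0: 0.6, 1: 0.4}
P_B_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key (b, a): P(b|a)
P_C_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # key (c, b): P(c|b)

def mpa_A(c_obs):
    """argmax_a P(a | C = c_obs); B is summed out, not maximized over.
    The normalizer P(C = c_obs) is constant in a, so we can skip it."""
    score = {a: sum(P_A[a] * P_B_A[(b, a)] * P_C_B[(c_obs, b)] for b in (0, 1))
             for a in (0, 1)}
    return max(score, key=score.get)

print(mpa_A(1))  # most probable value of A given C = 1
```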
Complexity of Inference

Thm: Computing P(X_H = x_H | x_v) in an arbitrary GM is NP-hard.

Hardness does not mean we cannot solve inference:
 It implies that we cannot find a general procedure that works efficiently for arbitrary GMs.
 For particular families of GMs, we can have provably efficient procedures.
Approaches to inference

Exact inference algorithms:
 The elimination algorithm
 Belief propagation
 The junction tree algorithm (not covered in detail here)

Approximate inference techniques:
 Variational algorithms
 Stochastic simulation / sampling methods
 Markov chain Monte Carlo methods
Marginalization and Elimination

A food web: a DAG over A, B, …, H with edges B→C, A→D, C→E, D→E, A→F, E→G, E→H, F→H.

Query: P(h). What is the probability that hawks are leaving given that the grass condition is poor?

 P(h) = Σ_g Σ_f Σ_e Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e, f, g, h)

A naïve summation needs to enumerate over an exponential number of terms. By chain decomposition, we get

 P(h) = Σ_g Σ_f Σ_e Σ_d Σ_c Σ_b Σ_a P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) P(g|e) P(h|e,f)
Variable Elimination

Query: P(A | h). Need to eliminate: B, C, D, E, F, G, H.

Initial factors:

 P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) P(g|e) P(h|e,f)

Choose an elimination order: H, G, F, E, D, C, B.

Step 1: Conditioning (fix the evidence node h on its observed value h̃):

 m_h(e, f) = p(h̃ | e, f)

This step is isomorphic to a marginalization step:

 m_h(e, f) = Σ_h p(h | e, f) δ(h, h̃)
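In array terms, conditioning is just slicing the CPT at the observed value. A numpy sketch, with a random table standing in for p(h|e,f) (the numbers are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in CPT p(h | e, f) for binary variables, axes ordered (h, e, f).
# Entries over h sum to 1 for each (e, f); the values themselves are arbitrary.
p_h_given_ef = rng.random((2, 2, 2))
p_h_given_ef /= p_h_given_ef.sum(axis=0, keepdims=True)

h_obs = 1  # the observed value h~

# Conditioning = slicing the CPT at the observed value:
m_h = p_h_given_ef[h_obs]                # m_h(e, f) = p(h~ | e, f)

# Equivalently, marginalizing h against a delta function on h~:
delta = np.zeros(2)
delta[h_obs] = 1.0
m_h_alt = np.einsum('hef,h->ef', p_h_given_ef, delta)

print(np.allclose(m_h, m_h_alt))  # the two views agree
```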
Example: Variable Elimination

Query: P(A | h). Need to eliminate: B, C, D, E, F, G.

Factors after conditioning:

 P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) P(g|e) m_h(e, f)

Step 2: Eliminate G. Compute

 m_g(e) = Σ_g p(g | e) = 1

leaving

 P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) m_h(e, f)
Example: Variable Elimination

Query: P(A | h). Need to eliminate: B, C, D, E, F.

Current factors:

 P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) m_h(e, f)

Step 3: Eliminate F. Compute

 m_f(e, a) = Σ_f p(f | a) m_h(e, f)

leaving

 P(a) P(b) P(c|b) P(d|a) P(e|c,d) m_f(e, a)
Example: Variable Elimination

Query: P(A | h). Need to eliminate: B, C, D, E.

Current factors:

 P(a) P(b) P(c|b) P(d|a) P(e|c,d) m_f(e, a)

Step 4: Eliminate E. Compute

 m_e(a, c, d) = Σ_e p(e | c, d) m_f(e, a)

leaving

 P(a) P(b) P(c|b) P(d|a) m_e(a, c, d)
Example: Variable Elimination

Query: P(A | h). Need to eliminate: B, C, D.

Current factors:

 P(a) P(b) P(c|b) P(d|a) m_e(a, c, d)

Step 5: Eliminate D. Compute

 m_d(a, c) = Σ_d p(d | a) m_e(a, c, d)

leaving

 P(a) P(b) P(c|b) m_d(a, c)
Example: Variable Elimination

Query: P(A | h). Need to eliminate: B, C.

Current factors:

 P(a) P(b) P(c|b) m_d(a, c)

Step 6: Eliminate C. Compute

 m_c(a, b) = Σ_c p(c | b) m_d(a, c)

leaving

 P(a) P(b) m_c(a, b)
Example: Variable Elimination

Query: P(A | h). Need to eliminate: B.

Current factors:

 P(a) P(b) m_c(a, b)

Step 7: Eliminate B. Compute

 m_b(a) = Σ_b p(b) m_c(a, b)

leaving

 P(a) m_b(a)
Example: Variable Elimination

Query: P(A | h). Step 8: Wrap-up

 p(a, h̃) = p(a) m_b(a)

 p(h̃) = Σ_a p(a) m_b(a)

 P(a | h̃) = p(a) m_b(a) / Σ_a p(a) m_b(a)
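The eight steps above can be sketched end to end with numpy: factors are arrays, eliminating a variable is summing over its axis, and each einsum below mirrors one slide. The graph structure comes from the slides, but the CPT entries are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def cpt(*shape):
    """Random conditional table, normalized over the first axis (the child)."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

# Factorization from the food web; all variables binary.
p_a = cpt(2)            # P(a)
p_b = cpt(2)            # P(b)
p_c_b = cpt(2, 2)       # P(c|b)
p_d_a = cpt(2, 2)       # P(d|a)
p_e_cd = cpt(2, 2, 2)   # P(e|c,d)
p_f_a = cpt(2, 2)       # P(f|a)
p_g_e = cpt(2, 2)       # P(g|e)
p_h_ef = cpt(2, 2, 2)   # P(h|e,f)

h_obs = 1  # observed value h~

# Elimination order H, G, F, E, D, C, B (einsum letters name the axes):
m_h = p_h_ef[h_obs]                            # m_h(e,f) = p(h~|e,f)
m_g = p_g_e.sum(axis=0)                        # m_g(e) = sum_g p(g|e) = all ones
m_f = np.einsum('fa,ef->ea', p_f_a, m_h)       # m_f(e,a)
m_e = np.einsum('ecd,ea->acd', p_e_cd, m_f)    # m_e(a,c,d)
m_d = np.einsum('da,acd->ac', p_d_a, m_e)      # m_d(a,c)
m_c = np.einsum('cb,ac->ab', p_c_b, m_d)       # m_c(a,b)
m_b = np.einsum('b,ab->a', p_b, m_c)           # m_b(a)

posterior = p_a * m_b / (p_a * m_b).sum()      # P(a | h~)

# Sanity check against brute-force enumeration of the full joint:
joint = np.einsum('a,b,cb,da,ecd,fa,ge,hef->abcdefgh',
                  p_a, p_b, p_c_b, p_d_a, p_e_cd, p_f_a, p_g_e, p_h_ef)
brute = joint[..., h_obs].sum(axis=(1, 2, 3, 4, 5, 6))
print(np.allclose(posterior, brute / brute.sum()))
```

The largest intermediate factor here is m_e(a, c, d), a table over three variables, versus the 2^8-entry table the brute-force check builds.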
Complexity of variable elimination

Suppose in one elimination step we compute

 m_x(y_1, …, y_k) = Σ_x m'_x(x, y_1, …, y_k)

where

 m'_x(x, y_1, …, y_k) = Π_{i=1}^{k} m_{c_i}(x, y_{c_i})

This requires

 k · |Val(X)| · Π_i |Val(Y_{C_i})| multiplications
  (for each value of x, y_1, …, y_k, we do k multiplications)

 |Val(X)| · Π_i |Val(Y_{C_i})| additions
  (for each value of y_1, …, y_k, we do |Val(X)| additions)

Complexity is exponential in the number of variables in the intermediate factor.
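To make the counts concrete, a small sketch (my own illustration) tallies the operations for one elimination step, given the cardinality of X and of each incoming factor's scope Y_{C_i}:

```python
from math import prod

def elimination_step_cost(val_x, val_ycis):
    """Operation counts for m_x(y1..yk) = sum_x prod_i m_ci(x, y_ci).

    val_x: |Val(X)|; val_ycis: list of |Val(Y_ci)|, one per incoming factor.
    Uses the counting argument from the slide: k multiplications per cell of
    the intermediate factor m'_x, and |Val(X)| additions per output cell.
    """
    k = len(val_ycis)
    cells = val_x * prod(val_ycis)   # cells in the intermediate factor m'_x
    mults = k * cells
    adds = cells                     # = |Val(X)| * prod_i |Val(Y_ci)| additions
    return mults, adds

# Eliminating a binary X that touches 3 factors, each with a binary scope:
print(elimination_step_cost(2, [2, 2, 2]))  # (48, 16)
```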
Elimination Clique

Induced dependency during marginalization is captured in elimination cliques:
 Summation <-> elimination
 Intermediate term <-> elimination clique

(Figure: for the food-web example, the elimination cliques are {E,F,H}, {E,G}, {A,E,F}, {A,C,D,E}, {A,C,D}, {A,B,C}, {A,B}, matching the scopes of the intermediate factors above.)

Can this lead to a generic inference algorithm?
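The cliques can be read off mechanically by simulating node elimination on the moralized graph; a sketch of that standard procedure (my own code, order and edges taken from the example):

```python
# Elimination cliques induced by an ordering on the moralized food-web graph:
# the undirected skeleton of the BN plus moralizing edges C-D (parents of E)
# and E-F (parents of H).
edges = {('B', 'C'), ('A', 'D'), ('C', 'E'), ('D', 'E'), ('A', 'F'),
         ('E', 'G'), ('E', 'H'), ('F', 'H'),
         ('C', 'D'), ('E', 'F')}

adj = {v: set() for v in 'ABCDEFGH'}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

cliques = []
for node in 'HGFEDCB':               # elimination order from the slides
    nbrs = adj[node]
    cliques.append({node} | nbrs)    # the elimination clique for this step
    for u in nbrs:                   # add fill-in edges among the neighbors,
        adj[u] |= nbrs - {u}         # then remove the eliminated node
        adj[u].discard(node)
    del adj[node]

for c in cliques:
    print(sorted(c))
```

Each printed clique is exactly the scope of one intermediate factor (e.g. {A, C, D, E} for m_e), which is why the summation order controls the cost.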
From Elimination to Message Passing

Elimination <-> message passing on a clique tree (messages m_h, m_g, m_f, m_e, m_d, m_c, m_b flow along the tree toward the root), e.g.

 m_e(a, c, d) = Σ_e p(e | c, d) m_g(e) m_f(e, a)

Messages can be reused.
From Elimination to Message Passing

Elimination <-> message passing on a clique tree: another query ...
Messages m_f and m_h are reused; the others need to be recomputed.
From elimination to message passing

Recall the ELIMINATION algorithm:
 Choose an ordering in which the query node f is the final node.
 Place all potentials on an active list.
 Eliminate node i by removing all potentials containing i, taking the sum/product over x_i.
 Place the resultant factor back on the list.

For a TREE graph:
 Choose the query node f as the root of the tree.
 View the tree as a directed tree with edges pointing towards f.
 The elimination ordering is based on a depth-first traversal.
 Elimination of each node can be considered as message-passing (or Belief Propagation) directly along tree branches, rather than on some transformed graph.
 Thus, we can use the tree itself as a data structure to do general inference!
Message passing for trees

Let m_ij(x_i) denote the factor resulting from eliminating variables from below up to i, which is a function of x_i; in the standard sum-product form,

 m_ij(x_i) = Σ_{x_j} ψ(x_j) ψ(x_i, x_j) Π_{k ∈ N(j)\i} m_jk(x_j)

This is reminiscent of a message sent from j to i: m_ij(x_i) represents a "belief" of x_i from x_j!
Elimination on trees is equivalent to message passing along tree branches!
The message passing protocol: a two-pass algorithm

On the example tree (X_2 connected to X_1, X_3, and X_4), a first pass collects messages toward the root X_1: m_32(X_2), m_42(X_2), m_21(X_1). A second pass then distributes messages back down: m_12(X_2), m_23(X_3), m_24(X_4). Afterwards, every node holds the messages needed for its marginal.
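A minimal sketch of the two-pass protocol on this tree; the tree shape comes from the slide, while the potentials below are made-up random tables.

```python
import numpy as np

rng = np.random.default_rng(2)

# Star tree: X2 is connected to X1, X3, X4; all variables binary.
edges = [(2, 1), (2, 3), (2, 4)]
psi_node = {i: rng.random(2) for i in (1, 2, 3, 4)}    # singleton potentials
psi_edge = {e: rng.random((2, 2)) for e in edges}      # pairwise potentials

neighbors = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}

def pairwise(i, j):
    """Potential over (x_i, x_j), transposing the stored table if needed."""
    return psi_edge[(i, j)] if (i, j) in psi_edge else psi_edge[(j, i)].T

def message(j, i, msgs):
    """m_{j->i}(x_i) = sum_{x_j} psi(x_j) psi(x_i, x_j) prod_{k in N(j)\\i} m_{k->j}(x_j)."""
    p = psi_node[j].copy()
    for k in neighbors[j]:
        if k != i:
            p *= msgs[(k, j)]
    return pairwise(i, j) @ p

# Collect pass toward root X1, then distribute pass away from it:
msgs = {}
for j, i in [(3, 2), (4, 2), (2, 1), (1, 2), (2, 3), (2, 4)]:
    msgs[(j, i)] = message(j, i, msgs)

def belief(i):
    """Normalized product of the local potential and all incoming messages."""
    b = psi_node[i].copy()
    for k in neighbors[i]:
        b *= msgs[(k, i)]
    return b / b.sum()

# On a tree, each belief is the exact marginal; check against brute force:
joint = np.einsum('a,b,c,d,ab,bc,bd->abcd', psi_node[1], psi_node[2],
                  psi_node[3], psi_node[4], pairwise(1, 2),
                  pairwise(2, 3), pairwise(2, 4))
joint /= joint.sum()
print(np.allclose(belief(1), joint.sum(axis=(1, 2, 3))))
```

After the two passes, every node's marginal comes from locally stored messages, so all marginals cost the same as two elimination runs.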
Belief Propagation (SP-algorithm): Sequential implementation
Inference on general GM

Now, what if the GM is not a tree-like graph?
 Can we still directly run the message-passing protocol along its edges?
 For non-trees, we do not have the guarantee that message-passing will be consistent!

Then what? Construct a graph data structure from P that has a tree structure, and run message-passing on it!
 This is the junction tree algorithm.
Summary

The simple Eliminate algorithm captures the key algorithmic operation underlying probabilistic inference: taking a sum over a product of potential functions.

The computational complexity of the Eliminate algorithm can be reduced to purely graph-theoretic considerations.
 This graph interpretation will also provide hints about how to design improved inference algorithms.

What can we say about the overall computational complexity of the algorithm? In particular, how can we control the "size" of the summands that appear in the sequence of summation operations?