
Probabilistic Graphical Models

Sargur N. Srihari
Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY, USA

Synonyms

Bayesian networks; Markov networks; Markov random fields

Glossary

Bayesian network (BN) A directed graph whose nodes represent variables, and edges represent influences. Together with conditional probability distributions, a Bayesian network represents the joint probability distribution of its variables.

Conditional probability distribution Assignment of probabilities to all instances of a set of variables when the value of one or more variables is known.

Conditional random field (CRF) A partially directed graph that represents a conditional distribution.

Factor graph A type of parameterization of PGMs in the form of a bipartite graph of factor nodes and variable nodes, where a factor node indicates that the variable nodes connected to it form a clique in the PGM.

Graph A set of nodes and edges, where edges connect pairs of nodes.

Inference Process of answering queries using the distribution as the model of the world.

Joint probability distribution Assignment of probabilities to each instance of a set of random variables.

Log-linear model A Markov network represented using features and energy functions.

Markov network (MN) An undirected graph whose nodes represent variables, and edges represent influences. Together with factors defined over subsets of variables, a Markov network represents the joint probability distribution of its variables.

Markov random field (MRF) Synonymous with Markov network. The term is more commonly used in computer vision.

© Springer Science+Business Media LLC 2017. R. Alhajj, J. Rokne (eds.), Encyclopedia of Social Network Analysis and Mining, DOI 10.1007/978-1-4614-7163-9_156-1


Partially directed graph A PGM with both directed and undirected edges.

Probabilistic graphical model (PGM) A graphical representation of joint probability distributions where nodes represent variables, and edges represent influences.

Definition

Probabilistic graphical models (PGMs), also known as graphical models, are representations of probability distributions over several variables. They use a graph-theoretic representation where nodes correspond to random variables and edges correspond to interactions between them. When the edges are directed, they are known as Bayesian networks (BNs). Since the edges of a BN typically represent causality between variables, they are also referred to as causal BNs. PGMs with undirected edges are known as Markov networks (MNs) or Markov random fields (MRFs).

Introduction

Many automation tasks require reasoning to reach conclusions and perform actions. Examples are (i) a medical artificial intelligence (AI) program uses patient symptoms and test results to determine disease and propose a treatment, (ii) an autonomous vehicle obtains its location information from cameras and sonar and determines a route towards its destination, and (iii) an interactive on-line assistant responds to a spoken request and retrieves relevant data.

PGMs are declarative representations where the knowledge representation and reasoning aspects are kept separate. They provide a powerful framework for modeling the joint distribution of a large number n of random variables w = {X1, X2, ..., Xn}. PGMs use graphical representations that consist of nodes (also called vertices) and edges (also called links), where each node represents a random variable (or a group of random variables) and links express influences between variables.

They allow distributions to be written tractably even when the explicit representation of the joint distribution is astronomically large: when the set of possible values of w, Val(w), is very large (exponential in n), PGMs exploit independencies between variables, thereby resulting in great savings in the number of parameters needed for representing full joint distributions.

PGMs are used to answer queries of interest, such as the probability of a particular assignment of the values of all the variables, i.e., x ∈ Val(w). Other queries of interest are the conditional probability of latent variables given values of observable variables, the maximum a posteriori probability of variables of interest, the probability of a particular outcome when a causal variable is set to a particular value, etc. Answers are produced using an inference procedure.

Key Points

PGMs provide (i) a simple way to visualize the structure of a probabilistic model, (ii) insights into properties of the model, e.g., conditional independence properties obtained by inspecting the graph, and (iii) a way to express the complex computations required for inference and learning as graphical manipulations.

A powerful aspect of graphical models is that it is not necessary to state whether the distributions they represent are discrete or continuous: a specific graph can make probabilistic statements about a broad class of distributions. The theory of PGM representation and analysis is a marriage between graph theory and probability theory. The graph-theoretic representation augments analysis instead of using pure algebra.

Historical Background

Probability theory was developed to represent uncertainty. Gerolamo Cardano (1501–1576) was possibly the earliest to formulate a theory of chance. The French mathematicians Blaise Pascal (1623–1662) and Pierre-Simon Laplace (1749–1827) laid the foundations, with Laplace's major contribution to probability theory appearing in 1812. The English clergyman Thomas Bayes (1701–1761) stated the theorem named after him, which relates conditional and marginal probabilities of variables by a simple application of the sum and product rules of probability.

The use of PGMs allows the application of principles of probability theory to large sets of variables which would otherwise be computationally infeasible. An early use of BNs, before the general framework was defined, was in genetic modeling of the transmission of certain properties such as blood type from parent to child. BNs, as diagrammatic representations of causal probability distributions, were first defined by the computer scientist Judea Pearl (1988). A BN is not necessarily based on the fully Bayesian approach of converting prior distributions of parameters to posterior distributions, although that approach becomes useful when data sets are limited.

A Markov process, named after the Russian mathematician Andrey Markov (1856–1922), describes the linear dependency of a variable on its previous states in a chain. Markov random fields were a generalization to model the two-dimensional dependency of a pixel on other pixels. MNs with log-linear representations have been around for a long time, with their origins in statistical physics. In the Ising model, which is due to the physicist Ernst Ising (1900–1998), the energy of a physical system of interacting atoms is determined from their spin, where each atom's spin is the sum of its electron spins. Each atom is characterized by a binary random variable Xi ∈ {+1, −1} whose value xi is the direction of its spin. Its energy function has the parametric form εij(xi, xj) = −wij xi xj, which is symmetric in Xi, Xj. Ising models are used to answer questions concerning an infinite number of atoms, e.g., determining the probability of a configuration where the majority of spins are +1 (or −1) versus more mixed ones. The answer depends on the strength of interactions, e.g., as obtained by multiplying all weights by a temperature parameter.

BNs are popular in AI and statistics. MNs, which are better suited to express soft constraints between variables, are popular in computer vision and text analytics.

Probabilistic Graphical Models

This discussion is divided into three parts: representation of PGMs, inference using PGMs, and learning of PGMs.

Representation

PGMs are declarative representations where knowledge representation is kept separate from reasoning. This has the advantage that reasoning algorithms can be developed independently of the domain, and domain knowledge can be improved without needing to modify reasoning algorithms. PGMs where graphs are directed acyclic graphs and directionality is associated with edges, which typically express causal relationships, are known as BNs. PGMs where links are undirected, i.e., do not have directionality, correspond to MNs, or Markov random fields (MRFs).

Bayesian Networks

A BN represents a joint probability distribution P over multiple variables w by means of a directed graph G. Edges in the graph represent influences between variables represented as nodes. While the influences are often causal, as determined by domain experts, they need not be so. Conditional probability distributions (CPDs) represent the local conditional distributions P(Xi | pa(Xi)), where pa(Xi) are the parents of Xi. The joint distribution has the factorization P(w) = ∏i=1..n P(Xi | pa(Xi)), which is the chain rule of BNs.

A BN G implicitly encodes a set of conditional independence assumptions I(G). Each independence is of the form (X ⊥ Y | Z), which can be read as: X is independent of Y given Z. If P is a probability distribution with independencies I(P), then G is an I-map of P if I(G) ⊆ I(P). If P factorizes according to G then G is an I-map of P. This is the key property allowing a compact representation, and it is crucial for understanding network behavior. G is a minimal I-map of P if removing a single edge renders it not an I-map. G is a perfect map for P if I(G) = I(P). Unfortunately, not every distribution has a perfect map. When many variable independencies are present, the complexity of the BN decreases.
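The chain rule of BNs can be made concrete with a few lines of code. The following is a minimal sketch, not taken from the article: the three-node network (Rain → WetGrass ← Sprinkler) and all CPT entries are hypothetical, chosen only to show how local CPDs multiply into a joint distribution.

```python
# Minimal sketch (hypothetical network and numbers): the BN chain rule
# P(w) = prod_i P(X_i | pa(X_i)) on a 3-node network Rain -> WetGrass <- Sprinkler.
from itertools import product

p_rain = {True: 0.2, False: 0.8}            # P(Rain)
p_sprinkler = {True: 0.4, False: 0.6}       # P(Sprinkler)
p_wet = {                                   # P(WetGrass = True | Rain, Sprinkler)
    (True, True): 0.99, (True, False): 0.80,
    (False, True): 0.90, (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """Chain rule: P(R, S, W) = P(R) * P(S) * P(W | R, S)."""
    p_w_true = p_wet[(rain, sprinkler)]
    p_w = p_w_true if wet else 1.0 - p_w_true
    return p_rain[rain] * p_sprinkler[sprinkler] * p_w

# Sanity check: the joint distribution sums to 1 over all assignments.
total = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3))
print(round(total, 6))  # 1.0
```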


Local Models When the variables are discrete-valued, the CPDs, which define local distributions of the form P(Y | X1, ..., Xn), can be represented as conditional probability tables (CPTs), where each entry is the probability of the value of Y given the values of its parents. While CPTs are commonly used, they have some disadvantages, e.g., when the random variables have infinite domains as in the case of continuous variables. Also, in the discrete case, when n is large, the CPTs grow exponentially. To alleviate this, CPDs can be viewed as functions that return the conditional probability when given the value of Y and its parents. Furthermore, such functions can be represented in such a way as to exploit structure present in the distributions.

The simplest non-tabular CPD is a deterministic CPD where the value taken by Y is based on a deterministic function of the values of its parents {Xi}. An example is one where all the parents are binary-valued and Y is the logical OR of its parents. Such a representation can be very useful to reduce the indegree of subsequent variables and thereby reduce the complexity of inference.

In a context-specific CPD several values of {Xi} define the same conditional distribution. Examples of context-specific CPDs are trees and rules. In a CPD represented as a tree, there are leaf nodes and interior nodes. Each leaf node is associated with a distribution of Y, while the path to that leaf node defines the values taken by {Xi}. In a rule-based CPD each assignment of values to {Xi} specifies the probability of a value assignment to Y. Tree and rule-based CPDs have several advantages: they are easy to understand, and they can be automatically constructed from data.

Another type of CPD structure arises with independence of causal influence (ICI), where the combined influence of {Xi} is a simple combination of the influence of each Xi on Y in isolation. Two very useful models of this type are the noisy-or and the class of generalized linear models. With the noisy-or, Y is binary-valued and the parents have independent parameters to activate Y. It is used widely in the medical domain, e.g., a symptom variable such as Fever has a very large number of parents corresponding to diseases whose causal mechanisms are independent. In a generalized linear model based on soft linear functions, Y is binary-valued and the logistic CPD is a sigmoid function with weights {wi}, i = 0, ..., n, i.e., σ(w0 + Σi wi Xi). If Y is multi-valued, a multinomial logistic function is defined using the soft-max function. The number of parameters in the CPD of an ICI model is linear in the number of parents rather than exponential.
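The noisy-or combination rule is easy to state in code. The sketch below is illustrative only: the disease parents (Flu, Malaria), their activation probabilities, and the leak term are hypothetical values, not part of the article.

```python
# Minimal sketch (hypothetical numbers): a noisy-or CPD. Each parent X_i
# independently "activates" binary Y with probability p_i; `leak` accounts
# for unmodeled causes. Parameters grow linearly with the number of parents.
def noisy_or(parent_values, activation_probs, leak=0.0):
    """P(Y = 1 | x_1, ..., x_n) under the noisy-or model."""
    p_y_off = 1.0 - leak
    for x, p in zip(parent_values, activation_probs):
        if x:                       # only active parents contribute
            p_y_off *= (1.0 - p)
    return 1.0 - p_y_off

# E.g., Fever with two hypothetical disease parents: Flu and Malaria.
print(noisy_or([1, 0], activation_probs=[0.6, 0.9], leak=0.01))  # Flu only
print(noisy_or([1, 1], activation_probs=[0.6, 0.9], leak=0.01))  # both present
```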

The case of continuous variables can be handled well by BNs. The dependency of a continuous variable Y on a continuous parent X can be modeled as one where Y is Gaussian and the parameters depend on X, e.g., the mean of Y is a linear function of X and the variance of Y does not depend on X. A linear Gaussian model generalizes this to several continuous parents, i.e., the mean of Y is a weighted sum of the parent variables. BNs based on the linear Gaussian model provide an alternative representation of multivariate Gaussian distributions, one that more directly reveals the underlying structure.

When parents are both discrete and continuous we have a hybrid CPD; its form depends on whether the child is continuous or discrete. In the case when the child Y is continuous, we can define a conditional linear Gaussian (CLG) CPD as follows: if U = {U1, ..., Um} are discrete parents and V = {V1, ..., Vk} are continuous parents, then for every value u ∈ Val(U) we have a set of k + 1 coefficients {au,i}, i = 0, ..., k, and a variance σ²u, such that P(Y | u, v) = N(au,0 + Σi=1..k au,i vi; σ²u). A BN is called a CLG network if every discrete variable has only discrete parents and every continuous variable has a CLG CPD. A CLG model induces a joint distribution that is a mixture of Gaussians. Each instantiation of the discrete network variables contributes a component Gaussian.

A CLG model does not allow continuous variables to have discrete children. In a hybrid model where the child Y is discrete and the parent is continuous, we can use a multinomial distribution where for each assignment y we have a different continuous distribution over the parent.
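The CLG CPD can be sketched directly from its definition. In the sketch below, the discrete parent values ("low"/"high"), coefficients, and variances are assumed for illustration only.

```python
# Minimal sketch (assumed coefficients): a conditional linear Gaussian (CLG) CPD.
# For each value u of the discrete parents there are coefficients a_u0..a_uk and
# a variance sigma2_u; the child is Gaussian with mean a_u0 + sum_i a_ui * v_i.
import math
import random

clg_params = {  # hypothetical parameters keyed by the discrete parent value u
    "low":  {"a": [1.0, 0.5, -0.2], "sigma2": 0.25},   # a = [a_u0, a_u1, a_u2]
    "high": {"a": [3.0, 1.2,  0.4], "sigma2": 1.00},
}

def clg_mean(u, v):
    p = clg_params[u]
    return p["a"][0] + sum(a * vi for a, vi in zip(p["a"][1:], v))

def clg_density(y, u, v):
    """p(Y = y | u, v) = N(y; a_u0 + sum_i a_ui v_i, sigma2_u)."""
    var = clg_params[u]["sigma2"]
    return math.exp(-(y - clg_mean(u, v)) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def clg_sample(u, v):
    return random.gauss(clg_mean(u, v), math.sqrt(clg_params[u]["sigma2"]))

print(clg_density(2.0, "low", [1.0, 2.0]))
print(clg_sample("high", [1.0, 2.0]))
```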

Independencies Independence properties are exploited to reduce the computation of inference, i.e., answering queries. Separation between nodes in a directed graph, called d-separation, allows one to determine whether an independence (X ⊥ Y | Z) holds in a distribution associated with BN structure G. BNs have two types of independencies: (i) local independencies: each node is independent of its non-descendants given its parents, and (ii) global independencies induced by d-separation. These two sets of independencies are equivalent. D-separation refers to four cases involving three variables X, Y, and Z as follows: indirect causal effect (X → Z → Y), indirect evidential effect (Y → Z → X), common cause (X ← Z → Y), and common effect (X → Z ← Y). In the first three cases, if Z is observed then it blocks influence between X and Y. In the last case, known as a v-structure, an observed Z enables influence.

Reasoning in a BN strongly depends on connectivity. Reasoning can be top-down, called causal reasoning, or bottom-up, called evidential reasoning. Another type of reasoning is intercausal reasoning, one example of which is explaining away, where different causes of the same effect can interact. In another type of intercausal reasoning, parent nodes can increase the probability of a child node.

Causality While a BN captures conditional independences in a distribution, the causal structure is not necessarily meaningful, e.g., the directionality can even be antitemporal. In a good BN structure, an edge X → Y should suggest that X causes Y either directly or indirectly. While BNs with causal structure are likely to be sparser and more natural, the answers we obtain to probabilistic queries are the same. While X → Y and Y → X are equivalent probabilistic models, they are very different causal models.

Causal models are important when we need to make interventions. Examples of causal queries involving intervention are: will preventing smoking in public places decrease the frequency of lung cancer, will strengthening family interactions (social capital) result in increased student scores, etc. One approach to modeling causal relationships is to use ideal interventions. An ideal intervention, written as do(Z := z), is one whose only effect is to force the variable Z to take the value z, with no other effect on other variables. The answer to an intervention query P(Y | do(z), X = x) is generally quite different from the answer to the probabilistic query P(Y | Z = z, X = x).

The identifiability of causality is complicated by the fact that correlation between two variables arises in multiple settings: when X causes Y, when Y causes X, or when X and Y have a common cause. If the common cause W is observable, we can disentangle the correlation between X and Y induced by W and determine the residual correlation that is directly causal. However, there usually is a large set of latent variables that we cannot observe. Fortunately, it is possible to answer, at least sometimes, causal questions in models with latent variables using only observed correlations. The intervention queries that can be answered using only conditional probabilities, which are said to be identifiable, can sometimes be determined using query simplification rules.

Markov Networks

When no natural directionality exists between variables, MNs offer a simpler perspective than directed graphs. Moreover, there is no guarantee of a perfect map in a BN since the independencies imposed may be inappropriate for the distribution; in a perfect map the graph precisely captures the independencies in the given distribution.

Parameterizations A MN represents a joint probability distribution P over multiple variables w by means of an undirected graph G whose nodes correspond to variables and edges correspond to direct probabilistic interactions. As in BNs, the parameterization of a MN defines local interactions. We combine local models by multiplying them, and convert the result to a legal distribution by performing a normalization.

Affinities between variables can be captured using three alternative parameterizations: (i) the MN as a product of potentials on cliques, which is good for discussing independence queries, (ii) the factor graph, a product of factors that describes a Gibbs distribution, which is useful for inference, and (iii) the log-linear model with features, a product of features that describe all entries in each factor, which is useful both for hand-coded models and for learning.


Gibbs Parameterization The first approach is to associate with each set of nodes a general-purpose function called a factor: a function φ from Val(D) to R, where D is a subset of random variables. A factor captures compatibility between variables in its scope and is similar to a CPD: for each combination there is a value. With attention restricted to non-negative factors, from Val(A, B) to R+, the value associated with a particular assignment (a, b) indicates the affinity between the two values, with a higher value indicating higher compatibility. A Gibbs distribution generalizes the idea of a factor product.

A distribution P is a Gibbs distribution parameterized by a set of factors F = {φ1(D1), ..., φk(Dk)} if it is defined as P(w) = (1/Z) P̃(w), where P̃(w) = ∏i=1..k φi(Di), Di ⊆ w, is an unnormalized measure and Z = Σw P̃(w) is known as the partition function.

A Gibbs distribution factorizes over a Markov network G if each Di is a complete subgraph (clique) of G. The Hammersley-Clifford theorem goes from the independence properties of a distribution to its factorization: if P is a positive probability distribution, i.e., all probabilities are greater than zero, that satisfies the independencies implied by G, then P factorizes according to G. As with BNs, G is an I-map of P.
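The Gibbs parameterization amounts to a factor product followed by normalization. A minimal sketch follows; the two pairwise factors on the cliques {A, B} and {B, C} and their affinity values are hypothetical.

```python
# Minimal sketch (hypothetical factors): a Gibbs distribution over three binary
# variables, P(w) = (1/Z) * prod_i phi_i(D_i), with Z the partition function.
from itertools import product

phi_ab = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}   # clique {A, B}
phi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}   # clique {B, C}

def unnormalized(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]

# Partition function: sum of the unnormalized measure over all assignments.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

def gibbs_prob(a, b, c):
    return unnormalized(a, b, c) / Z

# Note: phi_ab is NOT the marginal P(A, B); it is only one contribution
# to the joint, which also depends on phi_bc and the normalization.
print(Z, gibbs_prob(1, 1, 1))
```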

Factors that parameterize the network are called clique potentials. The number of parameters is reduced by allowing factors only for maximal cliques, but this obscures the structure present. Factors do not represent marginal probabilities of the variables within their scope. A factor is only one contribution to the overall joint distribution. The distribution as a whole has to take into consideration contributions from all the factors involved.

The subclass of MNs where interactions are only pairwise is commonly encountered, e.g., Ising models and Boltzmann machines are popular in computer vision. Here all factors are over single variables φ(Xj), called node potentials, or over pairs of variables φ(Xj, Xk), called edge potentials. Although simple, they pose a challenge for inference algorithms.

Factor Graphs The graph structure of a MN does not reveal all the structure in a Gibbs parameterization, e.g., we cannot tell whether factors involve maximal cliques or their subsets. Factor graphs are undirected graphs that make the decomposition P(w) = ∏i φi(Di) explicit by using two types of nodes: variable nodes Xj, denoted as ovals, and factor nodes φi, denoted as squares (see Fig. 2c). They contain edges only between variable nodes and factor nodes. They are bipartite, since there are two types of nodes with all links going between nodes of opposite type, and are representable as two rows of nodes: variables on top and factor nodes at the bottom. Other intuitive representations are used when derived from directed or undirected graphs. The steps in converting a distribution expressed as an undirected graph are as follows: create variable nodes corresponding to the nodes of the original graph, create factor nodes for the maximal cliques Di, and set the factors φi(Di) equal to the clique potentials. Several different factor graphs are possible for the same distribution or graph. A directed graph can also be converted to a factor graph, where variable nodes correspond to the nodes of the directed graph and factor nodes correspond to the conditional distributions.

Log-Linear Models While a factor graph makes the structure of the parameterization explicit, each factor is a complete table over its scope. We may wish to explicitly represent context-specific structure involving particular values of the variables (as in BNs). Such patterns are more readily seen in log-space, by taking the negative natural logarithm of each potential.

If D is a set of random variables and φ(D) is a factor (consisting of values assigned to instances of D), then ε(D) = −ln φ(D), and thus φ(D) = exp(−ε(D)). This has an analogy in statistical physics, where the probability of a physical state depends inversely on its energy ε(D), i.e., higher energy states have lower probability. In this representation P(w) ∝ exp(−Σi=1..k εi(Di)). Logarithms of cell frequencies are referred to as log-linear in statistics, and the logarithmic representation ensures that the probability distribution is positive. Any MN parameterized using positive factors can be converted into a logarithmic representation.


If D is a subset of variables, a feature f(D) is a function from Val(D) to R (a real value). A feature is a factor without the non-negativity requirement.

The following log-linear model is a representation of a joint probability distribution over assignments to w:

P(w : θ) = (1/Z(θ)) exp( Σi=1..k θi fi(Di) )    (1)

where fi(Di) is a feature function defined over the variables Di ⊆ w, the set of all feature functions is denoted F = {fi(Di)}, i = 1, ..., k, k is the number of features in the model, θ = {θi : fi ∈ F} is a set of feature weights, Z(θ) = Σx exp( Σi=1..k θi fi(x) ) is a partition function, and fi(x) is shortened notation for fi(x<Di>) with a given assignment to the set of variables Di.
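Eq. (1) can be evaluated by exponentiating a weighted sum of features. The sketch below is illustrative only: the indicator features and the weight values are hypothetical.

```python
# Minimal sketch (made-up features and weights): evaluating the log-linear
# model of Eq. (1), P(w : theta) = (1/Z(theta)) exp(sum_i theta_i f_i(D_i)).
import math
from itertools import product

# Indicator features over pairs of binary variables (A, B, C); hypothetical.
features = [
    lambda a, b, c: 1.0 if a == b else 0.0,   # f_1 on D_1 = {A, B}
    lambda a, b, c: 1.0 if b == c else 0.0,   # f_2 on D_2 = {B, C}
]
theta = [1.5, 0.7]                            # feature weights

def weighted_sum(a, b, c):
    return sum(t * f(a, b, c) for t, f in zip(theta, features))

# Partition function Z(theta): sum of exp(weighted sum) over all assignments.
Z = sum(math.exp(weighted_sum(a, b, c)) for a, b, c in product([0, 1], repeat=3))

def log_linear_prob(a, b, c):
    return math.exp(weighted_sum(a, b, c)) / Z

print(log_linear_prob(1, 1, 1))
```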

Independencies As in BNs, the graph structure of a MN encodes a set of independence assumptions. In a MN, probabilistic influence flows along the undirected paths in the graph and is blocked if we condition on intervening nodes.

There are three types of independencies in MNs. Two are local independencies: (i) pairwise independence Ip(H), the weakest type: whenever two variables are directly connected, they have the potential of being correlated in a way not mediated by other variables, whereas variables that are not directly connected are conditionally independent given all the other variables; and (ii) Markov blanket independence Il(H): when two variables are not directly linked, there must be a way of rendering them conditionally independent; this is analogous to local independencies in BNs. We can block all influences on a node by conditioning on its immediate neighbors, so a node is conditionally independent of all other nodes given its immediate neighbors. The third type is the global independence described next. For positive distributions (those with nonzero probabilities for all instantiations), all three are equivalent.

To determine global independence I(H), identify three sets of nodes A, B, and C. To test whether the conditional independence property A ⊥ B | C holds, consider all possible paths from nodes in A to nodes in B. If all such paths pass through one or more nodes in C, then the paths are blocked and the independence holds.

MNs thus have a simple definition of independence: two sets of nodes A and B are conditionally independent given a third set C if every path connecting nodes in A and B passes through nodes in C. BN independence is more complex: it involves the direction of the arcs (d-separation). It is convenient to convert both to a factor graph representation.

In going from distributions to graphs, the questions that arise are as follows: How do we encode the independencies in a given distribution P in a graph structure H? What sort of independencies do we consider: global or local? Are we looking for an I-map, a minimal I-map, or a perfect map?

Partially Directed Graphs

BNs with directed edges and MNs with undirected edges are both useful in different application scenarios. It is possible to convert BNs to MNs using moralization, in which edges are introduced between parents that share a child. Converting a MN into a BN introduces much higher network complexity. It is possible to unify both representations by incorporating both directed and undirected dependencies in the same PGM.

Conditional random fields (CRFs) are MNs with a directed dependency on some subset of variables. While a MN encodes a joint distribution over X, the same undirected graph can be used to represent a conditional distribution P(Y | X), where Y is a set of target variables and X is a set of observed variables. It has an analog in directed graphical models, viz., conditional BNs.

CRF nodes correspond to Y ∪ X and are parameterized as in ordinary MNs. A CRF can be encoded as a log-linear model with a set of factors φ1(D1), ..., φm(Dm). Instead of P(Y, X), the graph is viewed as representing P(Y | X). To naturally represent a conditional distribution, we avoid representing a probabilistic model over X and disallow potentials involving only variables in X.

To derive the CRF definition, from Bayes' rule we have P(Y | X) = P(Y, X) / P(X). The numerator, in terms of the Gibbs definition of a MN, is P(Y, X) = (1/Z(Y, X)) P̃(Y, X), where P̃(Y, X) = ∏i=1..n φi(Di) and Z(Y, X) = ΣY,X P̃(Y, X). The denominator is P(X) = ΣY P(Y, X) = (1/Z(Y, X)) ΣY P̃(Y, X). Substituting back we get

P(Y | X) = (1/Z(X)) P̃(Y, X),    (2)

where Z(X) = ΣY P̃(Y, X) is the partition function, which is a function of X. Whereas a Gibbs distribution factorizes into factors and a single partition function Z, a CRF has a different value of the partition function for every assignment x to X.
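The per-observation normalization of Eq. (2) is the defining computational step of a CRF. A minimal sketch follows; the two factors and their values are hypothetical, and no factor involves X alone.

```python
# Minimal sketch (hypothetical factors): a CRF renormalizes per observation x,
# P(Y | x) = (1/Z(x)) * Ptilde(Y, x), so Z(x) differs for every assignment of X.

# One factor linking the binary target Y to a binary observed variable X,
# and one factor on Y alone; no potential involves only X.
phi_yx = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
phi_y = {0: 1.0, 1: 2.0}

def unnormalized(y, x):
    return phi_yx[(y, x)] * phi_y[y]

def crf_conditional(x):
    z_x = sum(unnormalized(y, x) for y in (0, 1))     # partition function Z(x)
    return {y: unnormalized(y, x) / z_x for y in (0, 1)}

for x in (0, 1):
    print(x, crf_conditional(x))                      # a different Z(x) per x
```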

Inference

Inference is the process of answering queries from the probabilistic model. Three types of queries are common:

1. Probability query: given values of some variables, give the distribution of another variable. This is the most common type of query. The query has two parts:
• Evidence E, a subset of variables and their instantiation e.
• Query variables, a subset Y of random variables in the network.
The inference task, which is to determine P(Y | E = e), the posterior probability distribution over values y of Y conditioned on the fact E = e, can be viewed as marginal probability estimation over Y in the distribution obtained by conditioning on e: P(Y | E = e) = Σw−Y P(w | E = e).
2. MAP (maximum a posteriori probability) query: what is the most likely setting of the variables. Also called MPE (most probable explanation). It asks for the most likely assignment to all nonevidence variables W = w − E, and MAP(W | e) = argmaxw P(w, e) is the value w for which P(W, e) is maximum. Instead of a probability we get the most likely value for all remaining variables.
3. Marginal MAP query: used when some variables are known. The query does not concern all remaining variables but a subset of them. Given evidence E = e, the task is to find the most likely assignment to a subset of variables Y: MAP(Y | e) = argmaxy P(y | e). If Z = w − Y − E then MAP(Y | e) = argmaxy Σz P(Y, Z | e). Inference of a marginal MAP query is more complex than MAP since it contains both summations (as in probability queries) and maximizations (as in MAP queries). Also, due to the lack of MAP monotonicity, i.e., the most likely assignment MAP(Y1 | e) might be completely different from the assignment to Y1 in MAP({Y1, Y2} | e), we cannot use a MAP query to give a correct answer to a marginal MAP query.
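The first two query types can be illustrated by brute-force enumeration over a small joint distribution. The sketch below uses a made-up joint table over three binary variables (query variable Y, evidence variable E, latent variable H); it is only meant to show what conditioning, marginalizing, and maximizing compute.

```python
# Minimal sketch (toy joint table): a probability query P(Y | E = e) and a
# MAP query over the non-evidence variables, answered by enumeration.
from itertools import product

joint = {(0, 0, 0): .10, (0, 0, 1): .05, (0, 1, 0): .15, (0, 1, 1): .10,
         (1, 0, 0): .05, (1, 0, 1): .10, (1, 1, 0): .20, (1, 1, 1): .25}

def prob_query(e_val):
    """P(Y | E = e): sum out the latent variable H, then renormalize."""
    unnorm = {y: sum(joint[(y, e_val, h)] for h in (0, 1)) for y in (0, 1)}
    z = sum(unnorm.values())
    return {y: p / z for y, p in unnorm.items()}

def map_query(e_val):
    """MAP: most likely joint assignment of all non-evidence variables (Y, H)."""
    return max(product((0, 1), repeat=2),
               key=lambda yh: joint[(yh[0], e_val, yh[1])])

print(prob_query(1), map_query(1))
```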

The probability of evidence E = e can be determined from a BN, in principle, as follows: P(E = e) = Σw−E ∏i=1..k P(Xi | pa(Xi)) |E=e. This is an intractable problem, one that is #P-complete. It is tractable when the tree-width is less than about 25, but most real-world applications have higher tree-width, where the tree-width is determined by the number of variables in the largest clique. Approximations are usually sufficient (hence sampling), e.g., when P(Y = y | E = e) = 0.29292, an approximation yielding 0.3 suffices.

Inference can be performed using either exact or approximate algorithms. Exact algorithms are expressed as passing messages around the graph. Approximate methods, which become necessary when there are a large number of latent variables, include variational methods and particle (sampling-based) methods.

Exact Inference

Consider graphs consisting of chains of random variables, also known as Markov chains, e.g., N = 365 days and Xi is the weather (cloudy, rainy, snow) on a particular day i. In this case directed and undirected graphs are exactly the same since there is only one parent per node (no additional links needed). The joint distribution has the form p(w) = (1/Z) ψ1,2(X1, X2) ψ2,3(X2, X3) ... ψN−1,N(XN−1, XN). We wish to evaluate the marginal distribution p(Xn) for a specific node part way along the chain, e.g., what is the weather on November 11?

As yet there are no observed nodes. The required marginal is obtained by summing the joint distribution over all variables except Xn: p(Xn) = ΣX1 ... ΣXn−1 ΣXn+1 ... ΣXN p(w). This is referred to as the sum-product inference task. In the specific case of N discrete variables with K states each, the potential functions are K × K tables, the joint distribution has (N − 1)K² parameters, and there are K^N values for w. Evaluation of both the joint and the marginal is exponential in the length N of the chain (which makes it infeasible for, say, K = 10 and N = 365).

Efficient evaluation involves exploiting conditional independence properties. The key concept used is that multiplication is distributive over addition, i.e., ab + ac = a(b + c), where the left-hand side involves three arithmetic operations while the right-hand side involves only two. Using this idea, we rearrange the order of summations and multiplications to allow the marginal to be evaluated more efficiently. Consider the summation over XN. The potential ψN−1,N(XN−1, XN) is the only one that depends on XN, so we can perform ΣXN ψN−1,N(XN−1, XN) to give a function of XN−1. We then use this to perform the summation over XN−1. Each summation removes a variable from the distribution, or equivalently removes a node from the graph. The total cost is O(NK²), which is linear in the chain length versus the exponential cost of the naive approach. Thus we are able to exploit the many conditional independence properties of this simple graph. This calculation is viewed as message passing in the graph. The key insight is that the factorization of the distribution allows performing local operations on the factors rather than generating the entire distribution. It is implemented using the variable elimination algorithm, which sums out variables one at a time, multiplying the factors necessary for that operation.
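The chain computation above reduces to repeated matrix-vector products. The following sketch uses randomly generated pairwise potentials (hypothetical values) and cross-checks the O(NK²) message-passing answer against brute-force summation on a small chain.

```python
# Minimal sketch (assumed random potentials): computing p(X_n) on a chain MN by
# variable elimination / message passing in O(N K^2) instead of O(K^N).
from itertools import product
import numpy as np

K, N = 3, 6                          # K states per variable, chain of length N
rng = np.random.default_rng(0)
# One K x K potential psi_{i,i+1} per edge of the chain.
psis = [rng.uniform(0.5, 2.0, size=(K, K)) for _ in range(N - 1)]

def chain_marginal(n):
    """p(X_n) via forward and backward messages (n is 1-indexed)."""
    fwd = np.ones(K)                 # eliminates X_1, ..., X_{n-1}
    for psi in psis[:n - 1]:
        fwd = fwd @ psi              # sum over the variable being removed
    bwd = np.ones(K)                 # eliminates X_N, ..., X_{n+1}
    for psi in reversed(psis[n - 1:]):
        bwd = psi @ bwd
    marg = fwd * bwd
    return marg / marg.sum()

# Cross-check against brute-force summation over all K^N assignments.
brute = np.zeros(K)
for x in product(range(K), repeat=N):
    brute[x[2]] += np.prod([psis[i][x[i], x[i + 1]] for i in range(N - 1)])
print(chain_marginal(3))
print(brute / brute.sum())           # should match p(X_3)
```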

The sum-product algorithm evaluates an expression for marginal probabilities expressed in the form Σw≠Xn ∏i φi. Variable elimination can also be used for evaluating the setting of the variables with the largest probability, an inference problem which takes the form argmaxw ∏i φi. This is known as the max-sum algorithm, which can be viewed as an application of dynamic programming to PGMs.

The sum-product and max-sum algorithms provide efficient and exact solutions for tree-structured graphs. For many applications we have to deal with graphs having loops. An alternative implementation based on the same variable elimination insight uses a more global data structure for scheduling the operations. It is based on the idea of clique trees. If the starting point is a directed graph, it is first converted to an undirected graph by moralization. Next the graph is triangulated by finding chord-less cycles containing four or more nodes and adding extra links to eliminate such chord-less cycles. Next the triangulated graph is used to construct the clique tree, whose nodes correspond to maximal cliques. A clique tree maps a graph into a tree by introducing a node for each clique in the graph, where the maximum clique size is known as the tree-width. Finally a two-stage algorithm essentially equivalent to the sum-product algorithm is applied. However, exact inference is exponential in space and time complexity in the tree-width.

Approximate Inference

Exact inference is often intractable, commonly due to interactions between latent variables. By regarding inference as an optimization problem, approximate inference algorithms can be derived by approximating the optimization. We construct an approximation to the target factorized distribution PF(w) = (1/Z) ∏i φi(Di) that allows simpler inference. It involves searching through a class of "easy" distributions to find an instance Q that best approximates PF, e.g., one that minimizes the Kullback-Leibler divergence (also known as relative entropy): D(Q || PF) = EQ[ln(Q/PF)]. This is equivalent to maximizing the energy functional F(P̃F, Q) = Σi EQ[ln φi] + HQ(w). It is also known as the evidence lower bound (ELBO) since it has at most the same value as the desired log-probability. It has two terms, the first of which is known as the energy term and the second of which is the entropy of Q. Assuming that inference is easy in Q, the expectations in the energy term should be relatively easy to evaluate, and the entropy term depends on the choice of Q. Queries can then be answered using Q instead of PF.
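The equivalence between minimizing the KL divergence and maximizing the energy functional can be checked numerically on a tiny discrete example. The sketch below uses a made-up unnormalized target and a made-up candidate Q; it only verifies the identity ELBO = ln Z − KL(Q || P).

```python
# Minimal sketch (toy discrete case): the energy functional / ELBO for an
# unnormalized target Ptilde(w) and an approximating distribution Q.
import math

p_tilde = {0: 3.0, 1: 1.0, 2: 2.0}            # unnormalized target measure
Z = sum(p_tilde.values())
p = {w: v / Z for w, v in p_tilde.items()}    # normalized target P_F

q = {0: 0.5, 1: 0.2, 2: 0.3}                  # a candidate "easy" distribution

energy = sum(q[w] * math.log(p_tilde[w]) for w in q)    # E_Q[ln Ptilde]
entropy = -sum(q[w] * math.log(q[w]) for w in q)        # H(Q)
elbo = energy + entropy

kl = sum(q[w] * math.log(q[w] / p[w]) for w in q)       # D(Q || P_F)
print(elbo, math.log(Z) - kl)                 # the two quantities agree
```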

Principal among the methods that approach inference as optimization are (i) variational methods, which are deterministic, and (ii) particle-based approximation, which uses stochastic numerical sampling from distributions.

Variational Methods The core idea of variational methods is to maximize the energy functional, or ELBO, over a family of functions Q. The family is chosen so that it is easy to compute EQ.

In the mean field approach to variational inference, Q is assumed to factorize into independent distributions qi, i.e., Q = ∏i qi. In the structured variational approach we impose some PGM structure on Q. Specifying the factorization is handled differently in the discrete and continuous cases. In the discrete case we use traditional optimization techniques to optimize a finite number of variables describing the Q distributions. In the continuous case we use the calculus of variations over a space of functions to determine which function should be used to represent Q. Although the calculus of variations is not used in the discrete case, the approach is still referred to as variational. Fortunately, it is not necessary for practitioners to solve calculus of variations problems; instead there is a general equation for mean-field fixed point updates.

Particle-Based Approximate Inference Particle-based inference methods approximate the joint distribution as a set of instantiations, called particles. The particles can be full, involving complete assignments to all the network variables w, or collapsed, specifying an assignment only to a subset of the variables. If we have samples x[1], ..., x[M], we can estimate the expectation of a function f relative to P by EP[f] ≈ (1/M) Σm=1..M f(x[m]).

The simplest method is forward sampling. It involves sampling the nodes of a BN in an order such that by the time we sample a node we have values for all of its parents. The estimation task is significantly harder when the values of some variables Z = z are observed. The obvious approach of rejecting samples that are not compatible with the evidence is infeasible, since the expected number of unrejected samples is small, particularly for the small probabilities encountered with BNs.

In importance sampling a factor w[m] = P(x[m]) / Q(x[m]) is used as a correction weight to the term f(x[m]) in computing EP[f]. The proposal distribution Q is a mutilated version of the BN of P in which each node Zi ∈ Z has no parents, and the rest of the nodes have unchanged parents and CPDs.

Gibbs sampling is a Markov chain Monte Carlo method that generates successive samples by fixing the values of all variables to the previous sample and generating the value of a new variable using its conditional distribution. Unlike forward sampling, Gibbs sampling applies equally well to BNs and MNs.
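A Gibbs sampler for a small pairwise MN can be written in a few lines. The sketch below reuses the hypothetical chain A - B - C with made-up edge potentials and compares a Monte Carlo estimate against the exact Gibbs-distribution probability.

```python
# Minimal sketch (toy pairwise MN): Gibbs sampling over three binary variables
# A - B - C; each step resamples one variable from its conditional given the rest.
import random
from itertools import product

phi_ab = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}
phi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

def unnorm(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]

def gibbs_sample(num_samples, burn_in=500, seed=0):
    random.seed(seed)
    a, b, c = 0, 0, 0
    samples = []
    for t in range(burn_in + num_samples):
        # Resample each variable in turn from P(X_i | rest) ~ unnorm().
        w = [unnorm(v, b, c) for v in (0, 1)]
        a = 1 if random.random() < w[1] / (w[0] + w[1]) else 0
        w = [unnorm(a, v, c) for v in (0, 1)]
        b = 1 if random.random() < w[1] / (w[0] + w[1]) else 0
        w = [unnorm(a, b, v) for v in (0, 1)]
        c = 1 if random.random() < w[1] / (w[0] + w[1]) else 0
        if t >= burn_in:
            samples.append((a, b, c))
    return samples

samples = gibbs_sample(20000)
est = sum(1 for s in samples if s == (1, 1, 1)) / len(samples)
Z = sum(unnorm(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(est, unnorm(1, 1, 1) / Z)   # Monte Carlo estimate vs. exact probability
```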

Learning

A PGM consists of a graphical structure and parameters. There are two approaches to constructing a model: (i) knowledge engineering: construct a network by hand with experts' help, and (ii) machine learning: learn the model from a set of instances. Hand-constructed PGMs have many limitations: the time taken to construct them varies from hours to months, expert time can be costly or unavailable, the data may change over time, the data may be huge, and errors may lead to poor answers.

Since inferring PGMs is an intractable problem, i.e., NP-hard, it is necessary to develop scalable approximate learning methods. Existing methods for structure learning are either score-based or constraint-based approaches. Most existing solutions are applicable only to pairwise interactions, and their generalization to arbitrary-size groupings of variables is needed.

In most applications of PGMs, the graphical structures are assumed to be either known or designed by human experts, thereby limiting the machine learning problem to one of parameter estimation. Structure learning is a model selection problem which requires defining a set of possible structures and a measure to score each structure. Learning as optimization is the predominant approach, with a hypothesis space consisting of the set of candidate models and an objective function which is a criterion for quantifying preference over models. The learning task is to find a high-scoring model within the model class. Different choices of objective function have ramifications for the results of learning. The hypothesis space is super-exponential (2^O(n²)), with the situation worse for MNs since cliques can be of size greater than two.


Parameter Estimation

Parameter estimation is a building block for more advanced PGM learning: structure learning and learning from incomplete data. The data set consists of fully observed instances of the network variables D = {x[1], ..., x[M]}.

Bayesian Networks In the case of a fixed BN the parameter estimation problem is decomposed into a set of unrelated problems. Two main approaches to determine the CPDs are maximum likelihood estimation and Bayesian parameter estimation. In the maximum likelihood approach, the likelihood function is the probability that the model assigns to the training data. For example, in the multinomial case, where a variable X can take values x1, ..., xK, the likelihood function has the form L(θ : D) = ∏k θk^M[k], where M[k] is the number of times the value xk appears among the M samples, and the maximum likelihood estimate is θk = M[k]/M.

The Bayesian approach becomes useful when the number of samples is limited. We begin with a prior distribution for the parameters and convert it to a posterior distribution based on the likelihood of the observed samples. For CPTs with multi-valued discrete variables a Dirichlet prior is useful, since it is conjugate to the multinomial distribution. It has the form P(θ) = Dirichlet(α1, ..., αK), where the αk are hyper-parameters and α = Σk αk, with E[θk] = αk/α. The posterior has the form Dirichlet(α1 + M[1], ..., αK + M[K]). The hyper-parameters play the role of virtual samples that avoid zero probabilities due to lack of samples.
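Both estimators reduce to simple counting. The sketch below uses hypothetical counts M[k] and a uniform Dirichlet prior to contrast the maximum likelihood estimate with the posterior mean.

```python
# Minimal sketch (toy counts): maximum likelihood vs. Bayesian (Dirichlet)
# estimation of a multinomial CPD from counts M[k] over K values.
counts = [6, 3, 1]                     # hypothetical M[k] for K = 3 values
M = sum(counts)

# Maximum likelihood: theta_k = M[k] / M  (zero if a value never appears).
theta_ml = [m_k / M for m_k in counts]

# Bayesian: Dirichlet(alpha_1, ..., alpha_K) prior; the posterior is
# Dirichlet(alpha_k + M[k]), whose mean treats the alphas as virtual samples.
alphas = [1.0, 1.0, 1.0]
theta_bayes = [(a + m_k) / (sum(alphas) + M) for a, m_k in zip(alphas, counts)]

print(theta_ml)     # [0.6, 0.3, 0.1]
print(theta_bayes)  # smoothed toward uniform, no zero probabilities
```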

Markov Networks For MNs, the global partition function induces entanglement of the parameters. The problem is stated as one of determining the set of parameters θ from D when the features F are known. For fixed-structure problems the optimization problem is convex, i.e., (i) a local minimum is the global minimum, (ii) the set of all (global) minima is convex, and (iii) if strict convexity holds the minimum is unique. Thus it is possible to use iterative numerical optimization, but each step requires inference, which is expensive.

Structure Learning

Bayesian Networks BN structure learning methods rely on three measures: (i) deviance from independence between variables, (ii) a decision rule that defines a threshold for the deviance measure to determine whether the hypothesis of independence holds, and (iii) a score for the structure.

Deviance from independence between a pair of variables is provided by the chi-squared test of independence. Pearson's chi-squared statistic between variables Xi, Xj given a data set D of M samples is

dχ²(D) = Σxi,xj ( M[xi, xj] − M · P(xi) · P(xj) )² / ( M · P(xi) · P(xj) )    (3)

When the variables are independent, dχ²(D) = 0. It has a larger value when the joint count M[Xi, Xj] and the expected count (under the independence assumption) differ. Another deviance measure is the mutual information (which is equivalent to the Kullback-Leibler distance) between the joint distribution and the product of the marginals:

dI(D) = (1/M) Σxi,xj M[xi, xj] log ( M · M[xi, xj] / ( M[xi] · M[xj] ) )    (4)

When the variables are independent dI(D) = 0, and it takes a larger value otherwise.

A decision rule accepts the hypothesis that the variables are independent if the deviance measure is less than a threshold and rejects the hypothesis otherwise. The threshold is chosen such that the false rejection probability has a given value, say 0.05 (called the p-value).
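Both deviance measures are computed directly from pairwise counts. The sketch below uses a hypothetical contingency table for two binary variables and evaluates Eqs. (3) and (4) with the empirical marginals as the independence baseline.

```python
# Minimal sketch (toy contingency table): the two deviance-from-independence
# measures of Eqs. (3) and (4) computed from pairwise counts M[x_i, x_j].
import math

M_joint = {(0, 0): 30, (0, 1): 10, (1, 0): 10, (1, 1): 50}   # hypothetical counts
M = sum(M_joint.values())
M_i = {v: sum(c for (a, _), c in M_joint.items() if a == v) for v in (0, 1)}
M_j = {v: sum(c for (_, b), c in M_joint.items() if b == v) for v in (0, 1)}

chi2 = 0.0
mi = 0.0
for (a, b), count in M_joint.items():
    expected = M * (M_i[a] / M) * (M_j[b] / M)      # count under independence
    chi2 += (count - expected) ** 2 / expected      # Eq. (3)
    if count > 0:
        mi += (count / M) * math.log(M * count / (M_i[a] * M_j[b]))  # Eq. (4)

print(chi2, mi)   # both are 0 when the variables are exactly independent
```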

Examples of structure scores over a data set are the log-likelihood score scoreL(G : D) = ℓ(θG : D) = Σx∈D Σi log P(xi | pa(Xi)), where θG are the maximum likelihood parameters, and the Bayesian information criterion (BIC), which penalizes more complex structures: scoreBIC(G : D) = ℓ(θG : D) − (log M / 2) Dim(G), where Dim(G) is the number of independent parameters in G.

Approaches to BN structure learning are constraint-based, score-based, and Bayesian model averaging. In constraint-based learning the BN is viewed as a representation of independencies, but this approach is sensitive to failures of individual independence tests, i.e., if one test returns a wrong answer it misleads the network construction procedure. In score-based learning, the BN is viewed as specifying a statistical model where each structure is given a score, with optimization used to find the highest-scoring structure; but the search may not have an elegant and efficient solution. Bayesian model averaging generates an ensemble of possible structures and averages the predictions of all possible structures; due to the immense number of structures, approximations are needed.

Markov Networks The problem is to identify the MN structure with a bounded complexity which most accurately represents a given probability distribution, based on a set of samples from the distribution. MN complexity is the number of features in the log-linear representation of the MN. This problem, which is NP-hard, has several suboptimal solutions which may be characterized as either constraint-based or score-based.

In the constraint-based approach, conditional independences of variables are tested on a given data set. A simple algorithm for structure learning is to determine the empirical mutual information between all pairs of variables and to keep only those edges whose values exceed a threshold. Since the constraint-based approach lacks noise robustness, requires many samples, and only considers pairwise dependencies, the score-based approach is considered next.

The score-based approach computes a score for a given model structure, e.g., the log-likelihood with the maximum likelihood parameters. One such score is ℓ(F, θ : D) − ||θ||₁, where the second term is an L1 regularizer to prevent over-fitting. The goal is to determine the set of features as well as the parameters. A search algorithm can then be used to obtain the MN structure with the optimal score. The greedy algorithm starts from the MN without any features (the model where all variables are disjoint). Features are then introduced into the MN one by one. At each iteration, a feature is selected that brings the maximum increase in the objective function value. The search can be sped up by limiting the number of candidate features allowed to enter the MN, e.g., to features whose empirical probability differs most from their expected value with respect to the current MN.

Key Applications

PGMs have been widely used in several fields for modeling and prediction, e.g., text analytics, image restoration, and computational biology. They are a natural tool for handling uncertainty and complexity, which occur throughout applied mathematics and engineering.

PGMs can account for model uncertainty and measurement noise, and can integrate diverse sources of data. PGMs can be used to predict the probability of observed and unobserved relationships in a network. Fundamental to the idea of a graphical model is the notion of modularity, where a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent and providing ways to interface models to data. The graph-theoretic side provides an intuitively appealing interface by which humans can model highly interacting sets of variables. The resulting data structure lends itself naturally to designing efficient general-purpose algorithms. PGMs provide the view that classical multivariate probabilistic systems are instances of a common underlying formalism, e.g., mixture models, factor analysis, hidden Markov models, Kalman filters, and Ising models. PGMs are encountered in systems engineering, information theory, pattern recognition, and statistical mechanics. Other benefits of the PGM view are that specialized techniques in one field can be transferred between communities and exploited, and that PGMs provide a natural framework for designing new systems.


Visualization

PGMs are useful for visualizing the structure of probabilistic models. Joint distributions can be factored into conditional distributions using the product rule and expressed as BNs.

Generative Models

PGMs can be used to generate samples, e.g., ancestral sampling is a systematic way of generating samples from BNs. They can be used as generative models for data, thereby circumventing often stringent privacy regulations.

Genetic Inheritance

BNs can be used to model genetic inheritance. Consider the transmission of certain properties such as blood type from parent to child. The blood type of a child B(c) is an observable quantity, called a phenotype, that depends on the genetic makeup G(c) of the person, called a genotype. There are three types of CPDs for genetic inheritance. The penetrance model P(B(c) | G(c)) describes the probabilities of different phenotypes given a person's genotype: it is deterministic for blood type. The transmission model is P(G(c) | G(p), G(m)), where c is a person and p and m are the person's father and mother, respectively: each parent is equally likely to transmit either of two alleles to the child. Genotype priors are P(G(c)). Real models are more complex. Phenotypes for late-onset diseases are not a deterministic function of genotype: a particular genotype may only have a higher probability of a disease. The genetic makeup of an individual is determined by many genes, some phenotypes depend on many genes, and multiple phenotypes may depend on the same genes. An example BN representing genetic inheritance of DNA is shown in Fig. 1.

Social Networks

On-line social networks (OSNs) provide a clear way of analyzing the structure of whole social entities (Wasserman and Faust 1994). An OSN can be viewed as a graph where the nodes represent individuals or organizations (actors) and the edges are dyadic ties that represent the dependencies between them (Fig. 2a). Actors have a set of attributes, e.g., gender, age, hobbies. OSNs usually have higher-order groupings of actors, e.g., travel lovers, sports clubs, and coffee lovers (Fig. 2b). They can be represented as MNs using a factor graph representation (Fig. 2c). The dependencies need not follow the OSN links. Nodes can also be links between actors taking values {0, 1} or [0, 1], with dependencies being connections to the same actors.

Some inference (predictive modeling) tasks with OSNs are: (i) Predictive Modeling: predict the strength of a given connection in the future given the current state of the network (structure, attributes) and its previous history, (ii) Group/Community Detection: reveal groups of users that are interconnected according to a given criterion, given the current network structure and attributes, (iii) Behavior Pattern Extraction: reveal hidden connections between attribute values of users or between some of their actions and attributes, given the current structure of the OSN, user attribute values, and the history of changes in the network, (iv) Actor Classification: label each user according to some criterion given the network structure and user attribute values (Fig. 3a), (v) Link Classification: label each link according to some criterion given the network structure, user attribute values, and history of changes (Fig. 3b), and (vi) Artificial Network Data Generation: construct a generative model to produce artificial data sets, resolving privacy issues that occur while working with real OSN data (Fig. 3c).

Future Directions

With the enormous amounts of data being generated from instruments, cameras, Internet transactions, email, genomics, etc., statistical inference with big heterogeneous data sets is becoming increasingly important. When the number of variables becomes large, the amount of data needed for exact statistical modeling becomes impractical. Heterogeneity of attributes describing complex relations gives rise to a number of unique statistical and computational challenges, e.g., the number of parameters needed to model the distributions becomes exponential and the parameter inference algorithms become intractable. This is where PGMs become useful, as they provide approximations of exact distributions.


Probabilistic Graphical Models, Fig. 1 Genetic inheritance based on DNA represented as a BN: (a) a family tree and (b) Bayesian network of genotypes and phenotypes


Probabilistic Graphical Models, Fig. 2 Statistical dependencies between variables in an OSN: (a) pairwise dependencies, (b) higher-order groupings in an affiliation network, and (c) a possible MN factor graph where filled circles represent actors with known gender and squares represent factors


An example of big data that can be naturally analyzed using PGMs is OSNs of kinship, email, affiliation groups, mobile communication devices, bibliographic citations, and business interactions.

Acknowledgments The author wishes to thank his teaching and research assistants for the PGM course (CSE 674 at the University at Buffalo), in particular Dmitry Kovalenko, Yingbo Zhao, Chang Su, and Yu Liu, for many discussions.

Cross-References

▶ Gibbs Sampling
▶ Markov Monte Carlo Model
▶ Models of Social Networks
▶ Probabilistic Analysis

References

Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, San Francisco

Wasserman S, Faust K (1994) Social network analysis in the social and behavioral sciences. In: Social network analysis: methods and applications. Cambridge University Press, Cambridge, p 127

Recommended Reading
Bishop C (2006) Pattern recognition and machine learning. Springer, New York; has a chapter on graphical models which provides a good introduction
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, Cambridge, MA; a detailed treatise on PGMs
Srihari S. Lecture slides and videos on machine learning and PGMs at http://www.cedar.buffalo.edu/~srihari/CSE574

Probabilistic Graphical Models, Fig. 3 Some statistical inference problems in OSN big data: (a) actor classification, (b) link classification, and (c) data generation
