Using Graphs to Describe Model Structure
Topics in Structured PGMs for Deep Learning
0. Overview
1. Challenge of Unstructured Modeling
2. Using graphs to describe model structure
3. Sampling from graphical models
4. Advantages of structured modeling
5. Learning about Dependencies
6. Inference and Approximate Inference
7. The deep learning approach to structured probabilistic models
   – Example: The Restricted Boltzmann machine
Topics in Using Graphs to Describe Model Structure
1. Directed Models
2. Undirected Models
3. The Partition Function
4. Energy-based Models
5. Separation and D-separation
6. Converting between Undirected and Directed Graphs
7. Factor Graphs
Graphs to describe model structure
• Model structure is described using graphs:
  – Each node represents a random variable
  – Each edge represents a direct interaction
• These direct interactions imply other, indirect interactions
• But only direct interactions need be explicitly modeled
Types of graphical models
• There is more than one way to describe the interactions in a probability distribution using a graph
• Graphical models can be largely divided into two categories:
  – Models based on directed acyclic graphs
  – Models based on undirected graphs
1. Directed Models
• One type of structured probabilistic model is the directed graphical model
• Also known as a belief network or a Bayesian network
• The term Bayesian is used since the probabilities can be judgmental
  – They usually represent degrees of belief rather than frequencies of events
Example of Directed Graphical Model
• Relay race example
• Bob’s finishing time t1 depends on Alice’s finishing time t0, and Carol’s finishing time t2 depends on Bob’s finishing time t1
Meaning of directed edges
• Drawing an arrow from a to b means we define the conditional probability distribution (CPD) over b, with a appearing as one of the variables on the right side of the conditioning bar
  – i.e., the distribution over b depends on the value of a
Formal directed graphical model
• A directed graphical model defined on variables x is specified by a directed acyclic graph G
  – whose vertices are the random variables in the model
  – and a set of local CPDs p(xi | PaG(xi)), where PaG(xi) gives the parents of xi in G
• The probability distribution over x is given by
    p(x) = ∏i p(xi | PaG(xi))
• In the relay race example
  – p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1)
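To make the factorization concrete, here is a minimal Python sketch (not from the slides) that builds made-up CPD tables for the relay-race variables, using only 3 time bins instead of 100, and checks that the factorized joint sums to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cpd(shape):
    """Random nonnegative table normalized over its last axis (a valid CPD)."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_t0 = random_cpd((3,))             # p(t0)
p_t1_given_t0 = random_cpd((3, 3))  # p(t1 | t0), rows indexed by t0
p_t2_given_t1 = random_cpd((3, 3))  # p(t2 | t1), rows indexed by t1

def joint(t0, t1, t2):
    """p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1)."""
    return p_t0[t0] * p_t1_given_t0[t0, t1] * p_t2_given_t1[t1, t2]

# The factorized joint is a valid distribution: it sums to 1 over all 27 states.
total = sum(joint(a, b, c) for a in range(3) for b in range(3) for c in range(3))
print(round(total, 6))  # 1.0
```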
Savings achieved by directed model
• If t0, t1 and t2 are discrete variables with 100 possible values each, then a single table would require 999,999 values
  – By making tables for only the conditional probabilities we need only 18,999 values
• To model n discrete variables each having k values, the cost of a single table is O(k^n)
• If m is the maximum number of variables appearing on either side of the conditioning bar in a single CPD, then the cost of the tables for a directed PGM is O(k^m)
  – So long as each variable has few parents in the graph, the distribution can be represented with very few parameters
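A rough sketch of the table-size comparison. It counts full table entries rather than free parameters, so it illustrates the O(k^n) versus O(k^m) scaling rather than reproducing the slide's exact figures.

```python
# k values per variable, n variables; counts are full table entries, not free parameters.
k, n = 100, 3

full_joint = k ** n                # one entry per joint state of (t0, t1, t2): O(k^n)
factored = k + k * k + k * k       # tables for p(t0), p(t1 | t0), p(t2 | t1): O(k^m), m = 2

print(full_joint, factored)        # 1000000 vs 20100
```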
2. Undirected models
• Directed graphical models give us one language to describe structured probabilistic models
• Another language is that of undirected models
  – Synonyms: Markov Random Fields, Markov Nets
  – Use graphs whose edges are undirected
• Directed models are useful when there is clear directionality, often when we understand the causality and it flows in only one direction
• When interactions have no clear directionality, it is more appropriate to use an undirected graph
Models without clear direction
• When interactions have no clear direction, or operate in both directions, it is appropriate to use an undirected model
• An example with three binary variables is described next
Ex: Undirected Model for Health
• A model over three binary variables:
  – Whether or not you are sick, hy
  – Whether or not your coworker is sick, hc
  – Whether or not your roommate is sick, hr
• Assuming your coworker and roommate do not know each other, it is very unlikely that one of them will give a cold to the other
  – The event is so rare that we do not model it
• There is no clear directionality either
• This motivates using an undirected model
The health undirected graph
• You and your roommate may infect each other with a cold
• You and your work colleague may do the same
• Assuming your roommate and colleague do not know each other, they can only get infected through you
Undirected graph definition
• If two variables directly interact with each other, then their nodes are connected by an edge
• The edge has no arrow and is not associated with a CPD
• An undirected PGM is defined on a graph G
  – For each clique C in the graph, a factor ϕ(C), or clique potential, measures the affinity of the variables in C for being in each of their possible joint states
    • A clique is a subset of nodes that are all connected to each other
  – Together, the factors define an unnormalized distribution
    p̃(x) = ∏C∈G ϕ(C)
Efficiency of Unnormalized Distribution
• The unnormalized probability distribution is efficient to work with so long as the cliques are small
• It encodes that states with higher affinity ϕ(C) are more likely
• Since there is little structure to the definition of the cliques, there is no guarantee that multiplying them together will yield a valid probability distribution
Reading factorization information from an undirected graph
• This graph (with five cliques) implies that
    p(a, b, c, d, e, f) = (1/Z) ϕa,b(a,b) ϕb,c(b,c) ϕa,d(a,d) ϕb,e(b,e) ϕe,f(e,f)
  – for an appropriate choice of the ϕ functions
• An example of clique potentials is shown next
Ex: Clique potential
• One clique is between hy and hc
  – The factor for this clique can be defined by a table ϕ(hy, hc), where hy = health of you, hr = health of roommate, hc = health of colleague:
    • A state of 1 indicates good health, while a state of 0 indicates poor health
    • Both are usually healthy, so the corresponding state has the highest affinity
    • The state of only one being sick has the lowest affinity
    • The state of both being sick has a higher affinity than only one being sick
• A similar factor is needed for the other clique, between hy and hr
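A small sketch of this model in Python; the affinity values below are made up, chosen only to follow the ordering described above (both healthy highest, exactly one sick lowest, both sick in between).

```python
import numpy as np

# phi[h_self, h_other]: 1 = healthy, 0 = sick. Illustrative affinities only.
phi = np.array([[2.0, 1.0],    # both sick -> 2,  self sick / other healthy -> 1
                [1.0, 10.0]])  # self healthy / other sick -> 1, both healthy -> 10

def p_tilde(h_y, h_c, h_r):
    """Unnormalized probability: product of the two clique potentials."""
    return phi[h_y, h_c] * phi[h_y, h_r]

print(p_tilde(1, 1, 1), p_tilde(1, 0, 1))  # 100.0 (all healthy) vs 10.0 (coworker sick)
```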
3. The partition function
• The unnormalized probability distribution
    p̃(x) = ∏C∈G ϕ(C)
  – is guaranteed to be non-negative everywhere
  – is not guaranteed to sum or integrate to 1
• To obtain a valid probability distribution we must use the normalized (or Gibbs) distribution
    p(x) = (1/Z) p̃(x)
  – where Z causes the distribution to sum (or integrate) to one:
    Z = ∫ p̃(x) dx
• Z is a constant when the ϕ functions are held constant
  • If the ϕ functions have parameters, then Z is a function of those parameters, though it is commonly written without its arguments
  • Z is known as the partition function, a term borrowed from statistical physics
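As a brute-force illustration (not from the slides), the sketch below normalizes the five-clique model from the earlier factorization slide, with binary variables and arbitrary made-up potential tables.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
phis = {c: rng.random((2, 2)) + 0.1 for c in ["ab", "bc", "ad", "be", "ef"]}

def p_tilde(a, b, c, d, e, f):
    """Product of the five clique potentials read off the graph."""
    return (phis["ab"][a, b] * phis["bc"][b, c] * phis["ad"][a, d]
            * phis["be"][b, e] * phis["ef"][e, f])

states = list(itertools.product([0, 1], repeat=6))
Z = sum(p_tilde(*s) for s in states)                  # brute-force partition function
p = lambda *s: p_tilde(*s) / Z                        # normalized (Gibbs) distribution
print(round(sum(p(*s) for s in states), 6))           # 1.0
```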
Intractability of Z
• Since Z is an integral or sum over all possible joint states of x, it is often intractable to compute
• In order to compute a normalized probability from an undirected model:
  – The model structure and the definitions of the ϕ functions must be conducive to computing Z efficiently
  – In typical deep learning applications Z is intractable, and we must resort to approximations
Choice of factors
• When designing undirected models, it is important to know that for some choices of factors, Z does not exist!
1. If there is a single scalar variable x ∈ ℝ and we choose the single clique potential ϕ(x) = x², then
     Z = ∫ x² dx
   • This integral diverges, hence there is no probability distribution for this choice
2. The choice of a parameter of the ϕ functions can also determine whether the distribution exists
   – For ϕ(x; β) = exp(−βx²), the β parameter determines whether Z exists
     • Positive β defines a Gaussian distribution over x
     • Other values of β make ϕ impossible to normalize
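For reference, the standard Gaussian integral (a well-known fact, stated here for completeness rather than taken from the slides) makes the role of β explicit:

```latex
Z(\beta)=\int_{-\infty}^{\infty}\exp(-\beta x^{2})\,dx=
\begin{cases}
\sqrt{\pi/\beta}, & \beta>0 \quad\text{(a Gaussian over } x\text{)}\\
\text{divergent}, & \beta\le 0
\end{cases}
\qquad\text{while}\qquad
\int_{-\infty}^{\infty}x^{2}\,dx\ \text{diverges.}
```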
Key difference between BN & MN
• Directed models are:
  – defined directly in terms of probability distributions, from the start
• Undirected models are:
  – defined more loosely, in terms of ϕ functions that are then converted into probability distributions
• This changes the intuitions needed to work with these models
  – One key idea to keep in mind when working with MNs:
    • The domain of the variables has a dramatic effect on the kind of probability distribution a given set of ϕ functions corresponds to
  – We will see how we can define distributions for different domains
What distribution does an MN give?
• Consider an n-dimensional random variable x = {xi}, i = 1,..,n
• And an undirected model parameterized by a vector of biases b
• Suppose we have one clique for each xi: ϕ(i)(xi) = exp(bi xi)
  – (Figure: isolated nodes x1, .., xi, .., xn, one per variable)
  – The resulting unnormalized and normalized distributions are
    p̃(x) = ∏i ϕ(i)(xi) = exp(b1x1 + .. + bnxn),  p(x) = (1/Z) p̃(x),  Z = ∫ p̃(x) dx
• What kind of probability distribution is modeled?
• The answer is that we do not have enough information, because we have not specified the domain of x
  – Three example domains are:
    1. x ∈ ℝⁿ, an n-dimensional vector of real values
    2. x ∈ {0,1}ⁿ, an n-dimensional vector of binary values
    3. The domain of x is the set of elementary basis vectors {[1,0,..,0], [0,1,..,0], .., [0,0,..,1]}
Effect of domain of x on distribution
• We have n random variables, x = {xi}, i = 1,..,n
• For each xi: ϕ(i)(xi) = exp(bi xi), so that
    p̃(x) = ∏i ϕ(i)(xi) = exp(b1x1 + .. + bnxn),  p(x) = (1/Z) p̃(x),  Z = ∫ p̃(x) dx
• What kind of probability distribution is modeled?
  1. If x ∈ ℝⁿ, then Z diverges and no probability distribution exists
  2. If x ∈ {0,1}ⁿ, then p(x) factorizes into n independent distributions with p(xi = 1) = σ(bi), where σ(x) = 1/(1 + exp(−x)) = exp(x)/(1 + exp(x))
     – Each independent distribution is a Bernoulli with parameter σ(bi)
  3. If the domain of x is the set of basis vectors {[1,0,..,0], [0,1,..,0], .., [0,0,..,1]}, then p(x) = softmax(b)
     – So a large value of bi reduces p(xj = 1) for all j ≠ i, i.e., the variables behave like a single multiclass variable
• Often, by careful choice of the domain of x, we can obtain complicated behavior from a simple set of ϕ functions
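A brute-force check of the binary and one-hot cases (the real-valued case cannot be enumerated, since its Z diverges). The bias vector below is an arbitrary example; the snippet only verifies the claims above by direct enumeration.

```python
import itertools
import numpy as np

b = np.array([0.5, -1.0, 2.0])          # arbitrary example biases
p_tilde = lambda x: np.exp(b @ x)       # unnormalized probability exp(b . x)

# Case 2: x in {0,1}^n. Normalizing gives p(x_i = 1) = sigmoid(b_i) for each i.
states = [np.array(s) for s in itertools.product([0, 1], repeat=3)]
Z = sum(p_tilde(x) for x in states)
p_xi_is_1 = [sum(p_tilde(x) for x in states if x[i] == 1) / Z for i in range(3)]
print(np.allclose(p_xi_is_1, 1 / (1 + np.exp(-b))))        # True

# Case 3: x restricted to the one-hot basis vectors. Normalizing gives softmax(b).
onehots = [np.eye(3)[i] for i in range(3)]
Z1 = sum(p_tilde(x) for x in onehots)
p_onehot = np.array([p_tilde(x) / Z1 for x in onehots])
print(np.allclose(p_onehot, np.exp(b) / np.exp(b).sum()))  # True
```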
4. Energy-based Models (EBMs)
• Many interesting theoretical results about undirected models depend on the assumption that
    ∀x, p̃(x) > 0
• We can enforce this using an EBM, where
    p̃(x) = exp(−E(x))
  – E(x) is known as the energy function
  – Because exp(z) > 0 for all z, no energy function will result in a probability of zero for any x
    • If we were to learn the clique potentials in p̃(x) = ∏C∈G ϕ(C) directly, we would need to impose constraints to ensure a minimum probability value
    • By learning the energy function instead, we can use unconstrained optimization: probabilities can approach 0 but never reach it
Boltzmann Machine Terminology
• Any distribution of the form
    p̃(x) = exp(−E(x))
  – is referred to as a Boltzmann distribution
• For this reason, many energy-based models are referred to as Boltzmann machines
  – There is no consensus on when to call a model an energy-based model versus a Boltzmann Machine
• The term Boltzmann machine first referred only to models with binary variables
  – Today, mean-covariance restricted Boltzmann Machines also deal with real-valued variables
• Boltzmann Machines now usually refer to models with latent variables, while those without latent variables are referred to as MRFs or log-linear models
Cliques, factors and energy
• Cliques in the undirected graph correspond to factors in the unnormalized probability function
  – Cliques in the undirected graph also correspond to different terms of an energy function
  – Because exp(a)exp(b) = exp(a+b), different cliques in the undirected graph correspond to different terms of the energy function
    • i.e., an energy-based model is a special kind of Markov network
    • The exponentiation makes each term of the energy function correspond to a factor for a different clique
  – Reading the form of the energy function from an undirected graph is shown next
Graph and Corresponding Energy
• This graph (with five cliques) implies that
    E(a,b,c,d,e,f) = Ea,b(a,b) + Eb,c(b,c) + Ea,d(a,d) + Eb,e(b,e) + Ee,f(e,f)
• We can obtain the ϕ functions by setting each ϕ to the exponential of the corresponding negative energy, e.g., ϕa,b(a,b) = exp(−Ea,b(a,b))
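A tiny numeric sketch (with arbitrary made-up energy tables) confirming the correspondence above: exponentiating the negative of a sum of per-clique energy terms gives exactly the product of per-clique factors ϕC = exp(−EC).

```python
import numpy as np

rng = np.random.default_rng(0)
E = {c: rng.normal(size=(2, 2)) for c in ["ab", "bc", "ad", "be", "ef"]}  # made-up energies

def p_tilde_from_energy(a, b, c, d, e, f):
    total_E = (E["ab"][a, b] + E["bc"][b, c] + E["ad"][a, d]
               + E["be"][b, e] + E["ef"][e, f])
    return np.exp(-total_E)

def p_tilde_from_factors(a, b, c, d, e, f):
    phi = {k: np.exp(-v) for k, v in E.items()}   # phi_C = exp(-E_C)
    return (phi["ab"][a, b] * phi["bc"][b, c] * phi["ad"][a, d]
            * phi["be"][b, e] * phi["ef"][e, f])

print(np.isclose(p_tilde_from_energy(0, 1, 0, 1, 0, 1),
                 p_tilde_from_factors(0, 1, 0, 1, 0, 1)))   # True
```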
Energy-based Model as Experts
• An energy-based model with multiple terms in its energy function can be viewed as a product of experts
• Each term corresponds to a factor in the probability distribution
  – Each term determines whether a soft constraint is satisfied
  – Each expert may impose only one constraint that concerns a low-dimensional projection of the random variables
    • When combined by multiplication of probabilities, the experts together enforce a high-dimensional constraint
Role of negative sign in energy
• The negative sign in p̃(x) = exp(−E(x)) serves no functional purpose from a machine learning perspective
• This sign could instead be incorporated into the definition of the energy function
• It is there mainly for compatibility with the physics literature
• Some ML researchers omit the negative sign and refer to the negative energy as harmony
Free Energy instead of Probability
• Many algorithms that operate on probabilistic models do not need to compute pmodel(x), but only log p̃model(x)
• For energy-based models with latent variables h, these algorithms are phrased in terms of the negative of this quantity, called the free energy
    F(x) = − log Σh exp(−E(x, h))
• Deep learning prefers this formulation
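A minimal sketch of computing the free energy by brute-force marginalization of a binary latent vector h, using log-sum-exp for numerical stability; the bilinear energy below is an arbitrary illustrative choice, not a specific model from the slides.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

def free_energy(x, W, n_h):
    """F(x) = -log sum_h exp(-E(x, h)) with E(x, h) = -x^T W h and binary h."""
    energies = [-(x @ W @ np.array(h)) for h in itertools.product([0, 1], repeat=n_h)]
    return -logsumexp(-np.array(energies))

rng = np.random.default_rng(0)
x, W = rng.normal(size=4), rng.normal(size=(4, 3))
print(free_energy(x, W, n_h=3))
```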
RBM as an Energy Model
Binary version of the RBM:
• Observed layer: a set of nv binary r.v.s, v
• Latent or hidden layer: a set of nh binary r.v.s, h
• Its energy function is
    E(v, h) = −bᵀv − cᵀh − vᵀWh
  where b, c and W are unconstrained, real-valued learnable parameters
• Thus the model is divided into two groups of units, v and h, and the interaction between them is described by the matrix W
• The joint probability distribution is specified by the energy function:
    P(v = v, h = h) = (1/Z) exp(−E(v, h))
  where Z is the partition function, Z = Σv Σh exp(−E(v, h))
• Since Z is intractable, P(v) is also intractable
• Although P(v) is intractable, the bipartite structure of the RBM has the special property that the conditionals P(h | v) and P(v | h) are factorial and easily computed
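A sketch of a tiny binary RBM (sizes and parameter values are arbitrary) that checks the factorial conditional P(hj = 1 | v) = σ(cj + vᵀW:,j), a standard RBM identity, against the brute-force definition exp(−E(v,h))/Z.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
b, c, W = rng.normal(size=n_v), rng.normal(size=n_h), rng.normal(size=(n_v, n_h))

energy = lambda v, h: -(b @ v + c @ h + v @ W @ h)    # E(v,h) = -b^T v - c^T h - v^T W h
sigmoid = lambda z: 1 / (1 + np.exp(-z))

states = lambda n: [np.array(s) for s in itertools.product([0, 1], repeat=n)]
Z = sum(np.exp(-energy(v, h)) for v in states(n_v) for h in states(n_h))

v = np.array([1, 0, 1])                               # an arbitrary visible configuration
p_joint = {tuple(h): np.exp(-energy(v, h)) / Z for h in states(n_h)}
p_v = sum(p_joint.values())                           # P(v) by summing out h
p_h0_given_v = sum(p for h, p in p_joint.items() if h[0] == 1) / p_v
print(np.isclose(p_h0_given_v, sigmoid(c[0] + v @ W[:, 0])))   # True
```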
5. Separation and D-Separation
• Edges in a graphical model tell us which variables directly interact
• We often need to know which variables indirectly interact
• Some of these interactions can be enabled or disabled by observing other variables
• More formally we would like to know which variables are conditionally independent of each other given the values of other sets of variables
Separation in undirected models
• Identifying the conditional independences is very simple in the case of undirected models
  – In this case, conditional independence implied by the graph is called separation
  – A set of variables A is separated from a set of variables B given a third set of variables S if the graph implies that A is independent of B given S
  – If two variables a and b are connected by a path involving only unobserved variables, then those variables are not separated
    • If no path exists between them, or all paths contain an observed variable, then they are separated
Separation in undirected graphs
• In the example graph, b is shaded to indicate that it is observed
• b blocks the path from a to c, so a and c are separated given b
• There is an active path from a to d, so a and d are not separated given b
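A small sketch of the separation test as graph reachability: a and b are separated given an observed set S exactly when b is unreachable from a once the observed nodes are removed. The graph below assumes edges a–b, b–c and a–d, which is consistent with (though not stated explicitly in) the example above.

```python
from collections import deque

def separated(graph, a, b, observed):
    """True if every path from a to b passes through an observed node (a, b assumed unobserved)."""
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr == b:
                return False                      # reached b through unobserved nodes only
            if nbr not in seen and nbr not in observed:
                seen.add(nbr)
                queue.append(nbr)
    return True

# Assumed example graph: a - b - c plus a - d, with b observed.
graph = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b"}, "d": {"a"}}
print(separated(graph, "a", "c", {"b"}))   # True:  b blocks the only path from a to c
print(separated(graph, "a", "d", {"b"}))   # False: a - d is an active path
```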
Separation in Directed Graphs
• In the context of directed graphs, these separation concepts are called d-separation
  – The “d” stands for “dependence”
• D-separation is defined the same way as separation for undirected graphs:
  – A set of variables A is d-separated from a set of variables B given a third set of variables S if the graph structure implies that A is independent of B given S
Examining Active Paths
• Two variables are dependent if there is an active path between them
• They are d-separated if no such active path exists between them
• In directed nets, determining whether a path is active is more complicated
• A guide to identifying active paths in a directed model is given next
All active paths of length 2
• (Figure) Active paths of length two between random variables a and b
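Since the figure itself is not reproduced here, the sketch below encodes the standard rules it illustrates: a length-2 path a – s – b is active or blocked depending on the structure at s (chain, common cause, or collider) and on what is observed. These are the textbook d-separation rules, stated as a summary of standard theory rather than content taken directly from the slide.

```python
def length2_path_active(structure, s_observed, s_descendant_observed=False):
    """Is the length-2 path a - s - b active?  structure is the orientation at s:
    'chain' (a -> s -> b or a <- s <- b), 'fork' (a <- s -> b), 'collider' (a -> s <- b)."""
    if structure in ("chain", "fork"):
        return not s_observed                        # blocked exactly when s is observed
    if structure == "collider":
        return s_observed or s_descendant_observed   # active only if s or a descendant is observed
    raise ValueError(structure)

print(length2_path_active("chain", s_observed=True))      # False: observing s blocks the chain
print(length2_path_active("collider", s_observed=False))  # False: an unobserved collider blocks
```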
Reading properties from a graph