
Variational Inference in Graphical Models

Nikki Levering

July 13, 2017

Bachelor Thesis

Supervisor: dr. B.J.K. Kleijn

Korteweg-de Vries Instituut voor Wiskunde

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

Universiteit van Amsterdam


Abstract

A graphical model is a graph in which the nodes index random variables and the edges define the dependence structure between the random variables. There are different methods of computing probability distributions in such graphical models. However, there are graphical models in which exact algorithms take, in the best case, exponential time to compute probabilities, making approximating methods desirable in large graphical models. Variational inference is an approximation method that uses convex analysis to transform the computation of a probability into an optimization problem, in graphical models where the conditional probability distributions are exponential families. This optimization problem concerns the cumulant function. However, the optimization itself is often also difficult, because the set that is optimized over is often too complex. Mean Field approximation deals with this problem by taking only a subset of the original set the optimum is taken over. This results in a lower bound for the cumulant function that needs to be maximized. Maximizing this lower bound is equivalent to minimizing the Kullback-Leibler divergence between the true distribution and all tractable distributions. There are certain algorithms that deal with minimizing this Kullback-Leibler divergence in graphical models, of which Variational Message Passing is one.

Title: Variational Inference in Graphical Models
Author: Nikki Levering, [email protected], 10787992
Supervisor: dr. B.J.K. Kleijn
Second reader: prof. dr. J.H. van Zanten
Date: July 13, 2017

Korteweg-de Vries Instituut voor Wiskunde
Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam
http://www.science.uva.nl/math


Contents

1. Introduction

2. Graphical Models
   2.1. Directed Graphical models
   2.2. Undirected Graphical models

3. Exact Algorithms
   3.1. The elimination algorithm
   3.2. NP-completeness

4. Variational Inference
   4.1. Transforming the cumulant function
   4.2. Mean Field Approximation
   4.3. Variational Message Passing

5. Applications
   5.1. Approximation of the posterior
   5.2. Variational Bayes

6. Conclusion and Discussion

7. Popular summary (in Dutch)

Bibliography

A. Proofs, chapter 2

B. Proof, chapter 3

C. Proofs, chapter 4


1. Introduction

Inference in statistics is a process in which data is analyzed to deduce the properties of the underlying distribution of the data. In this thesis we will focus on inference in graphical models. A directed (resp. undirected) graphical model is a directed acyclic (resp. undirected) graph G = (V, E) (where V are the nodes and E the edges of the graph). The nodes index a collection of random variables {Xv : v ∈ V}. The edges in a graphical model indicate the dependence structure of the graphical model. The intuitive interpretation of the dependence structure given by these edges is that if there is an edge from a node X1 to a node X2, the distribution of X2 will depend on the value of X1. Examples of a directed and an undirected graphical model are shown in figure 1.1.

[Figure 1.1: Example of a directed graphical model (left), with nodes that index the random variables X1, . . . , X4, and an example of an undirected graphical model (right), in which the random variables Y1, Y2, Y3 are indexed by the nodes.]

The goal of statistical inference in graphical models is to compute the underlying distributions in the graphical model. When only the graphical model but no data is available, one is often interested in a marginal distribution or the joint probability distribution for the model. However, when the values of some of the random variables are already observed, one often wants to compute the probability distribution of the variables not observed, conditioned on the observed variables. The nodes corresponding to these observed variables are then called evidence nodes.

Graphical models are used not only in Mathematics, but also in Physics, Artificial Intelligence and Biology. For example, in Biology one makes use of the so-called QMR-DT database, which is a probabilistic database used for the diagnosis of diseases in Medicine [7]. The database can be represented as a bipartite graphical model, where one layer of nodes represents symptoms of diseases and the other layer of nodes represents the diseases themselves, and where a directed edge is present from disease to symptom if it is known that that disease can cause that symptom (see figure 1.2). The data is given by the observed symptoms of a patient and the marginal distributions for the diseases. Now statistical inference can be performed to deduce the conditional probability of finding these symptoms given the diseases, but other underlying distributions of the data can also be deduced.


This example from Medicine shows that statistical inference in graphical models has applications in other fields of study, which makes it a very relevant subject.

There are several methods of performing statistical inference in graphical models. These can be separated into exact and approximation methods, and the focus in this thesis will be on certain methods of the latter kind. Exact statistical inference computes the underlying distribution in an exact way, meaning that the properties deduced for the underlying distribution of the given data are the properties of the true distribution. Approximation methods for statistical inference only approximate the true distribution, which would make these kinds of methods less desirable. However, as will be shown in this thesis, exact algorithms for performing statistical inference in graphical models are time consuming, especially when the models are large, and therefore the approximation methods are nevertheless very relevant.

As said, we will in this thesis focus on certain methods of statistical inference that approximate the true distribution. These methods are called variational methods and their use is called variational inference. In performing variational inference in graphical models we will focus on those graphical models for which the joint probability distribution is finitely parameterized and in exponential family form. This means we can write, for a graphical model with nodes that index the dependent random variables X1, . . . , Xn (with realized values x1, . . . , xn) and some parameter vector θ:

pθ(x1, . . . , xn) = exp{⟨θ, φ(x1, . . . , xn)⟩ − A(θ)},

where φ(x1, . . . , xn) is a vector of complete and sufficient statistics and A(θ) is defined as

A(θ) = log ∫ exp{⟨θ, φ(x1, . . . , xn)⟩} ν(d(x1, . . . , xn)).

For more details about the exponential family, see the beginning of chapter 4.

The idea behind variational inference is to identify a probability distribution as the solution of an optimization problem. For exponential families it turns out that computing A(θ) can be identified with the following optimization problem:

A(θ) = sup_{µ∈M} {⟨θ, µ⟩ − A∗(µ)}, (1.1)

where the set of realizable mean parameters M and the conjugate dual A∗(µ) will be defined in section 4.1. So computing the underlying distribution of the data in a graphical model where the joint probability distribution is in exponential family form is reduced to solving the optimization problem in (1.1). However, solving this problem is not necessarily easy, because A∗(µ) often does not have an explicit form and M may be such that optimization over this set is hard.

A method for solving this optimization problem is given by Mean Field approximation, which uses only distributions for which M and A∗(µ) can be identified, called tractable distributions. As it turns out, solving the optimization problem then equals minimizing the so-called Kullback-Leibler divergence, which is defined as

D(q||p) = ∫_X q(x) log [q(x)/p(x)] ν(dx),

for the true distribution p and a tractable distribution q. So variational inference in graphical models with a joint probability distribution in exponential family form is just minimizing the Kullback-Leibler divergence between the true distribution and all tractable distributions.

[Figure 1.2: The structure of the QMR-DT graphical model. A bipartite graphical model where the upper layer of nodes represents diseases and the lower layer represents symptoms. The grey nodes are the symptoms that have been observed. [7]]
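To make the definition of D(q||p) above concrete, the following short sketch (an illustration added here, not part of the thesis) computes the KL divergence for two discrete distributions, where the integral becomes a sum; the distributions q and p are arbitrary examples.

    import math

    def kl_divergence(q, p):
        # D(q||p) = sum_x q(x) * log(q(x) / p(x)) for discrete distributions;
        # the sum plays the role of the integral in the definition above.
        return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0)

    q = {"a": 0.5, "b": 0.3, "c": 0.2}   # illustrative "tractable" distribution
    p = {"a": 0.4, "b": 0.4, "c": 0.2}   # illustrative "true" distribution

    print(kl_divergence(q, p))           # nonnegative, and 0 only if q equals p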

The goal of this thesis is to study how variational inference in graphical models works. To achieve this goal, in chapter 2 we will first look at graphical models and how to define probability distributions for the random variables in the graphical models. Then in chapter 3 we will look at exact inference by giving an example of an exact algorithm. We will also prove that exact algorithms are time consuming. Next we will analyze variational inference and Mean Field approximation in chapter 4, along with, in the same chapter, the Variational Message Passing algorithm, which is an example of an algorithm performing variational inference. In the last chapter, chapter 5, some statistical applications of variational inference are shown, with a focus on applications in Bayesian statistics.


2. Graphical Models

Before one can perform statistical inference in graphical models, one needs to know how the probabilities in such models are defined. So a structure on the models has to be defined to indicate how probabilities in graphical models can be computed. The dependence of the random variables in a graphical model is given by the edges in the graph. This structure differs for directed and undirected graphical models, so these cases will be treated separately.

2.1. Directed Graphical models

In a directed graphical model the direction of the edges between the nodes defines the dependencies between the random variables indexed by the nodes.

Definition 2.1.1. Let G = (V, E) be a directed acyclic graph, where V = {V1, . . . , Vn}, n ∈ N.

(i) When Vi → Vj ∈ E, for some i, j ∈ {1, . . . , n}, Vi is called a parent of Vj in G. Denote the set of all parents of a node Vi as πVi (i ∈ {1, . . . , n}).

(ii) Vj is called a descendant of Vi if there exists a directed path from Vi to Vj, i, j ∈ {1, . . . , n}. [9]

In the directed graphical model in figure 1.1, X1 → X2, so X1 is a parent of X2. The absence of another directed edge towards X2 gives πX2 = {X1}. In the graph, X1 has a path to X2, X3 and X4, leading to the conclusion that X2, X3 and X4 are descendants of X1.

To compute probabilities in a directed graphical model, the semantics of this model have to be defined. There are two ways to define the semantics of a directed graphical model: one regarding the local probabilities and one regarding the global probabilities in the model. To compute probabilities, consistency between both definitions is needed. As it turns out, the two definitions are indeed equivalent, which means that probabilities in a directed graphical model can be computed. A directed graphical model satisfying either of the definitions is called a Bayesian network. In this section the two definitions regarding the semantics of a Bayesian network are given, together with a theorem that proves their equivalence.

Definition 2.1.2. Let G = (V, E) be a directed acyclic graphical model with {Xv : v ∈ V} a collection of random variables indexed by the nodes. Then G encodes the following set of conditional independence assumptions, called the local independencies or local Markov assumptions, for i, j ∈ N where Xj is not a descendant of Xi:

p(Xi, Xj|πXi) = p(Xi|πXi)p(Xj|πXi).

Denote this conditional independence by (Xi ⊥ Xj | πXi). [9]

From the definition it follows that, given its parents, each node Xi is conditionally independent of the nodes that are not its descendants. A directed graphical model is by definition acyclic. This property is very important, because it means that a node Xj cannot be both a descendant and a parent of a node Xi. If this were possible, Xi would be a parent of Xj, Xj in turn a parent of Xi, which also implies that Xi is a parent of itself. This gives rise to an infinite loop, where Xi depends on Xj, Xj on Xi, Xi again on Xj, and so on. To avoid this situation it is indeed important for a directed graphical model to be acyclic.

[Figure 2.1: The three possible directed acyclic graphical models consisting of three nodes.]

Example 2.1.3. The three structures for a directed acyclic graphical model consisting of three nodes are shown in figure 2.1. In the leftmost model there is conditional independence between the random variables X2 and X3 given the value of X1, due to the fact that there is no path from either X2 to X3 or from X3 to X2, and X1 is a parent of X2. This leads to the following local independence:

p(X2, X3|X1) = p(X2|X1)p(X3|X1).

In the graphical model with random variables Y1, Y2 and Y3, there is a local independence between Y1 and Y3 given the value of Y2, because Y2 is a parent of Y3 and Y1 is not a descendant of Y3. This gives us

p(Y1, Y3|Y2) = p(Y1|Y2)p(Y3|Y2).

In the rightmost graphical model the random variables Z1 and Z2 are not conditionally independent given Z3:

p(Z1, Z2|Z3) ≠ p(Z1|Z3)p(Z2|Z3).

Using the local Markov assumptions, a dependence structure is defined on Bayesian networks from which local probabilities can be computed. From the other definition regarding the semantics of the Bayesian network, global probabilities can be computed. That definition is the following:


Definition 2.1.4. Let G = (V, E) be a directed acyclic graphical model, with {Xv : v ∈ V} a collection of random variables indexed by the nodes of G. A distribution p over the same space factorizes according to G if p can be expressed as a product

p(X1, . . . , Xn) = ∏_{i=1}^{n} p(Xi|πXi). (2.1)

This equation is called the chain rule for Bayesian networks and the factors p(Xi|πXi) local probabilistic models. [9]
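To illustrate the chain rule (2.1), the following sketch (my own; the network X1 → X2 → X3 and its conditional probability tables are hypothetical, not taken from the thesis) evaluates a joint probability as the product of local conditional probabilities.

    # Hypothetical Bayesian network X1 -> X2 -> X3 with binary variables.
    parents = {"X1": [], "X2": ["X1"], "X3": ["X2"]}

    # cpt[node][parent values][value] = p(node = value | parents = parent values)
    cpt = {
        "X1": {(): {0: 0.6, 1: 0.4}},
        "X2": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
        "X3": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.5, 1: 0.5}},
    }

    def joint(assignment):
        # Chain rule: p(X1, ..., Xn) = product over i of p(Xi | parents of Xi).
        prob = 1.0
        for node, pa in parents.items():
            pa_values = tuple(assignment[p] for p in pa)
            prob *= cpt[node][pa_values][assignment[node]]
        return prob

    print(joint({"X1": 1, "X2": 0, "X3": 1}))   # 0.4 * 0.2 * 0.1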

This definition provides a global structure on the Bayesian network. It can now be proven that both definitions for the independence structure on a Bayesian network are in fact equivalent, as stated in the following theorem, for which the proof can be found in appendix A:

Theorem 2.1.5. Let G = (V, E) be a Bayesian network over a set of random variables {Xv : v ∈ V} and let p be a joint distribution over the space in which {Xv : v ∈ V} are defined. All the local Markov assumptions associated with G hold in p if and only if p factorizes according to G. [10]

The equivalence between the two definitions of a Bayesian network leads to consistency and gives a structure on the Bayesian network that allows for computing both local and global probability distributions.

[Figure 2.2: Graphical model of a random variable X with parameter λ and hyperparameter η.]

Example 2.1.6. To show how graphical models can be used, an example will be given which also shows the connection between graphical models and Bayesian statistics. Let X be a random variable with a Poisson(λ) distribution. Let λ not be known exactly, but also have a Poisson distribution, with parameter η. Then η is a hyperparameter for λ. Let η have a Γ(α, β) distribution, where α and β are known. A directed graphical model can now be drawn, as shown in figure 2.2. From Bayesian statistics it is known that X|λ, η = X|λ. So:

X|λ ∼ Poisson(λ) λ|η ∼ Poisson(η) η ∼ Γ(α, β)

Because X|λ, η = X|λ, the local Markov assumptions hold and thus the directed graphical model is a Bayesian network. This means that the joint probability distribution p(X, λ, η) should be a factorised distribution:

p(X,λ, η) = p(X|λ)p(λ|η)p(η).

This is indeed the case, because:

p(X,λ, η) = p(X|λ, η)p(λ, η) = p(X|λ)p(λ|η)p(η).
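The factorised joint density of this example can be evaluated numerically; the sketch below (my own, assuming SciPy is available and reading Γ(α, β) as shape α and rate β, so that SciPy's scale equals 1/β) simply multiplies the three factors. The values of α, β and the evaluation point are illustrative.

    from scipy.stats import poisson, gamma

    alpha, beta = 2.0, 1.5   # illustrative hyperparameters of the Gamma prior

    def joint_density(x, lam, eta):
        # p(X = x, lambda = lam, eta) = p(x | lam) * p(lam | eta) * p(eta).
        # Note that lambda is itself Poisson distributed, hence integer valued.
        return (poisson.pmf(x, lam)                         # X | lambda ~ Poisson(lambda)
                * poisson.pmf(lam, eta)                     # lambda | eta ~ Poisson(eta)
                * gamma.pdf(eta, alpha, scale=1.0 / beta))  # eta ~ Gamma(alpha, beta)

    print(joint_density(x=3, lam=2, eta=1.7))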


2.2. Undirected Graphical models

Having defined the structure of a Bayesian network, we now turn to the structure of an undirected graphical model. Again a local and a global definition of the independence structure in the model are presented, which turn out to be equivalent. An undirected graphical model with this structure is called a Markov Random Field (MRF).

Definition 2.2.1. Let H = (X, D) be an undirected graph. Then for each node X ∈ X the Markov blanket NH(X) of X is the set of neighbors of X in the graph. The local Markov independencies associated with H are defined to be

Iℓ(H) = {(X ⊥ X − {X} − NH(X) | NH(X)) : X ∈ X}.

In other words, X is conditionally independent of the rest of the nodes in the graph given its immediate neighbors. [10]

Example 2.2.2. In the undirected graphical model in figure 2.1 these local Markov independencies translate to Y1 and Y3 being conditionally independent given Y2, because Y2 is an immediate neighbor of Y3, but Y1 is not.

The global structure of an MRF is defined by its joint probability function. The joint probability distribution of an MRF is computed in terms of factors:

Definition 2.2.3. Let Y be a set of random variables. We define a factor to be a function from dom(Y) to R+. [10]

Definition 2.2.4. Let H = (X, D) be a Markov Random Field, where the nodes index the random variables X1, . . . , Xn. A distribution p factorizes over H if it is associated with

• a set of subsets Y1, . . . , Ym, where each Yi is a complete subgraph of H, meaning that every two nodes in Yi share an edge;

• factors ψ1[Y1], . . . , ψm[Ym],

such that

p(X1, . . . , Xn) = (1/Z) p′(X1, . . . , Xn), (2.2)

where

p′(X1, . . . , Xn) = ψ1[Y1] × ψ2[Y2] × · · · × ψm[Ym]

is a measure which is normalized by Z = ∑_{X1,...,Xn} p′(X1, . . . , Xn). Z is called the partition function. A distribution p that factorizes over H is also called a Gibbs distribution over H. [10]

Example 2.2.5. In the MRF of figure 2.3 the subgraph consisting of the nodes X4, X5 and X6 is complete and will be denoted by Y4. There are three other complete subgraphs of the MRF, denoted by Y1, Y2, Y3 and shown in the right model of figure 2.3. On each of these subgraphs Yi a factor ψi[Yi] can be defined. This factor depends only on the values of the nodes in the complete subgraph and is positive. The Gibbs distribution p on the MRF becomes, with Z a normalising constant:

p(X1, . . . , X6) = (1/Z) ψ1[Y1] · ψ2[Y2] · ψ3[Y3] · ψ4[Y4].

[Figure 2.3: Undirected graphical model (left) and a model of its complete subgraphs (right), with Y1 = {X1, X3}, Y2 = {X2, X3}, Y3 = {X3, X4, X5} and Y4 = {X4, X5, X6}.]
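The role of the partition function Z can be made explicit in code; the sketch below (an added illustration with arbitrary positive factor values, not from the thesis) builds a Gibbs distribution for a toy MRF with binary nodes X1, X2, X3 connected in a chain, with factors on the complete subgraphs {X1, X2} and {X2, X3}.

    from itertools import product

    # Arbitrary positive factor values, as required of factors.
    psi1 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}   # psi1[X1, X2]
    psi2 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}   # psi2[X2, X3]

    def p_unnormalized(x1, x2, x3):
        # p'(X1, X2, X3) = psi1[X1, X2] * psi2[X2, X3].
        return psi1[(x1, x2)] * psi2[(x2, x3)]

    # Partition function: sum of p' over all joint assignments.
    Z = sum(p_unnormalized(*xs) for xs in product([0, 1], repeat=3))

    def p(x1, x2, x3):
        # The Gibbs distribution p = p' / Z.
        return p_unnormalized(x1, x2, x3) / Z

    print(Z, p(1, 1, 0))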

Theorem 2.2.6. Let H = (X, D) be an MRF and p a Gibbs distribution over H. Then all the local Markov properties associated with H hold in p. [10]

This theorem can be proven by showing that p(Xi | X \ {Xi}) = p(Xi | NH(Xi)). The right-hand side of the equation can be written out, using the fact that a Gibbs distribution exists for the MRF and making use of Bayes' theorem, which leads to the left-hand side of the equation. Because those are the only elements needed in the proof, the proof is omitted here; it can be found in [2].

From this theorem it can be concluded that the global structure of an MRF, as defined by its joint probability distribution, implies that the local Markov properties hold. In fact, it can be proven that the reverse implication also holds, making both structures equivalent. This result is known as the Hammersley-Clifford theorem.

Theorem 2.2.7. (Hammersley-Clifford) Let H = (X, D) be an MRF and p a positive distribution over X, meaning p(x) > 0 for all x ∈ Val(X). If the local Markov properties implied by H hold in p, then p is a Gibbs distribution over H. [10]

A proof of the Hammersley-Clifford theorem can be found in appendix A.


3. Exact Algorithms

Now that the local and global structure of Bayesian networks and Markov Random Fields are defined, the joint probability distribution is known to be a product of conditional probability distributions, respectively a product of factors with a normalization constant. A marginal distribution for a certain random variable Xi, corresponding to a node in the graph, can now be found by summing over all other variables in the joint probability distribution. However, as can be seen in example 3.1.1, the computational complexity of computing a marginal distribution in this way can be reduced significantly by using exact inference algorithms. The next section treats an exact inference algorithm, the so-called elimination algorithm, to see how this computational complexity can be reduced. When the elimination algorithm is understood, the chapter is closed with some general results about exact inference algorithms and their computational complexity.

3.1. The elimination algorithm

Before treating the elimination algorithm formally, the following example is presented, showing how the elimination algorithm is used and how it reduces computational complexity.

[Figure 3.1: An MRF H with the discrete random variables X1, . . . , X5.]

Example 3.1.1. Consider the MRF H presented in figure 3.1, for which the marginal of X1 is sought. Assume that the cardinality of each variable is r, so |Xi| = r for i = 1, . . . , 5. It can be seen that when taking Y1 = {X1, X2, X3}, Y2 = {X2, X4} and Y3 = {X2, X3, X5}, each Yi is a complete subgraph of H. This means that when introducing the factors ψi[Yi] the joint distribution function becomes

p(X1, . . . , X5) = (1/Z) ψ1[Y1] ψ2[Y2] ψ3[Y3] = (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] ψ3[X2, X3, X5],

where Z is the corresponding partition function normalizing the distribution function. The marginal of X1 can now be obtained by summing over the remaining variables, because the variables are discrete. This gives the following marginal:

p(X1) = ∑_{X2} ∑_{X3} ∑_{X4} ∑_{X5} (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] ψ3[X2, X3, X5].

The computational complexity is now r^5, because each summation is applied to a term containing 5 variables of cardinality r [6]. However, note that the variable X5 does not appear in ψ1[X1, X2, X3] or ψ2[X2, X4]. The distributive law then gives rise to the following:

∑_{X5} (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] ψ3[X2, X3, X5]
  = (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] ∑_{X5} ψ3[X2, X3, X5]
  = (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] m5(X2, X3),

where m5(X2, X3) := ∑_{X5} ψ3[X2, X3, X5]. Making use of the distributive law repeatedly, the marginal of X1 can be computed as follows:

p(X1) = ∑_{X2} ∑_{X3} ∑_{X4} ∑_{X5} (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] ψ3[X2, X3, X5]
      = (1/Z) ∑_{X2} ∑_{X3} ∑_{X4} ψ1[X1, X2, X3] ψ2[X2, X4] m5(X2, X3)
      = (1/Z) ∑_{X2} ∑_{X3} ψ1[X1, X2, X3] m5(X2, X3) m4(X2)
      = (1/Z) ∑_{X2} m4(X2) m3(X1, X2)
      = (1/Z) m2(X1),

with m4(X2) = ∑_{X4} ψ2[X2, X4], m3(X1, X2) = ∑_{X3} ψ1[X1, X2, X3] m5(X2, X3) and m2(X1) = ∑_{X2} m4(X2) m3(X1, X2). The number of variables that appear together in a summation is no more than 3, reducing the computational complexity from r^5 to r^3 [6], which for large r is a significant improvement.

Computing the marginal of X1 using the distributive law, as in the example above, is an instance of the elimination algorithm: in each step a new variable is eliminated by summing out this variable, where the summation is only performed over the factors which contain the variable to be eliminated.

The variable elimination algorithm will now be stated more formally, based on a description in [9]. Let Ψ be a set of factors and W a subset of k variables which are to be eliminated. Let W1, . . . , Wk be an ordering of W. When the current set of factors is Ψi, eliminating the variable Wi is done by partitioning Ψi into two sets: one with those factors that have Wi as an argument and one with those factors that do not. The second set, denoted Ψ′′i, is the complement of the first set, denoted Ψ′i, such that Ψ′i ∪ Ψ′′i = Ψi. Now a new factor φ can be defined as the product of all factors with Wi as an argument: φ := ∏_{ψ∈Ψ′i} ψ. Next define τ := ∑_{Wi} φ as the factor φ with Wi summed out. The variable Wi is now eliminated, and a new set of factors Ψ_{i+1}, from which the next variable is to be eliminated, can be defined as Ψ_{i+1} = Ψ′′i ∪ {τ}. Now, given an ordering W1, . . . , Wk, first W1 is eliminated (with Ψ1 := Ψ), then W2, etc. When Wk has been eliminated, Ψ_{k+1} is the set of factors that remains and, by construction of the algorithm, no factor in Ψ_{k+1} contains any Wi, i ∈ {1, . . . , k}, as an argument.

Let Y be a set of variables for which the probability distribution is sought in a Bayesian network consisting of the variables X = {X1, . . . , Xn}. This probability distribution p can now be found by the use of the above algorithm, eliminating the variables W = X − Y. This can be done by choosing the set of factors as the set of conditional probability distributions: Ψ = {ψXi}_{i=1}^{n}, where ψXi = p(Xi|πXi). For an MRF consisting of the variables X = {X1, . . . , Xn}, with Y ⊂ X the set of variables for which a probability distribution is sought, the algorithm works in the same way, only now the set of factors is chosen as the set of factors the joint probability distribution consists of.
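The elimination step described above can be sketched compactly in code. The representation below (variables assumed binary, a factor stored as a pair of a variable list and a value table) and the helper names are my own, not the thesis's; with the three factors of example 3.1.1 and the ordering (X5, X4, X3, X2), the function would leave a single factor over X1, which after normalization is the marginal p(X1).

    from itertools import product

    def multiply(f, g):
        # Pointwise product of two factors given as (variables, table).
        fv, ft = f
        gv, gt = g
        out_vars = list(dict.fromkeys(fv + gv))   # union of scopes, order preserving
        table = {}
        for assignment in product([0, 1], repeat=len(out_vars)):
            a = dict(zip(out_vars, assignment))
            table[assignment] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
        return out_vars, table

    def sum_out(f, var):
        # tau := the factor f with `var` summed out.
        fv, ft = f
        keep = [v for v in fv if v != var]
        table = {}
        for assignment, value in ft.items():
            key = tuple(a for v, a in zip(fv, assignment) if v != var)
            table[key] = table.get(key, 0.0) + value
        return keep, table

    def eliminate(factors, order):
        # One pass of the elimination algorithm for the given ordering W1, ..., Wk.
        for var in order:
            with_var = [f for f in factors if var in f[0]]       # the set Psi'_i
            without = [f for f in factors if var not in f[0]]    # the set Psi''_i
            if not with_var:                                     # variable absent: nothing to do
                continue
            phi = with_var[0]
            for f in with_var[1:]:
                phi = multiply(phi, f)                           # phi = product over Psi'_i
            factors = without + [sum_out(phi, var)]              # Psi_{i+1} = Psi''_i plus tau
        return factors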

3.2. NP-completeness

As shown in example 3.1.1, the computational complexity of obtaining a marginal is reduced significantly by the use of the variable elimination algorithm instead of summing out all variables over all factors. The elimination algorithm is not the only exact algorithm to compute a marginal probability distribution on Bayesian networks and Markov Random Fields. However, there are results which hold for all these exact algorithms and concern their computational complexity.

First the classes NP, P and #P are introduced. Certain problems can be categorized into these classes based on their computational complexity. Computing probabilities in graphical models using exact algorithms is such a problem. The goal in this section will be to prove that in the best case exponential time is needed to solve such problems. This means that there exist graphical models for which the use of such exact algorithms takes exponential time. If such a graphical model is large, the use of exact algorithms in this model will thus be very time consuming. Any information given on NP, P and #P is based on Cook [3], Koller and Friedman [9], [10] and Valiant [15].

Definition 3.2.1. A decision problem is a problem for which the return values are only true or false (or yes or no, or 0 or 1). P is the class of all decision problems that can be solved in polynomial time. NP is the class of all decision problems for which a positive answer can be verified in polynomial time. Solvability in polynomial time means that a Turing machine needs, in the worst case, polynomial time to solve the problem.


The following example is a decision problem that is in NP.

Example 3.2.2. In a SAT problem a formula in propositional logic is taken as input. True is returned when there is an assignment which satisfies the formula; if no such assignment exists, false is returned. An example is the formula q1 ∨ q2. For this formula true is returned, q1 = true, q2 = true being a satisfying assignment. However, for the formula q1 ∧ ¬q1 no satisfying assignment exists, resulting in false as output. A restricted version of the SAT problem is the 3-SAT problem.

A formula φ is said to be a 3-SAT formula over the Boolean variables q1, . . . , qn if φ is a so-called conjunction: φ = C1 ∧ · · · ∧ Cm. Each Ci is a clause of the form ℓi,1 ∨ ℓi,2 ∨ ℓi,3. Each ℓi,j (i = 1, . . . , m; j = 1, 2, 3) is either qk or ¬qk for some k = 1, . . . , n. Given a 3-SAT formula, the decision problem is whether this formula can be satisfied. So the decision problem is whether there is an assignment for q1, . . . , qn that makes φ return true. If there were an algorithm producing such an assignment in polynomial time for all 3-SAT formulas, the problem would be in P. Suppose a positive answer is given, so an assignment for q1, . . . , qn that makes φ return true. If such an assignment can be verified in polynomial time, for all φ and all positive answers, then the problem is in NP.

As it turns out, the 3-SAT decision problem is in NP. An example of a 3-SAT formula is φ = (q1 ∨ ¬q2 ∨ ¬q3) ∧ (q1 ∨ q2 ∨ ¬q3). Trivially, q1 = true, q2 = false, q3 = false is an assignment of q1, . . . , q3 that makes φ return true. Also, this can trivially be verified in polynomial time. This is in general the case with 3-SAT problems, so the 3-SAT decision problem is in NP. However, no one has yet found an algorithm that proves the problem is also in P.

Instead of wanting to check a positive answer for a decision problem, sometimes it is preferred to know how many positive solutions there are. For example, the following problem is in NP: given a graph G and k ∈ N, does G have a complete subgraph of size greater than or equal to k? However, if knowledge about the number of complete subgraphs of size greater than or equal to k is desired, the answer to this NP-problem does not suffice. The question becomes: given a graph G and k ∈ N, how many complete subgraphs of size greater than or equal to k does G have? This was one example, but for every problem in NP the number of positive answers can be counted. The class #P is the class of these counting problems associated with the decision problems in NP. The counting version of the 3-SAT problem of example 3.2.2 is in #P, because every satisfying assignment of a 3-SAT formula can be verified in polynomial time.

Many mathematicians consider it very unreasonable to assume P = NP, but to the present day it has not been proven that this is indeed the case.

Definition 3.2.3. A problem is called NP-complete if it is in NP and it is NP-hard: the polynomial solvability of this problem would imply that all other problems in NP are solvable in polynomial time as well.

The definition of NP-completeness states that it is possible to transform any problem in NP in polynomial time to the NP-complete problem, such that the solution of the NP-complete problem also gives a solution to the original problem in NP. This means that if for one NP-complete problem it can be proven that it is solvable in polynomial time, all problems in NP can be solved in polynomial time, from which it can be concluded that P = NP. However, as stated above, if one assumes P ≠ NP, then no NP-complete problem is solvable in polynomial time.

To provide information about the computational complexity of computing distributions in graphical models using exact algorithms, two theorems will be stated that indicate in which computational complexity class this problem lies. The first theorem is the following:

Theorem 3.2.4. The following problem is NP-complete: given a Bayesian network G over X, a variable X ∈ X, and a value x ∈ Val(X), decide whether p(X = x) > 0.

This result states that in a Bayesian network, given an x ∈ Val(X) for which p(X = x) > 0, this positive answer can be verified in polynomial time. The problem being NP-complete leads to the conclusion that, if P ≠ NP, the computational complexity of deciding whether p(X = x) > 0 cannot be polynomial. So in the best case the computational complexity is exponential. A proof of the theorem can be found in appendix B.

This result only concerns the problem of deciding whether p(X = x) > 0 in a Bayesian network G with a variable X. Often one is more interested in computing the actual probability p(X = x). A last result that will be stated concerns this problem and is the following, where #P-completeness is defined in the same way as NP-completeness, only for the class #P:

Theorem 3.2.5. The following problem is #P-complete: given a BN over X, a variable X ∈ X and a value x ∈ Val(X), compute p(X = x).

Of course, because #P concerns counting problems instead of decision problems, a #P problem cannot be an NP problem. It turns out that counting the number of satisfying assignments of a 3-SAT formula is the canonical #P-complete problem. Although this counting problem cannot be in NP, it is NP-hard, because if it can be solved then the 3-SAT decision problem can be solved. The general belief is that the counting version of the 3-SAT problem is more difficult than the decision problem itself. From this and the theorem it can again be concluded that there exist Bayesian networks for which computing p(X = x) can, in the best case, be done in exponential time.

Thus the computational complexity of computing exact probabilities in Bayesian networks is exponential. This indicates that for some large Bayesian networks, exact algorithms are time consuming. Because many graphical models in Artificial Intelligence and Physics are very large, good approximation algorithms for computing probabilities in graphical models that run in polynomial time are desirable. In the next chapter one approximation method is analyzed, called variational inference.


4. Variational Inference

Let a graphical model G be given, where the nodes index the random variables X = (X1, . . . , Xn), whose dependence is defined by the edges of the graphical model. Let X take values in X = X1 × X2 × · · · × Xn. The goal of statistical inference in graphical models is to deduce the underlying distribution of these random variables. A method to do this is variational inference. This is an approximating inference method, which transforms the problem of computing a distribution into an optimization problem. It is widely used because it is fast. The analysis of variational inference as presented in this chapter is based on an analysis by Wainwright and Jordan in [17] and [16] and an analysis by Blei [1].

In studying variational inference, we will focus on exponential families. The reason for this is that exponential families are mathematically tractable and versatile modelling tools [8]. The most well-known probability distributions, such as the Gaussian, Exponential, Poisson and Gamma distributions, are exponential families. Example 4.0.1 will show that, with respect to the Lebesgue measure, the collection of normal distributions forms an exponential family.

The density of the exponential family is taken with respect to some measure ν which has density h : X → R+ with respect to dx, the components of dx being counting measures for a discrete space or, in the continuous case, Lebesgue measures. An exponential family can now be written in the following form, where x ∈ X:

pθ(x) = exp{⟨θ, φ(x)⟩ − A(θ)}, (4.1)

with ⟨·, ·⟩ the standard inner product on Rd. Here, φ = (φα : X → R)α∈I is a vector of functions from X to Rd, where I is an index set for the φα and d = |I|. The φα are functions (in fact, they are sufficient and complete statistics). θ = (θα)α∈I is called the vector of canonical parameters associated with φ and A(θ) the cumulant generating function or log partition function. A(θ) normalizes pθ to make it a density and is thus defined by the following logarithm:

A(θ) = log ∫_X exp{⟨θ, φ(x)⟩} ν(dx).

However, A(θ) normalizes pθ only if A(θ) is finite. Thus, only the canonical parameters θ contained in the set Ω = {θ ∈ Rd | A(θ) < ∞} will be taken into account as possible parametrizations.

The method of variational inference for exponential families in graphical models is now as follows. Assume that the factors in equation (2.1) (or (2.2), when the graphical model G is undirected) can be expressed in exponential family form, relative to a common measure ν. The joint distribution, which is a product of such factors, is then also in exponential family form and can be written as (4.1). In addition, assume that the parameter space Ω is open. Exponential families for which Ω is open are called regular. The inference problem is to find the underlying distribution of X = (X1, . . . , Xn), thus the underlying joint distribution. Variational inference tackles this problem by transforming the computation of the cumulant function A(θ) into the following optimization problem:

A(θ) = sup_{µ∈M} {⟨θ, µ⟩ − A∗(µ)}, (4.2)

where the definitions of M and A∗(µ) will be given in the following section, which describes the transformation of the problem. The optimization problem itself also has to be solved, for which Mean Field approximation is often used; it only looks at a subset of M and is described in section 4.2.

Example 4.0.1. The density of the normal distribution with parameters (µ, σ²) can be written as:

p_{(µ,σ²)}(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)}
             = (1/√(2π)) exp{−x²/(2σ²) + (µ/σ²)x − µ²/(2σ²) − log σ}
             = (1/√(2π)) exp{⟨(−1/(2σ²), µ/σ²), (x², x)⟩ − µ²/(2σ²) − log σ}.

So by defining θ = (−1/(2σ²), µ/σ²)ᵀ and noting that µ²/(2σ²) + log σ = −θ2²/(4θ1) − (1/2) log(−2θ1) =: A(θ), the collection of all normal distributions is indeed an exponential family.
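Example 4.0.1 can be checked numerically. The sketch below (my own, using SciPy for the integral) maps (µ, σ²) to θ = (−1/(2σ²), µ/σ²), takes φ(x) = (x², x) with base density 1/√(2π), and compares the closed-form cumulant function with log ∫ exp⟨θ, φ(x)⟩ ν(dx); the parameter values are illustrative.

    import numpy as np
    from scipy.integrate import quad

    def natural_params(mu, sigma2):
        # (mu, sigma^2)  ->  theta = (-1 / (2 sigma^2), mu / sigma^2)
        return np.array([-1.0 / (2.0 * sigma2), mu / sigma2])

    def A(theta):
        # Closed-form cumulant function from example 4.0.1.
        t1, t2 = theta
        return -t2 ** 2 / (4.0 * t1) - 0.5 * np.log(-2.0 * t1)

    def A_numeric(theta):
        # log of the integral of exp<theta, phi(x)> against the base density 1/sqrt(2 pi).
        t1, t2 = theta
        integrand = lambda x: np.exp(t1 * x ** 2 + t2 * x) / np.sqrt(2.0 * np.pi)
        value, _ = quad(integrand, -np.inf, np.inf)
        return np.log(value)

    theta = natural_params(mu=1.2, sigma2=0.8)   # illustrative values
    print(A(theta), A_numeric(theta))            # agree up to quadrature error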

4.1. Transforming the cumulant function

The exponential family described is characterized by θ ∈ Ω, the vector of canonical parameters. However, in transforming the inference problem into an optimization problem an alternative parametrization is used, in terms of a vector of mean parameters or realizable expected sufficient statistics. These parameters arise as the expectations of φ under a distribution that is absolutely continuous with respect to the measure ν. The set M of realizable mean parameters is defined as:

M = {µ ∈ Rd | ∃p s.t. ∫ φ(x) p(x) ν(dx) = µ}.

Note that in this definition, p is only assumed to be absolutely continuous with respect to ν, not that p is an exponential family. Later on, when describing Mean Field approximation, the supremum in (4.2) will not be taken over M but over Mtract, which is a subset of M that only uses tractable distributions p.

As it turns out, there is a one-to-one mapping between the canonical parameters θ ∈ Ω and the interior of M. This fact, and the fact that this mapping is given by the gradient of the log partition function A, will now be proved. The results will be used to transform the inference problem of finding an expression for (4.1) into the optimization problem in (4.2).


Lemma 4.1.1. The cumulant function

A(θ) := log ∫_X exp{⟨θ, φ(x)⟩} ν(dx)

associated with any regular exponential family has the following properties:

(a) It has derivatives of all orders on its domain Ω. In particular, it holds that:

∂A/∂θα (θ) = Eθ[φα(X)] := ∫ φα(x) pθ(x) ν(dx), (4.3)

∂²A/(∂θα ∂θβ) (θ) = Eθ[φα(X)φβ(X)] − Eθ[φα(X)]Eθ[φβ(X)]. (4.4)

(b) A is a convex function of θ on its domain Ω, and strictly so if the representation is minimal: there exists no nonzero vector a ∈ Rd such that ⟨a, φ(x)⟩ is a constant.

Proof. By use of the dominated convergence theorem, it can be verified that differentiating through the integral of A is valid, which yields:

∂A/∂θα (θ) = ∂/∂θα log ∫_X exp{⟨θ, φ(x)⟩} ν(dx)
           = (∫_X (∂/∂θα) exp{⟨θ, φ(x)⟩} ν(dx)) / (∫_X exp{⟨θ, φ(x)⟩} ν(dx))
           = (∫_X φα(x) exp{⟨θ, φ(x)⟩} ν(dx)) / (∫_X exp{⟨θ, φ(x)⟩} ν(dx))
           = Eθ[φα(X)].

The second equation can be proven in the same manner, and higher-order derivatives also follow from the same procedure.

To prove the second statement, note that the second-order partial derivative in (4.4) is equal to Cov[φα(X), φβ(X)], implying that the full Hessian ∇²A(θ) is the covariance matrix of φ(X). A covariance matrix is always positive semidefinite, so ∇²A(θ) is positive semidefinite, which implies that A is convex [5]. Now let the representation be minimal, such that there are no nonzero a ∈ Rd and b ∈ R such that ⟨a, φ(x)⟩ = b holds ν-a.e. This yields Varθ[⟨a, φ(X)⟩] = aᵀ∇²A(θ)a > 0 for all nonzero a ∈ Rd and all θ ∈ Ω. So the Hessian is strictly positive definite and A is strictly convex [5].
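Equation (4.3) is easy to verify numerically for a concrete family. The sketch below (my own illustration, not part of the thesis) does so for the Bernoulli family, where φ(x) = x, A(θ) = log(1 + e^θ) and the mean parameter is e^θ/(1 + e^θ); the derivative of A is approximated by a central finite difference.

    import math

    def A(theta):
        # Cumulant function of the Bernoulli family p_theta(x) = exp(theta * x - A(theta)), x in {0, 1}.
        return math.log(1.0 + math.exp(theta))

    def mean_parameter(theta):
        # E_theta[phi(X)] = E_theta[X] = p_theta(X = 1).
        return math.exp(theta - A(theta))

    def dA(theta, h=1e-6):
        # Central finite-difference approximation of dA/dtheta.
        return (A(theta + h) - A(theta - h)) / (2.0 * h)

    theta = 0.7   # illustrative canonical parameter
    print(dA(theta), mean_parameter(theta))   # both approximately 1 / (1 + e^(-theta))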

From this lemma it can be seen that the range of ∇A is contained within the set of realizable mean parameters M. So ∇A(θ) can transform the problem with parameters θ ∈ Ω into a problem with parameters µ ∈ M if the function is bijective, giving a one-to-one correspondence between θ and µ. As it turns out, some extra conditions are needed to guarantee this bijectivity, starting with the need for the representation to be minimal to make ∇A injective:

Lemma 4.1.2. The gradient mapping ∇A : Ω → M is injective if and only if the exponential representation is minimal.

Proof. Assume the representation is not minimal, indicating the existence of a nonzero vector a ∈ Rd for which ⟨a, φ(x)⟩ is a constant ν-a.e. Let θ1 ∈ Ω and define θ2 = θ1 + ta, t ∈ R. Because Ω is open, t can be chosen such that θ2 ∈ Ω. Linearity in the first component of the inner product yields ⟨θ2, φ(x)⟩ = ⟨θ1, φ(x)⟩ + t⟨a, φ(x)⟩. Since a is defined such that the last term is a constant, pθ1 and pθ2 induce the same distribution (the only difference being the normalization constant), so that ∇A(θ1) = ∇A(θ2) and ∇A is not injective.

Conversely, assume the representation is minimal. Lemma 4.1.1 then states that A is strictly convex. This property, and the fact that A is differentiable, give A(θ2) > A(θ1) + ⟨∇A(θ1), θ2 − θ1⟩ for all θ1 ≠ θ2 in Ω. The same inequality can be derived with θ1 and θ2 reversed, and if these inequalities are added, the following inequality arises:

⟨∇A(θ1) − ∇A(θ2), θ1 − θ2⟩ > 0.

So if θ1 ≠ θ2 it follows that ∇A(θ1) ≠ ∇A(θ2), and thus ∇A is injective.

Surjectivity of ∇A can be guaranteed if the range is restricted to the interior of M, as stated in the following theorem:

Theorem 4.1.3. In a minimal exponential family, the gradient map ∇A is onto the interior M° of M. Consequently, for each µ ∈ M°, there exists some θ = θ(µ) ∈ Ω such that Eθ[φ(X)] = µ.

A proof of this theorem can be found in appendix C. The last lemma and this theorem together give that, for minimal exponential families, each such mean parameter µ is realized uniquely by a density in the exponential family. The definition of M allows p to be an arbitrary density, while a typical exponential family is only a strict subset of all possible densities. So there are other densities p which also realize µ, where p does not have to be a member of the exponential family. In Mean Field approximation, where not M but Mtract is used, one can choose an Mtract with only densities p that are exponential families, because these are tractable.

The property that makes the exponential family member pθ(µ) special is the fact that, among all distributions that realize µ, it has a connection with the maximum entropy principle, which will be discussed now by first introducing two new definitions: the Boltzmann-Shannon entropy and the Fenchel-Legendre conjugate.

Definition 4.1.4. The Boltzmann-Shannon entropy is defined as

H(pθ) = −∫_X pθ(x) log pθ(x) ν(dx),

where pθ is a density with respect to the measure ν.

Definition 4.1.5. Given a log partition function A, the conjugate dual function or Fenchel-Legendre conjugate of A, denoted by A∗, is defined as:

A∗(µ) = sup_{θ∈Ω} {⟨µ, θ⟩ − A(θ)},

with µ ∈ Rd a vector of variables of the same dimension as θ, called dual variables.

These definitions are combined in the following theorem, which also gives the relation between A and A∗, resulting in the optimization problem for A given in (4.2).


Theorem 4.1.6. (a) For any µ ∈ M°, let θ(µ) denote the unique canonical parameter corresponding to µ, so ∇A(θ(µ)) = µ. The Fenchel-Legendre dual of A has the following form:

A∗(µ) = −H(pθ(µ)(x)) if µ ∈ M°,
A∗(µ) = ∞ if µ ∉ M̄.

For any boundary point µ ∈ M̄ \ M° the dual becomes

A∗(µ) = lim_{n→∞} [−H(pθ(µn)(x))],

µn being a sequence in M° converging to µ.

(b) In terms of this dual, the log partition function has the following variational representation:

A(θ) = sup_{µ∈M} {⟨θ, µ⟩ − A∗(µ)}.

(c) For all θ ∈ Ω, the supremum in the variational representation of A(θ) is attained uniquely at the vector µ = Eθ[φ(X)]. This statement holds in both minimal and nonminimal representations.

Before discussing the consequences of this theorem, an example will be given for the exponential distribution, in which the expression for the cumulant function A of the exponential distribution is transformed into the optimization problem expressed in (4.2). This verifies the theorem for the exponential distribution. The proof of the theorem for all regular exponential families can be found in appendix C.

Example 4.1.7. Let θ ∈ (0,∞) and let X be exponentially distributed with parameter θ. Then, for x ∈ Val(X):

pθ(x) = θe^{−θx} = e^{−θx + log θ} = e^{⟨−θ, x⟩ + log θ}.

Thus for θ ∈ (−∞, 0) the distribution can be written as

pθ(x) = e^{⟨θ, x⟩ + log(−θ)}.

So if X is exponentially distributed with canonical parameter θ ∈ (−∞, 0), then it is a member of the exponential family with φ(x) = x and A(θ) = −log(−θ). The Fenchel-Legendre conjugate dual A∗(µ) is now defined as

A∗(µ) = sup_{θ∈(−∞,0)} {µθ + log(−θ)}.

Taking the derivative with respect to θ and equating it to 0 to find the optimum yields µ = −1/θ. This value for µ indeed satisfies ∇A(θ) = µ.

Because φ(x) = x, M = {µ : ∃p such that ∫ x p(x) ν(dx) = µ}. x ∈ Val(X) gives x ≥ 0, since otherwise x is not in the support of the exponential family. This, together with p(x) ≥ 0 for all x ∈ Val(X) and ∫ p(x) ν(dx) = 1, yields ∫ x p(x) ν(dx) > 0 for all such densities p. Thus M = (0, ∞) = M°. Now, for all µ ∈ M, A∗(µ) = −1 − log µ. This is indeed equal to the negative Boltzmann-Shannon entropy:

−H(pθ) = ∫_{[0,∞)} pθ(x) log pθ(x) ν(dx) = ∫_{[0,∞)} e^{θx + log(−θ)} (θx + log(−θ)) dx
        = E_{pθ}[θX + log(−θ)] = θ · (−1/θ) + log(−θ) = −1 − log((−θ)^{−1})
        = −1 − log(−1/θ) = −1 − log µ.

This verifies A∗(µ) = −H(pθ) for µ ∈ M. For µ ∉ M, so for µ < 0, µθ + log(−θ) increases without bound as θ → −∞. This yields A∗(µ) = ∞. Part (b) of the theorem now gives as value of A(θ):

A(θ) = sup_{µ>0} {µθ + 1 + log µ}.

It is easily proved by differentiating that this is equal to −log(−θ), and that the supremum is attained only once, namely at µ = −1/θ. This also verifies (c).
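The identities of this example can also be confirmed numerically; the sketch below (my own, using SciPy's bounded scalar optimizer over illustrative intervals) computes A∗(µ) and then recovers A(θ) from the variational representation.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def A(theta):
        # Cumulant function of the exponential family member, theta < 0.
        return -np.log(-theta)

    def A_star(mu):
        # A*(mu) = sup_{theta < 0} { mu * theta + log(-theta) }, computed numerically.
        res = minimize_scalar(lambda t: -(mu * t + np.log(-t)),
                              bounds=(-50.0, -1e-6), method="bounded")
        return -res.fun

    def A_variational(theta):
        # Recover A(theta) = sup_{mu > 0} { theta * mu - A*(mu) }.
        res = minimize_scalar(lambda m: -(theta * m - A_star(m)),
                              bounds=(1e-6, 50.0), method="bounded")
        return -res.fun

    theta = -2.0                                  # illustrative canonical parameter
    print(A(theta), A_variational(theta))         # both approximately -log(2)
    print(A_star(0.5), -1.0 - np.log(0.5))        # numerical vs closed form -1 - log(mu)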

From theorem 4.1.6 it can be concluded that A(θ) can be written as a variational problem with a unique solution, which was explicitly shown in example 4.1.7 for the exponential distribution. The first goal of variational inference was transforming the problem of computing the probability distribution in (4.1) into the problem of optimizing the cumulant function in (4.2), so that goal is now achieved. However, that is just the first part of variational inference, because the optimization problem still has to be solved. Solving the optimization problem is not necessarily an easy task. In many models either A∗(µ) has no explicit form or the complexity of M makes it hard to take the supremum. When this is the case and the value of A(θ) cannot be derived directly, all that is left is the next inequality, which gives a lower bound on A(θ) and follows directly from the definition of A(θ) as an optimization problem:

A(θ) ≥ ⟨θ, µ⟩ − A∗(µ). (4.5)

There are certain methods developed to maximize this lower bound on A(θ). One class of such methods is the Mean Field variational methods. These methods will be studied in the next section; as mentioned a few times already, they only look at Mtract ⊂ M, in which all distributions p are tractable.

4.2. Mean Field Approximation

The mean field approach deals with the difficulties of optimizing (4.2) by only looking at distributions for which M and A∗ can be characterized. Such distributions are called tractable. So Mtract is chosen such that both the computation of A∗(θ(µ)) (where θ(µ) is the value of θ corresponding to the mean parameter µ ∈ Mtract) and the maximization over Mtract can be done tractably. The mean field approach thus only looks for the best lower bound within the set Mtract. This lower bound is equal to

sup_{µ∈Mtract} {⟨µ, θ⟩ − A∗(θ(µ))},

and the value µ ∈ Mtract attaining this supremum is called the mean field approximation of the true mean parameters.

It turns out that there exists a close connection between maximizing the lower bound and the Kullback-Leibler divergence. The Kullback-Leibler divergence is defined as follows:

Definition 4.2.1. Let X ∈ X be a continuous random variable. Given two distributions with densities q and p with respect to a base measure ν, the KL (Kullback-Leibler) divergence is defined as

D(q||p) = ∫_X q(x) log [q(x)/p(x)] ν(dx).

The integral is replaced by a sum in the case of discrete variables.

To indicate the connection, for θ1, θ2 ∈ Ω let pθ1 and pθ2 be two probability distributions from some regular exponential family with statistics φ(X). Let µ1 and µ2 be the mean parameters corresponding to θ1 and θ2 respectively, meaning Eθi[φ(X)] = µi for i = 1, 2.

Computing the KL divergence now gives the following:

D(pθ1||pθ2) = Eθ1[log (pθ1(X)/pθ2(X))]
            = Eθ1[log e^{⟨θ1,φ(X)⟩−A(θ1)} − log e^{⟨θ2,φ(X)⟩−A(θ2)}]
            = Eθ1[⟨θ1, φ(X)⟩ − ⟨θ2, φ(X)⟩ + A(θ2) − A(θ1)]
            = A(θ2) − A(θ1) + Eθ1[⟨θ1, φ(X)⟩ − ⟨θ2, φ(X)⟩]
            = A(θ2) − A(θ1) + ⟨θ1, µ1⟩ − ⟨θ2, µ1⟩
            = A(θ2) − ⟨θ1, µ1⟩ + A∗(µ1) + ⟨θ1, µ1⟩ − ⟨θ2, µ1⟩
            = A(θ2) + A∗(µ1) − ⟨µ1, θ2⟩.

With this in mind, the connection is immediately clear, because maximizing the lower bound in (4.5), with µ ∈ Mtract, equals minimizing A(θ) + A∗(µ) − ⟨µ, θ⟩ = D(pθ′(µ)||pθ), where θ′(µ) is the canonical parameter corresponding to µ.

Conclude that mean field approximation maximizes the lower bound in (4.5) by taking a subset Mtract of M for which A∗(µ) can be computed and over which maximization is possible. This is equivalent to minimizing the Kullback-Leibler divergence between the true distribution and all tractable distributions.
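A quick numerical check of the identity D(pθ1||pθ2) = A(θ2) + A∗(µ1) − ⟨µ1, θ2⟩, for the exponential family of example 4.1.7 (my own sketch; the parameter values are illustrative):

    import numpy as np
    from scipy.integrate import quad

    def A(theta):          # cumulant function, theta < 0
        return -np.log(-theta)

    def A_star(mu):        # conjugate dual for mu > 0, derived in example 4.1.7
        return -1.0 - np.log(mu)

    def log_density(theta, x):
        # log p_theta(x) = theta * x + log(-theta) on [0, infinity).
        return theta * x + np.log(-theta)

    def kl_direct(theta1, theta2):
        # D(p_theta1 || p_theta2) by numerical integration of the definition.
        integrand = lambda x: np.exp(log_density(theta1, x)) * (log_density(theta1, x)
                                                                - log_density(theta2, x))
        value, _ = quad(integrand, 0.0, np.inf)
        return value

    def kl_dual(theta1, theta2):
        # D(p_theta1 || p_theta2) via A(theta2) + A*(mu1) - <mu1, theta2>.
        mu1 = -1.0 / theta1                        # mean parameter of p_theta1
        return A(theta2) + A_star(mu1) - mu1 * theta2

    theta1, theta2 = -2.0, -0.5                    # illustrative canonical parameters
    print(kl_direct(theta1, theta2), kl_dual(theta1, theta2))   # should agree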

The statistical setting was a graphical model G with random variables X = (X1, . . . , Xn), where the joint probability distribution has exponential family form. The problem was to find this joint probability distribution. Variational inference tackled the problem by transforming the computation of the cumulant function A(θ) into maximizing ⟨θ, µ⟩ − A∗(µ), and again transforming this problem into the minimization of the KL divergence between the joint probability distribution and all tractable distributions. The chapter will now be closed by analyzing the Variational Message Passing algorithm, which is an example of an algorithm using variational inference to find a distribution in a Bayesian network.


4.3. Variational Message Passing

There are certain variational algorithms that try to minimize the KL divergence. An example of such an algorithm is the Variational Message Passing (VMP) algorithm, which will be treated in this section and is used in Bayesian networks. The algorithm is described based on an article by Winn and Bishop [18].

The statistical setting in which VMP works is as follows. Let G = (V, E) be a Bayesian network, where the nodes index the random variables X1, . . . , Xn. The dependence structure of X1, . . . , Xn is given by the edges E. Because G is a Bayesian network, the joint distribution can be written as

p(X1, . . . , Xn) = ∏_{i=1}^{n} p(Xi|πXi). (4.6)

Assume that all factors in (4.6) are in exponential family form. In addition, assume that when Y is a parent of X and X is a parent of W, the functional form, with respect to X, of both p(X|Y) and p(W|X) is the same. For exponential families this implies that if p(X|Y) can be written as an exponential family with a vector of sufficient statistics given by φ(X), then so can p(W|X). Furthermore, assume that the values of some of the random variables X1, . . . , Xn are observed. Denote these observed variables as E = (E1, . . . , Em), m ≤ n. The values of the other random variables are not observed, and the collection of these random variables will be denoted as H = (H1, . . . , Hn−m); see figure 4.1 for an example. The goal in this graphical model is to compute the conditional distribution p(H|E).

[Figure 4.1: Bayesian network with observed random variables (grey) E1, E2 and unobserved random variables (white) H1, H2 and H3.]

VMP uses variational inference to find p(H|E). At first, using the results obtained for Mean Field approximation, VMP tries to find p(H|E) by minimizing the KL divergence between the true distribution p(H|E) and all tractable distributions q(H). To make q tractable, q is assumed to be factorised with respect to each Hi, so

q(H) = ∏i qi(Hi).

The VMP algorithm can now be derived by performing the following steps:

(i) Rewrite D(q||p) such that minimizing it equals minimizing D(qj||p) with respect to each j, for a certain distribution p.

(ii) Note that D(qj||p) attains its minimum when qj(Hj) is chosen equal to p.


(iii) Use (4.6) to rewrite qj(Hj), after which it becomes evident that qj(Hj) can be computed locally, as a sum of terms in each of which only the children or the parents of Hj are involved: namely, only terms p(Hj|πHj) and p(X|πX) where X is a child of Hj.

(iv) Use the fact that the functional form of p(Hj|πHj) and p(X|πX) has to be the same with respect to Hj, which implies they are both exponential families with sufficient statistics φj(Hj). Because these terms are the only terms present in the computation of qj(Hj), qj(Hj) can be written as an exponential family with statistics φj(Hj).

These steps are now performed in the rest of the section to find a distribution for p(H|E) and to arrive at the VMP algorithm.

To minimize D(q||p), note that it can be rewritten as follows (when X1, . . . , Xn are discrete, replace the integrals by sums):

D(q||p) = ∫H q(H) log [q(H)/p(H|E)] ν(dH)

= ∫H q(H) log [q(H)/p(H|E)] ν(dH) + ∫H q(H) log [p(H,E)/q(H)] ν(dH) − ∫H q(H) log [p(H,E)/q(H)] ν(dH)

= ∫H q(H) log p(E) ν(dH) − ∫H q(H) log [p(H,E)/q(H)] ν(dH) = log p(E) − L(q)

where

L(q) := ∫H q(H) log [p(H,E)/q(H)] ν(dH). (4.7)

Because E consists of all nodes that are already observed, log p(E) is a constant. This implies that minimizing the KL divergence D(q||p) equals maximizing L(q). Now, using the factorised form of q, it can be seen that

L(q) = ∫H q(H) log [p(H,E)/q(H)] ν(dH)

= ∫H ∏i qi(Hi) [log p(H,E) − ∑i log qi(Hi)] ν(dH)

= ∫H ∏i qi(Hi) log p(H,E) ∏i ν(dHi) − ∑i ∫H ∏j qj(Hj) log qi(Hi) ν(dHi)

= ∫H qj(Hj) [∫ log p(H,E) ∏i≠j qi(Hi) ν(dHi)] ν(dHj) − ∫H qj(Hj) log qj(Hj) ν(dHj) − ∑i≠j ∫H qi(Hi) log qi(Hi) ν(dHi).


Now denote the expectation of log p(H,E) with respect to all factors qi, i ≠ j, as

〈log p(H,E)〉i≠j := ∫H log p(H,E) ∏i≠j (qi(Hi) ν(dHi)) (4.8)

to rewrite L(q) as

L(q) = ∫H qj(Hj) 〈log p(H,E)〉i≠j ν(dHj) − ∫H qj(Hj) log qj(Hj) ν(dHj) − ∑i≠j ∫H qi(Hi) log qi(Hi) ν(dHi)

= −D(qj(Hj)||p) − ∑i≠j ∫H qi(Hi) log qi(Hi) ν(dHi).

Conclude that D(q||p) is minimized when L(q) is maximized, and this happens when D(qj||p) is minimized with respect to each j. The minimum of D(qj||p) is attained if and only if qj is chosen equal to p. So the optimal expression q∗j(Hj) is:

log q∗j(Hj) := 〈log p(H,E)〉i≠j + c, c ∈ R (4.9)

where c can be found by normalizing. As can be seen, each qj(Hj) depends on the factors qi, i ≠ j. So, choosing an initial distribution, one can update each factor qj by cycling through the factors. Now, substituting (4.6) in (4.9) gives

log q∗j(Hj) = ∑i 〈log p(Xi|πXi)〉i≠j + c.

Terms in the sum over i that are independent of Hj will be constant and finite, thus:

log q∗j(Hj) = 〈log p(Hj|πHj)〉i≠j + ∑k:Hj∈πXk 〈log p(Xk|πXk)〉i≠j + c′.

The final expression for log q∗j(Hj) shows that all that is needed to compute q∗j(Hj) is 〈log p(Hj|πHj)〉i≠j, depending only on the parents of Hj, and ∑k:Hj∈πXk 〈log p(Xk|πXk)〉i≠j, depending only on the children of Hj and their parents. The first term can therefore be viewed as a message from the parents, while the terms in the sum can be viewed as messages from its children. So an update for qj(Hj) can be computed locally, involving only messages from the children and parents of Hj.
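As a minimal sketch of this cycling scheme, the following code uses a toy chain H1 → H2 → E with made-up conditional probability tables (the model and all numbers are assumptions for illustration, not taken from the thesis). Each factor is updated from the expected log-probabilities of its local terms, as in (4.9), and the result is compared with the exact posterior marginals.

import numpy as np

# Toy CPTs (assumed for illustration only)
p_h1 = np.array([0.6, 0.4])                      # p(H1)
p_h2_given_h1 = np.array([[0.7, 0.3],            # p(H2 | H1=0)
                          [0.2, 0.8]])           # p(H2 | H1=1)
p_e_given_h2 = np.array([[0.9, 0.1],             # p(E | H2=0)
                         [0.3, 0.7]])            # p(E | H2=1)
e_obs = 1                                        # observed value of E

def normalize_log(log_q):
    # the constant c of (4.9) is fixed by normalization
    q = np.exp(log_q - log_q.max())
    return q / q.sum()

q1 = np.array([0.5, 0.5])                        # initial factor q1(H1)
q2 = np.array([0.5, 0.5])                        # initial factor q2(H2)

for _ in range(50):
    # update q1: its prior term plus the expected log-CPT of its child H2
    q1 = normalize_log(np.log(p_h1) + np.log(p_h2_given_h1) @ q2)
    # update q2: expected log-CPT from its parent H1 plus the observed child E
    q2 = normalize_log(np.log(p_h2_given_h1).T @ q1 + np.log(p_e_given_h2[:, e_obs]))

# exact posterior marginals for comparison
joint = p_h1[:, None] * p_h2_given_h1 * p_e_given_h2[:, e_obs][None, :]
joint /= joint.sum()
print("q1(H1) =", q1, " exact:", joint.sum(axis=1))
print("q2(H2) =", q2, " exact:", joint.sum(axis=0))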

Assume that, while cycling through the factors, the term qj(Hj) needs to be optimized. Let X be a child of Hj and define the co-parents π̄X := πX \ {Hj}. Then, by the fact that the factors in (4.6) are exponential families:

p(Hj|πHj) = hj(Hj) exp{〈ηj(πHj), φj(Hj)〉 − Aj(πHj)} (4.10)

p(X|Hj, π̄X) = hX(X) exp{〈ηX(Hj, π̄X), φX(X)〉 − AX(Hj, π̄X)} (4.11)


where ηj, ηX, hj, hX are functions, φj, φX sufficient and complete statistics and Aj, AX cumulant functions. One of the assumptions was that both expressions have the same functional form with respect to Hj. So both expressions can be written as exponential families with sufficient statistics φj(Hj). Thus functions A′ and ηj,X can be defined such that (4.11) can be rewritten as:

p(X|Hj, π̄X) = exp{〈ηj,X(X, π̄X), φj(Hj)〉 − A′(X, π̄X)} (4.12)

Using this, (4.9) now becomes:

log q∗j(Hj) = 〈log hj(Hj) + 〈ηj(πHj), φj(Hj)〉 − Aj(πHj)〉i≠j + ∑k:Hj∈πXk 〈〈ηj,Xk(Xk, π̄Xk), φj(Hj)〉 − A′(Xk, π̄Xk)〉i≠j + c′

= 〈〈ηj(πHj)〉i≠j + ∑k:Hj∈πXk 〈ηj,Xk(Xk, π̄Xk)〉i≠j, φj(Hj)〉 + log hj(Hj) + c′′

It can now be concluded that q∗j is also a member of the exponential family, with parameter vector

η∗j = 〈ηj(πHj)〉i≠j + ∑k:Hj∈πXk 〈ηj,Xk(Xk, π̄Xk)〉i≠j. (4.13)

The first term can be seen as a message from the parents of Hj, the second as a message from the children and co-parents of Hj. These messages together form η∗j, which is the parameter vector of qj(Hj) and thus completely fixes its distribution.

It can be concluded that the VMP algorithm works as follows. It finds a tractable distribution q that minimizes the KL divergence between all tractable distributions and the true distribution p, and this q is chosen as the optimal approximation of p. The initial distribution q is given by

q(H) = ∏i qi(Hi).

The algorithm cycles through the factors and updates each of them, where each factor qj(Hj) is an exponential family with parameter vector given by (4.13).
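The following sketch illustrates (4.13) for a fully conjugate toy model: a Gaussian mean with known observation variance (all names and numbers are assumptions for illustration, not the thesis's example). Because every child is observed and has no co-parents, the messages contain no expectations, and a single pass of summed natural-parameter messages already reproduces the exact posterior; in general the messages of (4.13) involve expectations under the other factors and the updates must be iterated.

import numpy as np

# Assumed toy model: mu ~ N(m0, v0), x_i | mu ~ N(mu, v) with v known.
# With sufficient statistics phi(mu) = (mu, mu^2), every incoming message is a
# natural-parameter vector and eta*_j in (4.13) is simply their sum.
m0, v0 = 0.0, 10.0                                   # prior on mu
v = 2.0                                              # known observation variance
rng = np.random.default_rng(0)
x = rng.normal(1.5, np.sqrt(v), size=20)             # synthetic observations

msg_prior = np.array([m0 / v0, -1.0 / (2.0 * v0)])   # message from the parent side
msg_children = np.stack([np.array([xi / v, -1.0 / (2.0 * v)]) for xi in x])

eta_star = msg_prior + msg_children.sum(axis=0)      # equation (4.13)

post_var = -1.0 / (2.0 * eta_star[1])                # back to mean/variance of q(mu)
post_mean = eta_star[0] * post_var

exact_var = 1.0 / (1.0 / v0 + len(x) / v)            # exact conjugate posterior
exact_mean = exact_var * (m0 / v0 + x.sum() / v)
print("VMP  :", post_mean, post_var)
print("exact:", exact_mean, exact_var)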


5. Applications

In this chapter, two applications of variational inference are shown. The first is the approximation of the posterior, which is done by mean field approximation. The second application is called variational Bayes and gives an example of how mean field approximation can be used in Bayesian statistics, in the form of computing the so-called marginal likelihood.

5.1. Approximation of the posterior

The first application lies in the approximation of the posterior and follows an example from David Blei [1].

Let G be a graphical model in which the values of the random variables X = (X1, . . . , Xn) are observed and the values of the remaining variables Y = (Y1, . . . , Ym) are unobserved. The dependence structure of the random variables in G is defined by the edges of G. Let θ be a hyperparameter. Then p(Y|X, θ) is the posterior distribution, given by

p(Y|X, θ) = p(Y,X|θ) / p(X|θ). (5.1)

Assume that this posterior distribution is from an exponential family. The goal in this statistical setting is now to approximate this posterior distribution p(Y|X, θ) using Mean Field approximation methods. As it turns out, the cumulant function A(θ,X) is in this case equal to log p(X|θ). With the results of the last chapter this means that, for µ ∈ Mtract, the set of realizable and tractable mean parameters, it is the case that log p(X|θ) ≥ 〈θ, µ〉 − A∗(µ). So the goal is to maximize this lower bound and in that way lower-bound the posterior distribution.

First, notice that (5.1) can be rewritten as

log p(Y |X, θ) = log p(Y,X|θ)− log p(X|θ)

meaning that in this case A(θ,X) = log p(X|θ) and 〈ξ(X, θ), φ(Y)〉 = log p(Y,X|θ), where ξ is a function of X and θ. The lower bound on A(X, θ) given by inequality (4.5) then becomes, with µ ∈ M the mean parameter corresponding to the parameter θ and H the Boltzmann-Shannon entropy:

log p(X|θ) ≥ 〈ξ(X, θ), µ〉 − A∗(µ) = Eθ[log p(Y,X|θ)] + H(p(Y|θ))
= Eθ[log p(Y,X|θ)] − Eθ[log p(Y|θ)]


Now Mean Field approximation will be used, so µ will be chosen as µ ∈ Mtract for some Mtract. For this µ, p(Y|θ) needs to be tractable. Tractable subfamilies of exponential family distributions can be constructed by considering factorized families. In such a factorized family each factor is an exponential family distribution which depends on what is called a variational parameter. So, writing p(Y|θ) = q(Y), consider q(Y|η) = ∏i q(Yi|ηi), with η = (η1, . . . , ηm) the variational parameters. For all i, q(Yi|ηi) is in the exponential family and can be written as

q(Yi|ηi) = h(Yi) exp{〈ηi, Yi〉 − A(ηi)} (5.2)

with respect to the Lebesgue measure. So Mtract is here chosen such that the tractable distributions in this set can be written as (5.2).

Using the factorized form of q, the resulting lower bound for log p(X|θ) becomes:

log p(X|θ) ≥ Eθ[log p(Y1, . . . , Ym|X, θ)p(X|θ)] − ∑i Eθ[log q(Yi|ηi)]

= Eθ[log p(Y1, . . . , Ym|X, θ)] + Eθ[log p(X|θ)] − ∑i Eθ[log q(Yi|ηi)]

= ∑i Eθ[log p(Yi|Y1, . . . , Yi−1, X, θ)] + log p(X|θ) − ∑i Eθ[log q(Yi|ηi)]

Now a value of ηi can be found that optimizes this lower bound with respect to ηi. To do this, reorder Y such that Yi is the last variable and let Y−i be Y \ {Yi}. Note that the part that depends on ηi is then given by:

ℓi = Eθ[log p(Yi|Y−i, X, θ)] − Eθ[log q(Yi|ηi)].

Now, using (5.2), this is equal to

ℓi = Eθ[log p(Yi|Y−i, X, θ)] − Eθ[log h(Yi)] − Eθ[〈Yi, ηi〉 − A(ηi)]

= Eθ[log p(Yi|Y−i, X, θ)] − Eθ[log h(Yi)] − ηiᵀA′(ηi) + A(ηi),

where the last step follows because A′(ηi) = Eθ[Yi]:

A′(ηi) = [∫ Yi h(Yi) exp〈ηi, Yi〉 dYi] / [∫ h(Yi) exp〈ηi, Yi〉 dYi] = ∫ Yi h(Yi) exp{〈ηi, Yi〉 − A(ηi)} dYi = ∫ Yi q(Yi|ηi) dYi = Eθ[Yi].

Using

∂/∂ηi (ηiᵀA′(ηi)) = ηiᵀA′′(ηi) + A′(ηi),

differentiating ℓi with respect to ηi gives

∂/∂ηi ℓi = ∂/∂ηi (Eθ[log p(Yi|Y−i, X, θ)] − Eθ[log h(Yi)]) − ηiᵀA′′(ηi).


This results in an optimal value of ηi equal to

ηi = [A′′(ηi)]−1 ( ∂/∂ηi Eθ[log p(Yi|Y−i, X, θ)] − ∂/∂ηi Eθ[log h(Yi)] ). (5.3)

So one can update η1, . . . , ηm iteratively, finding the value of η that optimizes the lower bound on log p(X|θ).

So the problem of lower-bounding the posterior distribution p(Y|X, θ) can, using variational inference, be transformed into the problem of maximizing

E[log p(Y,X|θ)] − E[log p(Y|θ)].

When p(Y|θ) is chosen tractable, it can be chosen as a factorized family with variational parameters η. One then optimizes with respect to these parameters η, which are optimal when given by (5.3).
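As a minimal sketch of such coordinate updates of variational parameters (a toy setting assumed for illustration, not the thesis's general formula (5.3)): take the posterior to be a correlated bivariate Gaussian and the factors q(Yi|ηi) to be univariate Gaussians; each update plugs the other factor's current mean into the exact conditional.

import numpy as np

# Assumed toy posterior: a bivariate Gaussian with mean mu and precision Lambda.
mu = np.array([1.0, -1.0])
Lambda = np.array([[2.0, 1.2],
                   [1.2, 2.0]])

m1, m2 = 0.0, 0.0                                   # variational means (initial)
s1, s2 = 1.0 / Lambda[0, 0], 1.0 / Lambda[1, 1]     # variational variances (fixed points)

for _ in range(30):
    # each factor's mean is set to the conditional mean given the other factor's mean
    m1 = mu[0] - (Lambda[0, 1] / Lambda[0, 0]) * (m2 - mu[1])
    m2 = mu[1] - (Lambda[1, 0] / Lambda[1, 1]) * (m1 - mu[0])

print("variational means    :", m1, m2, "(true means:", mu, ")")
print("variational variances:", s1, s2,
      "(true marginal variances:", np.diag(np.linalg.inv(Lambda)), ")")

The factor means converge to the true posterior means, while the factor variances equal the conditional variances and therefore underestimate the true marginal variances, a well-known property of mean field approximations.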

5.2. Variational Bayes

Variational inference also finds applications in Bayesian statistics. An example of how graphical models and Bayesian statistics relate was already given in example 2.1.6. It can now be analyzed how variational inference works in graphical models that describe a problem in Bayesian statistics. In particular, the application of mean field variational methods to Bayesian inference can be analyzed; this is often called variational Bayes. Following Wainwright and Jordan [17], this application of variational inference will be studied in this section.

Again, consider a graphical model with X the observed variables and Y the unobserved variables, having parameter θ. Assume that p(X, Y|θ) is a member of the exponential family, meaning

p(X, Y|θ) = exp{〈η(θ), φ(X, Y)〉 − A(η(θ))},

where the function η : Rd → Rd is just a reparametrization of θ. Furthermore, assume that the prior over Ω is also a member of the exponential family and has, for α ∈ Rd and β ∈ R, the following form:

pα,β(θ) = exp{〈α, η(θ)〉 − βA(η(θ)) − B(α, β)},

which is called the conjugate prior form. This means that, given the likelihood function, the posterior will have the same functional form as the prior. More formally, the prior is an element of the conjugate family of p, defined as:

Definition 5.2.1. Let (P, A) be a measurable model for an observation Y ∈ Y. Let M denote a collection of probability distributions on (P, A). The set M is called a conjugate family for the model P if the posterior based on a prior from M again lies in M [8].

B(α, β) is a normalization constant, defined as:

B(α, β) = log ∫ exp{〈α, η(θ)〉 − βA(η(θ))} dθ.


We choose this form of the likelihood and the prior because many models are specified in this way. A central problem in Bayesian statistics is now to compute

pᾱ,β̄(x) := ∫ p(x|θ) pᾱ,β̄(θ) dθ,

which is called the marginal likelihood, with x an observed value of X and ᾱ, β̄ fixed values of the hyperparameters α and β. To see how variational Bayes works, the problem of computing the marginal likelihood will be tackled. Mean Field approximation will be used to approximate this marginal likelihood. A graphical model for this problem is shown in figure 5.1.

Figure 5.1.: Observed variables X, unobserved variables Y, parameter θ and hyperparameters ᾱ, β̄.

To tackle the problem, Jensen's inequality is first used to derive the following lower bound, where pα,β, with free parameters (α, β), is an arbitrary member of the conjugate family defined above:

log pᾱ,β̄(x) = log ∫ p(x|θ) pᾱ,β̄(θ) dθ

= log ∫ pα,β(θ) [pᾱ,β̄(θ)/pα,β(θ)] p(x|θ) dθ

≥ ∫ pα,β(θ) log [pᾱ,β̄(θ)/pα,β(θ)] dθ + ∫ pα,β(θ) log p(x|θ) dθ

= Epα,β[log (pᾱ,β̄(θ)/pα,β(θ))] + Epα,β[log p(x|θ)].

Now, define

Ax(η(θ)) := log ∫Y exp〈η(θ), φ(x, y)〉 ν(dy)

and note that

log p(x|θ) = log ∫Y exp{〈η(θ), φ(x, y)〉 − A(η(θ))} ν(dy) = Ax(η(θ)) − A(η(θ)). (5.4)

For every observed x, the set of mean parameters Mx can be defined, where µ ∈ Mx if µ = E[φ(x, Y)] for some distribution p. Let µ(θ) be the mean parameter corresponding to θ; then the following lower bound holds as a consequence of theorem 4.1.6:

Ax(η(θ)) ≥ 〈µ(θ), η(θ)〉 − A∗x(µ(θ)).


This, together with (5.4), results in:

log p(x|θ) ≥ 〈µ(θ), η(θ)〉 − A∗x(µ(θ)) − A(η(θ))

and thus

log pᾱ,β̄(x) ≥ Epα,β[log (pᾱ,β̄(θ)/pα,β(θ))] + Epα,β[〈µ(θ), η(θ)〉 − A∗x(µ(θ)) − A(η(θ))].

Again, the problem is transformed into optimizing an inequality, this time with respect to µ ∈ Mx and (α, β); in the bound, µ(θ) may be replaced by a free µ ∈ Mx. So the term that needs to be optimized is

Epα,β[log (pᾱ,β̄(θ)/pα,β(θ))] + Epα,β[〈µ, η(θ)〉 − A∗x(µ) − A(η(θ))]

= Epα,β[〈ᾱ, η(θ)〉 − β̄A(η(θ)) − B(ᾱ, β̄) − 〈α, η(θ)〉 + βA(η(θ)) + B(α, β)] + 〈µ, Epα,β[η(θ)]〉 − A∗x(µ) − Epα,β[A(η(θ))]

= 〈ᾱ − α, η̄〉 + 〈Ā, β − β̄〉 − B(ᾱ, β̄) + B(α, β) + 〈µ, η̄〉 − A∗x(µ) − Ā,

where η̄ := Epα,β[η(θ)] and Ā := Epα,β[A(η(θ))]. Writing out definitions gives the following equality:

B∗(η̄, Ā) = 〈η̄, α〉 + 〈−Ā, β〉 − B(α, β). (5.5)

Now, using (5.5), the term that needs to be optimized can be transformed in the following way:

〈ᾱ − α, η̄〉 + 〈Ā, β − β̄〉 − B(ᾱ, β̄) + B(α, β) + 〈µ, η̄〉 − A∗x(µ) − Ā
= 〈ᾱ, η̄〉 − 〈α, η̄〉 − 〈−Ā, β〉 − 〈Ā, β̄〉 − B(ᾱ, β̄) + B(α, β) + 〈µ, η̄〉 − A∗x(µ) − Ā
= 〈µ + ᾱ, η̄〉 − A∗x(µ) − B∗(η̄, Ā) + 〈β̄ + 1, −Ā〉 − B(ᾱ, β̄).

This term needs to be optimized with respect to (µ, η̄, Ā), where optimizing over η̄ and Ā naturally corresponds to optimizing over α and β. Because the optimal value of µ depends on the optimal values of η̄ and Ā, and the other way around, the quantities need to be updated iteratively, where of course

µ(n) = arg maxµ∈Mx {〈µ, η̄(n−1)〉 − A∗x(µ)} (5.6)

and

(η̄(n), Ā(n)) = arg max(η̄,Ā) {〈µ(n) + ᾱ, η̄〉 − (1 + β̄)Ā − B∗(η̄, Ā)}. (5.7)

By differentiating the term in (5.6) it can easily be checked that the optimal value of µ(n) is given by

µ(n) = ∫Y φ(x, y) p(y|x, η̄(n−1)) dy. (5.8)


In the same way, (α, β) is updated as

(α(n), β(n)) = (ᾱ + µ(n), β̄ + 1),

which results in

η̄(n) = ∫ η(θ) pα(n),β(n)(θ) dθ. (5.9)
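A minimal sketch of this alternation, under assumptions that are not the thesis's (a fully conjugate linear-Gaussian toy model: θ ∼ N(m0, v0), latent y | θ ∼ N(θ, 1), observed x | y ∼ N(y, 1)): one step recomputes the expected sufficient statistic of the latent variable under the current q(θ), as in (5.6)/(5.8), and the other performs the conjugate update of q(θ), as in (5.7)/(5.9).

import numpy as np

# Assumed toy model and numbers, for illustration only.
m0, v0 = 0.0, 4.0          # prior hyperparameters (the role of alpha-bar, beta-bar)
x = 2.5                    # observed value

E_theta = m0               # initial expectation under q(theta)
for _ in range(50):
    # "E-like" step, cf. (5.6)/(5.8): expected latent value under q(y)
    E_y = (E_theta + x) / 2.0            # q(y) = N((E_theta + x)/2, 1/2)
    # "M-like" step, cf. (5.7)/(5.9): conjugate update of q(theta)
    prec = 1.0 / v0 + 1.0
    E_theta = (m0 / v0 + E_y) / prec     # q(theta) = N(E_theta, 1/prec)

# exact posterior of theta given x (x = theta + noise with variance 2), for comparison
exact_prec = 1.0 / v0 + 0.5
exact_mean = (m0 / v0 + x / 2.0) / exact_prec
print("VB posterior mean of theta :", E_theta)
print("exact posterior mean       :", exact_mean)
print("VB variance", 1.0 / prec, "vs exact", 1.0 / exact_prec, "(VB underestimates)")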


6. Conclusion and Discussion

Statistical inference in graphical models can be performed by using exact algorithms. However, there are graphical models in which even the best exact algorithms require exponential time to compute distributions (assuming P ≠ NP). Due to the large amount of available data, graphical models are getting larger, meaning that executing exact algorithms for calculations is too time consuming. That is why it is desirable to look at the alternative: approximation methods.

The approximating method analyzed in this thesis is called variational inference. Let G be a graphical model for which the factors that together form the joint probability distribution are exponential families. The joint probability distribution is then also an exponential family. To approximate this distribution one can rewrite the cumulant function A(θ), which is the normalizing factor, as

A(θ) = supµ∈M {〈θ, µ〉 − A∗(µ)},

where M is the set of realizable mean parameters

M = {µ ∈ Rd | ∃q s.t. ∫ φ(x)q(x)ν(dx) = µ}

and A∗(µ) = −H(pθ(µ)(x)), the negative Boltzmann-Shannon entropy of the exponential family member with parameter θ(µ). This is the parameter θ for which µ = Eθ[φ(X)]. As it turns out, the complexity of M often makes it difficult to take the supremum, and A∗(µ) does not always have an explicit form. Mean Field approximation deals with these two difficulties by reducing M to the set Mtract, which only considers tractable distributions q. The value of A(θ) can then be approximated by maximizing the lower bound 〈θ, µ〉 − A∗(µ). This is equal to minimizing the Kullback-Leibler divergence between the true distribution p and all tractable distributions q, defined as:

D(q||p) = ∫X q(x) [log (q(x)/p(x))] ν(dx).

So variational inference transforms the problem of computing the distribution into minimizing the KL divergence. For a Bayesian network this KL divergence can be minimized by using the Variational Message Passing algorithm, which initializes the tractable distribution q and then cycles through the factors of q, updating each factor iteratively.

Statistical applications of variational inference can be found in Bayesian statistics, in the approximation of the posterior and of the marginal likelihood.


The main problem with variational inference is that there are no results yet (to our knowledge) proving anything about the accuracy or the rate of convergence. So it is only known that for a lot of graphical models variational inference works fast and approximates quite well, but it is not yet proved that it can perform the computations in polynomial time. The first steps in the analysis of the convergence have, however, already been taken (see Tatikonda and Jordan [14]). A proposal for further research is to look at the convergence of variational methods.

In this thesis the focus was on graphical models in which the joint probability distributions were exponential families. Further research could be done by looking at variational inference in graphical models without exponential family distributions.

Furthermore, proposals have been made for combining other exact or approximating methods with variational inference. So further research can also be done on combining sampling methods with variational inference, or exact methods with variational inference. In addition, multiple methods of variational inference can be combined.


7. Popular summary (translated from Dutch)

A graph consists of a collection of points, which we also call nodes, and a number of lines connecting some of these points. A graphical model likewise consists of a collection of nodes and lines. The nodes, however, represent stochastic variables, i.e. random variables. These are variables whose value depends on the outcome of a chance experiment. An example is the random variable X that takes as its value the number of pips shown by a die after a single throw. Throwing a die is a chance experiment and X is a variable whose value depends entirely on the outcome of that experiment. The lines in a graphical model represent the dependence between the random variables in the model. If two random variables X and Y are not connected at all (see figure 7.1), then X and Y are independent. This means that knowing the value of X cannot influence what the value of Y will be. However, if a path runs from X to Y, the random variables are not independent and the value of X influences that of Y and vice versa.

The goal in graphical models is to determine probabilities within the model, in particular the underlying probability distribution of the model. This is the multivariate distribution of all random variables in the model, which for a model with random variables X1, . . . , Xn is given by p(X1 = x1, X2 = x2, . . . , Xn = xn), where x1 is a value that X1 can take, x2 a value that X2 can take, and so on. There are exact ways to compute such probabilities in graphical models. It turns out, however, that in the best case this takes exponential time. This means that there exist graphical models for which, each time a node is added, the time needed to compute the probability is multiplied by some factor. In large graphical models, computing probabilities exactly can therefore be time consuming. Because many large graphical models occur in physics and artificial intelligence, it is desirable to also look at methods that approximate probabilities and may be less time consuming.

There are graphical models for which it is known that the underlying distribution is a so-called exponential family, which depends on a parameter θ. The distribution function contains the term A(θ), which represents the normalization constant and ensures that the total probability is 1, as it should be for a distribution.

Figure 7.1.: Two graphical models, both with random variables W, X, Y, Z. In the model on the left, X and Y are not connected in any way and hence independent. In the model on the right a path runs from X to Y via Z and W, so they are not independent.


It turns out that the time-consuming part of computing the distribution function is the computation of this normalization constant. One of the ways to approximate the underlying probability in a graphical model therefore transforms the problem of computing A(θ) into an optimization problem, in which A(θ) is approximated by finding the optimal value of a certain term. The optimization is carried out over all distributions q. So a distribution q is sought that optimizes this term and thereby approximates A(θ) best. This way of determining the underlying probability is called variational inference.

Finding the optimal value so that A(θ) is approximated as well as possible turns out not to be easy either. The Mean Field method simplifies the problem by optimizing not over all distributions q, but only over a part of them. It turns out that approximating A(θ) then corresponds to minimizing the Kullback-Leibler divergence between the true underlying distribution p and the distributions q still under consideration. For discrete distributions this Kullback-Leibler divergence is given by

D(q||p) = ∑x1,...,xn q(x1, . . . , xn) [log (q(x1, . . . , xn)/p(x1, . . . , xn))].

Graphical models are thus graphs in which the nodes represent random variables and the lines between the nodes model the dependence between those random variables. Determining the underlying probability in the model in an exact way turns out to be too time consuming for some graphs. Variational inference in graphical models whose underlying distribution is an exponential family transforms the problem of computing the normalization constant A(θ) into the optimization of a certain term over all distributions q. For a part of all distributions q this turns out to be the same as minimizing the Kullback-Leibler divergence between those distributions and the true underlying probability distribution.


Bibliography

[1] Blei, D.M. (2004). Probabilistic Models of Text and Images. University of California, Berkeley.

[2] Cheung, S. (2008). Proof of Hammersley-Clifford Theorem. University of Kentucky, Kentucky.

[3] Cook, S.A. (2006). "P versus NP problem". In The Millennium Prize Problems. Clay Mathematics Institute, Cambridge; American Mathematical Society, Providence.

[4] Hammersley, J.M., Clifford, P. (1971). Markov Fields on Finite Graphs and Lattices. University of California, Berkeley; University of Oxford, Oxford.

[5] Hiriart-Urruty, J.-B., Lemaréchal, C. (1993). Convex Analysis and Minimization Algorithms, Vol. 1. Springer-Verlag, New York.

[6] Jordan, M.I. "Graphical Models". Statistical Science 19, 1 (2004): 140-155.

[7] Jordan, M.I., Ghahramani, Z., Jaakkola, T., Saul, L. "An Introduction to Variational Methods for Graphical Models". Machine Learning 37 (1999): 183-233.

[8] Kleijn, B.J.K. (2015). Bayesian Statistics, Lecture Notes 2015. University of Amsterdam, Amsterdam.

[9] Koller, D., Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press, Cambridge.

[10] Koller, D., Friedman, N., Getoor, L., Taskar, B. (2007). "Graphical Models in a Nutshell". In Introduction to Statistical Relational Learning. The MIT Press, Cambridge.

[11] Pollard, D. (2004). Hammersley-Clifford Theorem for Markov Random Fields. Yale University, New Haven.

[12] Precup, D. (2008). "Lecture 3: Conditional independence and graph structure" [Lecture slides]. Retrieved from http://www.cs.mcgill.ca/~dprecup/courses/Prob/Lectures/prob-lecture03.pdf

[13] Rockafellar, R.T. (1970). Convex Analysis. Princeton University Press, Princeton.

[14] Tatikonda, S., Jordan, M.I. (2002). "Loopy belief propagation and Gibbs measures". In Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Mateo.

[15] Valiant, L.G. "The Complexity of Computing the Permanent". Theoretical Computer Science 8 (1979): 189-201.

[16] Wainwright, M.J., Jordan, M.I. (2003). "Variational inference in graphical models: The view from the marginal polytope". Allerton Conference on Control, Communication and Computing, October 1-3, Urbana-Champaign.

[17] Wainwright, M.J., Jordan, M.I. "Graphical Models, Exponential Families, and Variational Inference". Foundations and Trends in Machine Learning 1, 1-2 (2008): 1-305.

[18] Winn, J., Bishop, C.M. "Variational Message Passing". Journal of Machine Learning Research 6 (2005): 661-694.


A. Proofs, chapter 2

Theorem 2.1.5. Let G = (V, E) be a Bayesian network over a set of random variables {Xv : v ∈ V} and let p be a joint distribution over the space in which {Xv : v ∈ V} are defined. All the local Markov assumptions associated with G hold in p if and only if p factorizes according to G. [10]

Proof. Let G = (V, E), V = {1, . . . , n}, be a Bayesian network over the random variables {Xv : v ∈ V} and let p be a joint distribution over the same space. Assume all local Markov properties associated with G hold in p. Without loss of generality it can be assumed that X1, . . . , Xn have a topological ordering relative to G, meaning that if Xi → Xj ∈ E then i < j. By induction on n and use of the fact that for random variables Y1, . . . , Yn

p(Y1, . . . , Yn) = p(Yi|Y1, . . . , Yi−1, Yi+1, . . . , Yn) p(Y1, . . . , Yi−1, Yi+1, . . . , Yn)

holds, the joint probability function of X1, . . . , Xn can be written as

p(X1, . . . , Xn) = ∏i=1,...,n p(Xi|X1, . . . , Xi−1).

For i ∈ {1, . . . , n}, consider p(Xi|X1, . . . , Xi−1). Due to the topological ordering, j < i holds for a parent Xj of Xi, such that πXi ⊆ {X1, . . . , Xi−1}. In the same way, k > i holds for a descendant Xk of Xi, such that none of the descendants of Xi is contained in the set {X1, . . . , Xi−1}. Now the following notation can be used: {X1, . . . , Xi−1} = πXi ∪ Z, where Z is the collection of those random variables among X1, . . . , Xi−1 which are neither a parent nor a descendant of Xi. Then, using the local Markov assumptions:

p(Xi|X1, . . . , Xi−1) = p(Xi|πXi ∪ Z) = p(Xi, Z|πXi) / p(Z|πXi) = p(Xi|πXi) p(Z|πXi) / p(Z|πXi) = p(Xi|πXi).

This holds for all 1 ≤ i ≤ n, which leads to

p(X1, . . . , Xn) = ∏i=1,...,n p(Xi|πXi)

and thus the conclusion that p factorizes over G.


Now assume p factorizes over G, so assume p(X1, . . . , Xn) = ∏i=1,...,n p(Xi|πXi). Denote by DXi the descendants of Xi and by NXi the nondescendants of Xi. Then {X1, . . . , Xn} = {Xi} ∪ πXi ∪ DXi ∪ NXi. Now:

p(Xi|πXi, NXi) = p(Xi, πXi, NXi) / ∑xi∈dom(Xi) p(Xi = xi, πXi, NXi).

Due to the fact that p factorizes over G, the numerator can be written as

p(Xi, πXi, NXi) = ∑DXi p(Xi, πXi, NXi, DXi) = ∑DXi p(X1, . . . , Xn) = ∑DXi ∏j=1,...,n p(Xj|πXj)

= p(Xi|πXi) ∏Xj∈NXi p(Xj|πXj) ∏Xk∈πXi p(Xk|πXk) ∑DXi ∏Xl∈DXi p(Xl|πXl)

= p(Xi|πXi) ∏Xj∈NXi p(Xj|πXj) ∏Xk∈πXi p(Xk|πXk).

Using this result, the sum in the denominator can be transformed as follows:

∑xi∈dom(Xi) p(Xi = xi, πXi, NXi) = ∑xi∈dom(Xi) p(Xi = xi|πXi) ∏Xj∈NXi p(Xj|πXj) · ∏Xk∈πXi p(Xk|πXk)

= ∏Xj∈NXi p(Xj|πXj) ∏Xk∈πXi p(Xk|πXk).

Substituting these expressions for the numerator and denominator results in p(Xi|πXi, NXi) = p(Xi|πXi), which means Xi and NXi are conditionally independent given πXi. This concludes the proof, as this is exactly what the local Markov assumptions state. [12]

Theorem 2.2.7. (Hammersley-Clifford) Let H = (X, D) be an MRF and p a positive distribution over X, meaning p(x) > 0 for all x ∈ Val(X). If the local Markov properties implied by H hold in p, then p is a Gibbs distribution over H. [10]

The original proof of the Hammersley-Clifford theorem can be found in [4]. However, the proof followed here is a constructive proof [11]. In this proof the following lemma is used:

Lemma A.0.1. Let X = {1, . . . , n}, n ∈ N, be the collection of nodes in the graph that index the random variables X1, . . . , Xn. Let g be any real-valued function on dom((X1, . . . , Xn)). For each subset A ⊆ {1, . . . , n} define

gA(x) = g(y), where yi = xi if i ∈ A and yi = 0 if i ∈ Ac,

and

ΨA(x) = ∑B⊆A (−1)|A|−|B| gB(x).

Then


(i) The function ΨA depends on x = (x1, . . . , xn) ∈ dom((X1, . . . , Xn)) only through those coordinates xj with j ∈ A (in particular, Ψ∅ is constant);

(ii) for A ≠ ∅, if xi = 0 for at least one i ∈ A, then ΨA(x) = 0;

(iii) g(x) = ∑A⊆X ΨA(x).

Proof. (i) This is rather trivial, because for every B ⊆ A, gB(x) does not depend on the variables {xj : j ∉ A}, which means ΨA does not depend on those variables either, as it only depends on x via the gB(x).

(ii) Let xi = 0 for some i ∈ A. Then the collection of subsets of A can be divided into two subcollections B1 and B2: B1 = {B ⊆ A : i ∈ B}, B2 = {B ⊆ A : i ∉ B}. Let B ∈ B1, which means i ∈ B. But then B̃ := B \ {i} ∈ B2. So every element of B1 has a corresponding element in B2, which is the same set only without the element i. In the same way, for each B̃ ∈ B2 there is a corresponding B ∈ B1, namely B = B̃ ∪ {i}. Now let B ∈ B1 and denote its corresponding element in B2 as B̃. Then, because xi = 0, it holds that gB(x) = gB̃(x). This and the fact that |B| = |B̃| + 1 give the following result:

(−1)|A|−|B| gB(x) + (−1)|A|−|B̃| gB̃(x) = (−1)|A|−|B̃|−1 gB(x) + (−1)|A|−|B̃| gB̃(x) = (−1)|A|−|B̃| (−gB(x) + gB̃(x)) = 0.

Because there is a bijection between the sets B1 and B2, this gives the desired result:

ΨA(x) = ∑B⊆A (−1)|A|−|B| gB(x) = ∑B∈B1 [(−1)|A|−|B| gB(x) + (−1)|A|−|B̃| gB̃(x)] = 0.

(iii) Note that

∑A⊆X ΨA(x) = ∑A⊆X ∑B⊆A (−1)|A|−|B| gB(x).

A subset A ⊆ X contributes (−1)|A|−|B| to the coefficient of gB(x) if B ⊆ A. This means the coefficient of gB(x) can be written as

∑A 1{B⊆A⊆X} (−1)|A|−|B| = ∑E⊆Bc (−1)|E|.

For Bc = ∅, so B equal to X, this coefficient is 1. For Bc ≠ ∅ it is a set-theoretic result that exactly half of the subsets E have |E| even and half have |E| odd, leading to a coefficient of 0. Now it can be concluded that ∑A⊆X ΨA(x) = gX(x) = g(x).
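A small numerical check of the lemma (a toy function on {0, 1, 2}³ is assumed, with nodes indexed 0, 1, 2 for programming convenience; this is an illustration, not part of the proof):

import numpy as np
from itertools import combinations, chain

n = 3
rng = np.random.default_rng(1)
g = rng.normal(size=(3, 3, 3))                  # arbitrary g on {0,1,2}^3

def subsets(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def g_A(A, x):
    # zero out the coordinates outside A, as in the definition of g_A
    y = tuple(x[i] if i in A else 0 for i in range(n))
    return g[y]

def Psi_A(A, x):
    return sum((-1) ** (len(A) - len(B)) * g_A(B, x) for B in subsets(A))

x = (2, 1, 2)
total = sum(Psi_A(set(A), x) for A in subsets(range(n)))
print("g(x)                 =", g[x])
print("sum of Psi_A(x)      =", total)              # property (iii)
print("Psi_{0,1} with x0 = 0:", Psi_A({0, 1}, (0, 1, 2)))  # 0, property (ii)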

Proof of Theorem 2.2.7. Let X = {1, . . . , n}, n ∈ N, be the collection of nodes in the graph that index the random variables X1, . . . , Xn. Let

g(y) = log p(X1 = y1, . . . , Xn = yn)


for y = (y1, . . . , yn) ∈ dom((X1, . . . , Xn)); this is well defined because p is positive. This is a real-valued function on the domain of X1, . . . , Xn, so according to lemma A.0.1(iii) it holds that

p(X1 = x1, . . . , Xn = xn) = e∑A⊆X ΨA(x) = ∏A⊆X eΨA(x).

The factors eΨA(x) are positive and depend on x only through the coordinates indexed by A, as proven in lemma A.0.1(i). This means the distribution p factorizes over H with factors eΨA(x) if it can be proven that ΨA(x) = 0, and thus eΨA(x) = 1, for subsets A which are not complete. Let A be a subset of X which is not complete and without loss of generality let 1, 2 ∈ A, where 1 and 2 are not connected. Let B ⊂ A with 1 ∉ B and B̃ = B ∪ {1}. Because |B| and |B̃| differ by exactly 1, the contribution that the pair makes to the sum in the definition of ΨA(x) is ±(gB̃(x) − gB(x)). Now:

gB̃(x) − gB(x)

= log p(X1 = x1, X2 = y2, . . . , Xn = yn) − log p(X1 = 0, X2 = y2, . . . , Xn = yn)

= log [p(X1 = x1|X2 = y2, . . . , Xn = yn) / p(X1 = 0|X2 = y2, . . . , Xn = yn)]

Because 2 is not a direct neighbour of 1, the local Markov assumptions now give that both conditional probabilities do not depend on the value of X2. This holds for every pair B, B̃, so ΨA(x) does not change if X2 = y2 is replaced by X2 = 0. From lemma A.0.1(ii) it then follows that ΨA(x) = 0. It can be concluded that eΨA(x) = 1 if A is not complete. So p is a Gibbs distribution over H.


B. Proof, chapter 3

Theorem 3.2.4. The following problem is NP-complete: given a Bayesian network G over X, a variable X ∈ X, and a value x ∈ Val(X), decide whether p(X = x) > 0.

Proof. To prove this theorem, it is first proved that this decision problem is in NP and then that it is NP-hard.

Figure B.1.: Bayesian network with Val(Qi) = {qi, ¬qi}. There is an edge from Qi to Cj if Cj contains qi or ¬qi; so C1 contains q1 or ¬q1. Furthermore, Ai = Ai−1 ∧ Ci+1.

To check that the problem is in NP, a full assignment ξ to the network variables is guessed. First it is checked whether ξ implies X = x; thereafter it is checked whether p(ξ) > 0. The last check can be done in linear time, due to the chain rule for Bayesian networks. One of the guesses results in a success if and only if p(X = x) > 0.

Left to prove is NP-hardness. To prove this, one could prove that all other problems in NP can be transformed to the problem in polynomial time. However, there are already problems for which NP-completeness is proved, and showing that such a known NP-complete problem can be transformed into the above decision problem in polynomial time also proves NP-hardness. The latter way will be used here.

The known NP-complete problem that will be used is the 3-SAT problem (example 3.2.2), which is one of Karp's 21 NP-complete problems and is often used to prove NP-hardness of other decision problems. So it will be proven that the 3-SAT problem can be transformed into the decision problem in the theorem in polynomial time.

Let φ = C1 ∧ · · · ∧ Cm be a 3-SAT formula over the propositional variables q1, . . . , qn. The goal is to create, in polynomial time, a Bayesian network G with a variable X, such that there exists a satisfying assignment to φ if and only if p(X = x) > 0. Let G contain the nodes Q1, . . . , Qn, where Val(Qk) = {qk, ¬qk} and p(qk) = 0.5. Furthermore, let G also contain the nodes C1, . . . , Cm, where each node Ci corresponds to a clause in the formula φ (see figure B.1). Let there be an edge from Qk to Ci if the clause Ci contains


either qk or ¬qk as one of its literals. The conditional probability distribution of Ci now has the form p(ℓh ∨ ℓj ∨ ℓk|Qh, Qj, Qk), where ℓl ∈ {ql, ¬ql}. From this it can be seen that the conditional probability distribution consists of at most eight distributions and at most sixteen entries, because the clause ℓh ∨ ℓj ∨ ℓk either has value 0 or 1 given the values of Qh, Qj and Qk. Thus, the conditional probability distribution for Ci can be chosen such that it duplicates the behavior of the clause: when the clause is false the probability at Ci will be 0 and when the clause is true the probability will be 1.

Besides Q1, . . . , Qn and C1, . . . , Cm, let G also contain the nodes A1, . . . , Am−2 and X, inductively defined as A1 = C1 ∧ C2, Ai = Ai−1 ∧ Ci+1 and X = Am−2 ∧ Cm. If all clauses Cj return true, all nodes Ai return true, and when this happens let X have value x. It follows that (q1, . . . , qn) is a satisfying assignment for φ if and only if p(X = x|q1, . . . , qn) = 1. Now,

p(X = x) = ∑Q1,...,Qn p(X = x|q1, . . . , qn) p(q1, . . . , qn) = ∑Q1,...,Qn (1/2^n) p(X = x|q1, . . . , qn),

resulting in the fact that p(X = x) equals the number of satisfying assignments for φ (because (q1, . . . , qn) is a satisfying assignment for φ if and only if p(X = x|q1, . . . , qn) = 1) divided by 2^n. So φ has a satisfying assignment if and only if p(X = x) > 0.


C. Proofs, chapter 4

Theorem 4.1.3. In a minimal exponential family, the gradient map ∇A is onto M°. Consequently, for each µ ∈ M°, there exists some θ = θ(µ) ∈ Ω such that Eθ[φ(X)] = µ.

Proof. For this proof, the following known facts for convex sets S ⊂ Rd will be used:

(i) S is full-dimensional if and only if the affine hull of S is equal to Rd.

(ii) if S is full-dimensional it has nonempty interior and (S̄)° = S°.

(iii) Let S be full-dimensional. S° contains the zero vector if and only if for all nonzero γ ∈ Rd there exists s ∈ S with 〈γ, s〉 > 0.

Let the representation of the exponential family be minimal. Assume M is not full-dimensional. Then by (i), its affine hull is not equal to Rd. This is equivalent to the existence of an a ∈ Rd and a b ∈ R with 〈a, µ〉 = b for all µ ∈ M. But then for all distributions p(x) the following holds:

b = 〈a, ∫ φ(x)p(x)ν(dx)〉 = ∫ aTφ(x) p(x) ν(dx).

But also:

b = ∫ b p(x) ν(dx),

indicating that 〈a, φ(x)〉 = b ν-almost everywhere, which is a contradiction with the fact that the exponential family has a minimal representation and thus leads to the conclusion that M has to be full-dimensional. This means the properties (ii) and (iii) hold.

Now first the inclusion ∇A(Ω) ⊆ M° is proven. Consider the case 0 ∈ ∇A(Ω) (when 0 ∉ ∇A(Ω), shift φ by a constant vector; this gives 0 ∈ ∇A(Ω)). Then there exists a θ0 ∈ Ω with ∇A(θ0) = 0. If it can be proven that for any nonzero γ ∈ Rd there exists µ ∈ M such that 〈γ, µ〉 > 0, property (iii) implies 0 ∈ M°. So, take a nonzero γ ∈ Rd. Ω is open, which implies that there exists some ε > 0 with θ0 + εγ ∈ Ω. Now, because A is differentiable and strictly convex on Ω, the following holds:

A(θ0 + εγ) > A(θ0) + 〈∇A(θ0), εγ〉 = A(θ0), (C.1)

where it is used that ∇A(θ0) = 0. In the same way the following holds:

A(θ0) > A(θ0 + εγ) + 〈∇A(θ0 + εγ),−εγ〉. (C.2)

This can be rewritten as:

ε〈∇A(θ0 + εγ), γ〉 > A(θ0 + εγ) − A(θ0).


This, together with (C.1), now implies that ε〈∇A(θ0 + εγ), γ〉 > 0. Moreover, ∇A(θ0 + εγ) ∈ ∇A(Ω) ⊆ M, so with property (iii) this yields 0 ∈ M° and thus ∇A(Ω) ⊆ M°.

To show M° ⊆ ∇A(Ω), it can in the same way be assumed that 0 ∈ M°. Now a vector θ ∈ Ω must be found for which ∇A(θ) = 0. This is, by the convexity of A, equivalent to the fact that infθ∈Ω A(θ) is attained, which is equivalent to the fact that for all sequences θn with limn→∞ ||θn|| = ∞ it holds that limn→∞ A(θn) = ∞. To prove this fact, consider the set Hγ,δ := {x ∈ X | 〈γ, φ(x)〉 ≥ δ}, for γ ∈ Rd an arbitrary nonzero vector and δ > 0. Now assume this set has measure zero. Then 〈γ, φ(x)〉 ≤ 0 must hold ν-a.e., which again implies 〈γ, µ〉 ≤ 0 for all µ ∈ M. But M is convex, thus 0 ∉ (M̄)° = M° (property (ii)). This gives a contradiction, so Hγ,δ must have positive measure under ν for δ > 0 sufficiently small. Now, let θk ∈ Ω be arbitrary. Then, for t ≥ 0:

A(θk + tγ) = log ∫X exp〈θk + tγ, φ(x)〉 ν(dx)

≥ log ∫Hγ,δ exp〈θk + tγ, φ(x)〉 ν(dx)

≥ log ∫Hγ,δ exp〈θk, φ(x)〉 exp(tδ) ν(dx)

= tδ + C(θk),

with

C(θk) := log ∫Hγ,δ exp〈θk, φ(x)〉 ν(dx).

It is already proved that ν(Hγ,δ) > 0, and it trivially holds that exp〈θk, φ(x)〉 > 0 for all x ∈ X, which yields C(θk) > −∞. This gives

limt→∞ A(θk + tγ) ≥ limt→∞ [tδ + C(θk)] = ∞.

Since γ was arbitrary, this holds for all directions γ ∈ Rd, which concludes the proof.

Theorem 4.1.6. (a) For any µ ∈ M°, let θ(µ) denote the unique canonical parameter corresponding to µ, so ∇A(θ(µ)) = µ. The Fenchel-Legendre dual of A has the following form:

A∗(µ) = −H(pθ(µ)(x)) if µ ∈ M°, and A∗(µ) = +∞ if µ ∉ M̄.

For any boundary point µ ∈ M̄ \ M° the dual becomes

A∗(µ) = limn→∞ [−H(pθ(µn)(x))],

µn being a sequence in M° converging to µ.

(b) In terms of this dual, the log partition function has the following variational representation:

A(θ) = supµ∈M {〈θ, µ〉 − A∗(µ)}.


(c) For all θ ∈ Ω, the supremum in the variational representation of A(θ) is attained uniquely at the vector µ = Eθ[φ(x)]. This statement holds in both minimal and nonminimal representations.

Proof. The proof of (a) is split into three cases:

µ ∈ M°: Theorem 4.1.5 gives that (∇A)−1(µ) is nonempty. Recall the definition of the conjugate dual:

A∗(µ) = supθ∈Ω {〈µ, θ〉 − A(θ)}.

Differentiating with respect to θ and setting the result equal to 0 gives µ = ∇A(θ). So any point θ ∈ (∇A)−1(µ) attains the supremum, where the optimal value is given by

A∗(µ) = 〈θ(µ), µ〉 − A(θ(µ)),

θ(µ) ∈ Ω being the parameter for which ∇A(θ(µ)) = µ. Now it can be seen that for this θ(µ):

A∗(µ) = 〈θ(µ), µ〉 − A(θ(µ)) = Eθ(µ)[〈θ(µ), φ(x)〉 − A(θ(µ))] = ∫X pθ(µ)(x) log pθ(µ)(x) ν(dx) = −H(pθ(µ)).

µ ∉ M̄: Define dom(A∗) := {µ ∈ Rd | A∗(µ) < +∞}. Then what is to be proven is that dom(A∗) ⊆ M̄. A result from convex analysis [Rockafellar [13]] is that if A is essentially smooth and lower semi-continuous (definitions are given below), then

(dom(A∗))° ⊆ ∇A(Ω) ⊆ dom(A∗),

which is equivalent to

(dom(A∗))° ⊆ M° ⊆ dom(A∗),

because ∇A(Ω) = M°. Then, by the convexity of dom(A∗) and M, this gives cl(dom(A∗)) = cl(M°) = M̄ and thus dom(A∗) ⊆ M̄, which is exactly what needed to be proven.

Thus the only thing that still needs to be proved is that A is essentially smooth and lower semi-continuous, for which the definitions are as follows:

Definition C.0.1. A convex function f is lower semi-continuous if lim infy→x f(y) ≥ f(x) for all x. A convex function f is essentially smooth if C := int(dom(f)) is nonempty, f is differentiable throughout C, and limn→∞ ||∇f(xn)|| = ∞ for any sequence xn in C converging to a boundary point of C.

As proven in lemma 4.1.3, A is differentiable, which implies continuity and thus lower semi-continuity. To prove that A is essentially smooth, note that Ω is nonempty and A is differentiable on int(Ω) = Ω, since Ω is open. Furthermore, let θk be a boundary


point of Ω and θl ∈ Ω. Then, using the fact that Ω is convex and open, it holds for all m ∈ [0, 1) that

θm := mθk + (1 − m)θl ∈ Ω,

where again a result from Rockafellar [13] is used. Now, using the convexity and differentiability of A, it follows that

A(θl) ≥ A(θm) + 〈∇A(θm), θl − θm〉.

Rewriting this and using the Cauchy-Schwarz inequality yields

A(θm) − A(θl) ≤ 〈∇A(θm), θm − θl〉 ≤ ||θm − θl|| ||∇A(θm)||.

Now, when taking m → 1−, the lower semi-continuity of A and the regularity of the exponential family ensure that the left hand side goes to infinity. This of course means that the right hand side also goes to infinity. Since ||θm − θl|| is bounded, ||∇A(θm)|| → ∞, meaning A is indeed essentially smooth.

µ ∈ M̄ \ M°: A∗ is a conjugate function and is defined as the supremum of a collection of functions which are linear in µ. Thus A∗ is lower semi-continuous. This means the value of A∗(µ) for µ on the boundary is equal to the limit over a sequence µn ∈ M° with limit µ.

To prove part (b), recall the definition of the conjugate dual:

A∗(µ) = supθ∈Ω {〈µ, θ〉 − A(θ)}.

It can easily be checked that (A∗)∗ = A, meaning

A(θ) = (A∗)∗(θ) = supµ∈dom(A∗) {〈θ, µ〉 − A∗(µ)} = supµ∈M {〈θ, µ〉 − A∗(µ)},

where the last step follows from part (a), which shows that M° ⊆ dom(A∗) ⊆ M̄, so that the two suprema coincide.

where the last step follows from (a), where it is already stated that domA∗ =M.To prove (c) note that it is already proved there is a bijection between Ω and M,

given by ∇A. Again, in Rockafellar [13] is shown that this means that ∇A∗ exists and isbijective as well. This means that when θ ∈ Ω, the supremum is attained at one uniquevector, which is the vector µ for which ∇A(θ) = µ and thus µ = Eθ[φ(x)].
