Bayesian Conditional Random Fields -...

Bayesian Conditional Random Fields

Yuan Qi [email protected] Computer Science and Artificial Intelligence Laboratory32 Vassar Street, Cambridge, MA 02139, USA

Martin Szummer [email protected] Research, Cambridge, CB3 0FB, United Kingdom

Thomas P. Minka [email protected]

Microsoft Research, Cambridge, CB3 0FB, United Kingdom

Editor:

Abstract

We propose Bayesian Conditional Random Fields (BCRFs) for classifying interdependentand structured data, such as sequences, images or webs. BCRFs are a Bayesian approachto training and inference with conditional random fields, which were previously trained bymaximizing likelihood (ML) (Lafferty et al., 2001). Our framework avoids the problemof overfitting, and offers the full advantages of a Bayesian treatment. Unlike the MLapproach, we estimate the posterior distribution of the model parameters during training,and average predictions over this posterior during inference. We apply two extensions ofexpectation propagation (EP), the power EP and the novel transformed EP methods, toincorporate the partition function. For algorithmic stability and accuracy, we flatten theapproximation structures to avoid two-level approximations. We demonstrate the superiorprediction accuracy of BCRFs over conditional random fields trained with ML or MAP onsynthetic and real datasets.Keywords: Conditional random fields, Power Expectation Propagation (PEP), Trans-formed PEP, Bayesian model averaging, Structured data

1. Introduction

Traditional classification models often assume that data items are independent. However,real world data is often interdependent and has complex structure. Suppose we want toclassify web pages into different categories, e.g., homepages of students versus faculty. Thecategory of a web page is often related to the categories of pages linked to it. Ratherthan classifying pages independently, we should model them jointly to incorporate suchcontextual cues.

Joint modeling of structured data can be performed by generative graphical models,such as Bayesian networks or Markov random fields. For example, hidden Markov modelshave been used in natural language applications to assign labels to words in a sequence,where labels depend both on words and other labels along a chain. However, generativemodels have fundamental limitations. Firstly, generative models require specification of

1

the data generation process, i.e., how data can be sampled from the model. In manyapplications, this process is unknown or impractical to write down, and not of interest for theclassification task. Secondly, generative models typically assume conditional independenceof observations given the labels. This independence assumption limits their modeling powerand restricts what features can be extracted for classifying the observations. In particular,this assumption rules out features capturing long-range correlations, multiple scales, orother context.

Conditional random fields (CRF) are a conditional approach for classifying structureddata, proposed by Lafferty et al. (2001). CRFs model only the label distribution con-ditioned on the observations. Unlike generative models, they do not need to explain theobservations or features, and thereby conserve model capacity and reduce effort. This alsoallows CRFs to use flexible features such as complex functions of multiple observations.The modeling power of CRFs has shown great benefit in several applications, such as natu-ral language parsing (Sha & Pereira, 2003), information extraction (McCallum, 2003), andimage modeling (Kumar & Hebert, 2004).

To summarize, CRFs provide a compelling model for structured data. Consequently,there has been an intense search for effective training and inference algorithms. The firstapproaches maximized conditional likelihood (ML), either by generalized iterative scalingor by quasi-Newton methods (Lafferty et al., 2001; Sha & Pereira, 2003). However, the MLcriterion is prone to overfitting the data, especially since CRFs are often trained with verylarge numbers of correlated features. The maximum a posteriori (MAP) criterion can reduceoverfitting, but provides no guidance on the choice of parameter prior. Furthermore, largemargin criteria have been applied to regularize the model parameters and also to kernelizeCRFs (Taskar et al., 2004; Lafferty et al., 2004). Nevertheless, training and inference forCRFs remain challenges, and the problems of overfitting, feature and model selection havelargely remained open.

In this paper, we propose Bayesian Conditional Random Fields (BCRF), a novel Bayesianapproach to training and inference for conditional random fields. Applying the Bayesianframework brings principled solutions and tools for addressing overfitting, model selectionand many other aspects of the problem. Unlike ML, MAP, or large-margin approaches,we train BCRFs by estimating the posterior distribution of the model parameters. Subse-quently, we can average over the posterior distribution for BCRF inference.

The complexity of the partition function in CRFs (the denominator of the likelihoodfunction) necessitates approximations. While traditional EP and variational methods cannot be directly applied, because of the presence of partition functions, the power EP (PEP)method (Minka, 2004) can be used for BCRF training. However, PEP can lead to con-vergence problems. To solve these problems, this paper proposes the transformed PEPmethod to incorporate partition functions in BCRF training. Furthermore, we flatten theapproximation structures to avoid two-level approximations. This significantly enhances thealgorithmic stability and improves the estimation accuracy. For computational efficiency,we use low-rank matrix operations.

This paper first formally defines CRFs, presents the power EP and the new transformedPEP methods, and flattens the approximation structure for training. Then it proposes anapproximation method for model averaging, and finally shows experimental results.

2

2. From conditional random fields to BCRFs

A conditional random field (CRF) models label variables according to an undirected graph-ical model conditioned on observed data (Lafferty et al., 2001). Let x be an “input” vectordescribing the observed data instance, and t = ti be an “output” random vector overlabels of the data components. We assume that all labels for the components belong toa finite label alphabet T = 1, . . . , T. For example, the input x could be image featuresbased on small patches, and t be labels denoting ’person, ’car’ or ’other’ patches. Formally,we have the following definition of CRFs (Lafferty et al., 2001):

Definition 1 Let G = (V, E) be a graph such that t is indexed by the vertices of G. Then(x, t) is a conditional random field (CRF) if, when conditioned on x, the random variablesti obey the Markov property with respect to the graph: p(ti | x, tV−i) = p(ti | x, tNi) whereV− i is the set of all nodes in G except the node i, Ni is the set of neighbors of the node iin G, and tΩ represents the random variables of the vertices in the set Ω.

Unlike traditional generative random fields, CRFs only model the conditional distributionp(t | x) and do not explicitly model the marginal p(x). Note that the labels ti are globallyconditioned on the whole observation x in CRFs. Thus, we do not assume that the observeddata x are conditionally independent as in a generative random field.

BCRFs are a Bayesian approach to training and inference with conditional randomfields. In some sense, BCRFs can be viewed as an extension of conditional Bayesian linearclassifiers, e.g., Bayes point machines (BPM) (Herbrich et al., 1999; Minka, 2001), whichare used to classify independent data points.

According to the Hammersley-Clifford theorem, a CRF defines the conditional distribu-tion of the labels t given the observations x to be proportional to a product of potentialfunctions on cliques of the graph G. For simplicity, we consider only pairwise clique poten-tials such that

p(t | x,w) =1

Z(w)

∏i,j∈E

gi,j(ti, tj ,x;w), (1)

where

Z(w) =∑t

∏i,j∈E

gi,j(ti, tj ,x;w). (2)

is a normalizing factor known as the partition function, gi,j(ti, tj ,x;w) are pairwise poten-tials, and w are the model parameters. Note that the partition function is a complicatedfunction of the model parameters w. This makes Bayesian training much harder for CRFsthan for Bayesian linear classifiers, since the normalizer of a Bayesian linear classifier is aconstant. A simple CRF is illustrated in Figure 1. As a Bayesian approach, BCRFs esti-mate the posterior distribution of the parameters w, the unshaded node in the graphicalmodel. In standard conditional random fields, the pairwise potentials are defined as

gi,j(ti, tj ,x;w) = exp(wTti,tjφi,j(ti, tj ,x)), (3)

3

x

t1 t2

w

Figure 1: Graphical model for a simple Bayesian conditional random field. Conditioned on theobservations x and the parameters w, the labels t1 and t2 compose a simple Markovrandom field. As a Bayesian approach, BCRFs estimate the posterior distribution of theparameters w, given the observations in training.

where φi,j(ti, tj ,x) are features extracted for the edge between vertices i and j of the con-ditional random field, and wti,tj are parameters corresponding to labels ti, tj in w, wherew = [wT

1,1,wT1,2, . . . ,w

TT,T ]T. There are no restrictions on the relation between features.

Instead of using an exponential potential function, we prefer to use the probit functionΨ(·) (the cumulative distribution function of a Gaussian with mean 0 and variance 1).This function is bounded, unlike the exponential, and permits efficient Bayesian training.Furthermore, to incorporate robustness against labeling errors, we allow a small probabilityε of a label being incorrect, thus bounding the potential away from 0. Specifically, ourrobust potentials are:

gi,j(ti, tj ,x;w) = (1− ε)Ψ(wTti,tjφi,j(ti, tj ,x)) + ε(1−Ψ(wT

ti,tjφi,j(ti, tj ,x))). (4)

3. Training BCRFs

Given the data likelihood and a Gaussian prior

p0(w) ∼ N (w | 0,diag(α)), (5)

the posterior of the parameters is

p(w | t,x) ∝ p0(w)Z(w)

∏k∈E

gk(ti, tj ,x;w), (6)

where k = i, j indexes edges in a CRF.Expectation propagation exploits the fact that the posterior is a product of simple

terms. If we approximate each of these terms well, we can get a good approximation of the

4

posterior. Mathematically, EP approximates p(w | t,x) as

q(w) =p0(w)Z(w)

∏k∈E

gk(w) (7)

=1

Z(w)R(w), (8)

where R(w) = p0(w)∏

k∈E gk(w) is the numerator in the approximate posterior distributionq(w). The approximation terms gk(w) and 1/Z(w) have the form of Gaussians, so thatthe approximate posterior q(w) is a Gaussian, i.e., q(w) ∼ N (mw,Σw). We can obtainthe approximation terms gk(w) in the same way as in Bayesian linear classifiers (Minka,2001). Specifically, for the gk term, the algorithm first computes q\k(w), which representsthe “rest of the distribution.” Then it minimizes KL-divergence over gk, holding q\k fixed.This process can be written succinctly as follows:

q\k(w) ∝ q(w)/gk(w) (9)

gk(w)new = argmingk(w)

KL(gk(w)q\k(w) ‖ gk(w)q\k(w)) (10)

= proj[gk(w)q\k(w)

]/q\k(w) (11)

q(w)new = q\k(w)gk(w)new, (12)

where proj is a “moment matching” operator: it finds the Gaussian having the same mo-ments as its argument, thus minimizing KL-divergence. Algorithmically, (9) means “dividethe Gaussians to get a new Gaussian, and call it q\k(w).” Similarly, (11) means “con-struct a Gaussian whose moments match gk(w)q\k(w) and divide it by q\k(w), to get anew Gaussian approximation term gk(w)new.” These are the updates to incorporate theexact terms in the numerator. The main challenge for BCRF training is to approximatethe denominator 1/Z(w) and incorporate its approximation into q(w).

3.1 Power EP and Transformed PEP

This section presents the power EP (PEP) and the transformed PEP methods to approx-imate an integral involving the denominator 1/Z(w). Traditional EP has difficulty incor-porating the denominator, since it is hard to directly compute required moments for theKL minimization in the projection step. The PEP and the transformed PEP methods solvethis problem by changing the projection step.

3.1.1 Power EP

The first method is the power EP method, which raises the approximation terms to somepower in order to make them tractable to calculate. Different powers can be used for differentterms. This idea was first used by Minka and Lafferty (2002) to raise approximation termsto positive powers. However, PEP also works with negative powers, and this is one of thekey insights that makes approximate BCRF training tractable.

5

Specifically, to incorporate a term sk(w) approximated by sk(w), PEP extends EP byusing the following updates with a desired power nk 6= 0:

q\k(w) ∝ q(w)/sk(w)1/nk (13)

sk(w)new =(proj

[sk(w)1/nkq\k(w)

]/q\k(w)

)nk

(14)

q(w)new = q(w)sk(w)new

sk(w)= q\k(w)

sk(w)new

sk(w)1−1/nk. (15)

As shown by Minka (2004), these updates seek to minimize a different measure of divergence,the α-divergence (Amari, 1985), where α = 2/nk − 1. Note that α can be any real number.By picking nk appropriately, the updates can be greatly simplified, while still minimizinga sensible error measure. This result is originally due to Wiegerinck and Heskes (2002).They discussed an algorithm called “fractional belief propagation” which is a special caseof PEP, and all of their results also apply to PEP.

For BCRF training, we will use these PEP updates for the denominator term sk(w) =1/Z(w) by picking nk = −1. Specifically, we have the following updates.

1. Deletion: we compute the “leave-one-out” approximate posterior q\z(w), by remov-ing the old approximation of the partition function Z(w) from the current approxi-mation q(w) as if it were a numerator.

q\z(w) ∝ q(w)Z(w)

=1

(Z(w))2R(w). (16)

2. Projection: given the “leave-one-out” approximate posterior q\z(w), we computethe new “exact” posterior:

q(w) = proj[Z(w)q\z(w)

]= argmin

q(w)KL(q\z(w)Z(w) ‖ q(w)), (17)

by matching the moments of q(w) and q\z(w)Z(w)). Thanks to picking nk = −1, weneed only the moments of Z(w)q\k(w), instead of those of q\k(w)/Z(w). Since wecannot compute the moments of Z(w)q\k(w) analytically, we embed an EP algorithmto approximate these moments (see details in Section 3.2). Note that we cannot applyEP directly to approximate the moments of q\k(w)/Z(w) without resorting to otherapproximation methods, such as quadrature or Monte Carlo.

Given q(w), we update Z(w)new as follows:

Z(w)new ∝ q(w)/q\z(w). (18)

3. Inclusion: we replace the old approximation term Z(w) with a new one to obtainq(w)new.

q(w)new = q\z(w)(Z(w))2

Z(w)new. (19)

6

3.1.2 Transformed PEP

PEP has feedback loops in its updates: the “leave-one-out” approximate posterior q\k(w)involves the old approximation term sk(w). This is because we remove only a fraction ofthe old approximation term from q(w) in the deletion step:

q\k(w) = q(w)/sk(w)1/nk = sk(w)1−1/nk∏j 6=k

sj(w). (20)

Therefore, the new approximation term based on q\k(w) is dependent on the old approxi-mation term. This dependency can cause oscillations in training. To address this problemand obtain faster convergence speeds, we weaken the feedback loops in PEP by introducinga damping factor mk in the deletion step. The damping factor transforms the power ofthe approximation term in the deletion step of PEP. Thus, we name this method trans-formed PEP (TPEP). Specifically, to incorporate the term sk(w) approximated by sk(w),transformed PEP generalizes PEP by using the following updates:

q\k(w) ∝ q(w)/sk(w)1−mk+mknk (21)

sk(w)new =(proj

[sk(w)1/nkq\k(w)

]/q\k(w)

)nk

(22)

q(w)new = q(w)sk(w)new

sk(w)= q\k(w)

sk(w)new

sk(w)mk−mk/nk, (23)

where mk ∈ [0, 1]. If mk = 1, then transformed PEP reduces to PEP. If mk = 0, thenthe “leave-one-out” approximate posterior does not involve any information in the old ap-proximation term sk(w). We name this special case the transformed EP (TEP) method.TEP essentially cuts the feedback loops in PEP. By sharing the same inclusion and deletionsteps, TEP is more similar to EP than PEP is to EP. TEP differs from EP only in theprojection step by transforming the exact and the approximation terms in EP. Hence thename “transformed EP”. In practice, TEP tends to have faster convergence rates than PEP,while sacrificing accuracy to some extent. We can also choose mk between 0 and 1 to dampthe feedback loops and obtain a tradeoff between PEP and TEP.

For BCRF training, we will use the above updates for the denominator term sk(w) =1/Z(w). We pick nk = −1 and mk = 0 such that transformed PEP is reduced to TEP:

1. Deletion: q\z(w) ∝ q(w)Z(w) = R(w). Here q\z(w) does not involve the old termapproximation Z(w), while q\z(w) in PEP does.

2. Projection: q(w) = proj[Z(w)q\z(w)

].

Again, as in the PEP method, we need to use another EP to approximate the requiredmoments. Given the new moments of q(w), we update the new approximation termZ(w)new ∝ q(w)/q\z(w).

3. Inclusion: q(w)new = q\z(w)/Z(w)new.

7

3.2 Approximating the partition function

In the moment matching step, we need to approximate the moments of Z(w)q\z(w) whereZ(w) is the exact partition function, since Z(w) is a very complicated function of w. Murrayand Ghahramani (2004) have proposed approximate MCMC methods to approximate thepartition function of an undirected graph. This section presents an alternative method.

For clarity, let us rewrite the partition function as follows:

Z(w) =∑t

Z(w, t) (24)

Z(w, t) =∏k∈E

gk(ti, tj ,x;w), (25)

where k = i, j indexes edges. To compute the moments of Z(w)q\z(w), we can embedEP. Specifically, the EP approximation factorizes over w and t:

q(w) = proj[Z(w)q\z(w)

](26)

q(w)q(t) = Z(w)Z(t)q\z(w) (27)

Z(w)Z(t) =∏k∈E

fk(w)fk(ti)fk(tj), (28)

where q(t) is the approximate distribution for labels, and fk(w)fk(ti)fk(tj) approximatesgk(ti, tj ,x;w) in the denominator.

Because the approximation is factorized, the computation will have the flavor of loopybelief propagation. Let the initial fk’s all be 1, making the initial q(w) = q\z(w) andq(t) = 1. The EP updates for the fk’s are:

q\k(w) ∝ q(w)/fk(w) (29)

q\k(ti) ∝ q(ti)/fk(ti) (similarly for j) (30)

fk(w) =∑ti,tj

gk(ti, tj ,x;w)q\k(ti)q\k(tj) (31)

fk(w)new = proj[fk(w)q\k(w)

]/q\k(w) (32)

fk(ti)new =∑tj

∫w

gk(ti, tj ,x;w)q\k(w)q\k(tj) dw (33)

q(w)new = q\k(w)fk(w)new (34)

q(ti)new = q\k(ti)fk(ti)new. (35)

These updates are iterated for all k, until a fixed point is reached.

3.3 Flattening the approximation structure

In practice, we found that the approximation method, presented in Sections 3.1.1 and 3.2,led to non-positive definite covariance matrices in training. In this section, we examine thereason for this problem and propose a method to fix it.

8

The approximation method has two levels of approximations, which are visualized at thetop of Figure 2. At the upper level of the top graph, the approximation method iterativelyrefines the approximate posterior q(w) based on the approximation term Z(w); at the lowerlevel, it iteratively refines Z(w) by individual approximation terms fk(w).

A naive implementation of the two-level approximation will always initialize the approx-imation terms fk(w) as 1, such that the iterations at the lower level start from scratchevery time. Thus, removing the denominator Z(w) in the upper level amounts to removingall the previous approximation terms fk(w) in the lower level. The naive implementationrequires the “leave-one-out” approximation q\z(w) ∝ q(w)/Z(w) to have a positive defi-nite covariance matrix. This requirement is often violated in practice, and then the trainingalgorithm chooses to skip the whole denominator Z(w) in the upper level. This skippingdramatically decreases the approximation accuracy.

A better idea is to initialize the approximation terms using the values obtained from theprevious iterations. By doing so, we do not require the covariance of q\z(w) to be positivedefinite anymore. Instead, we need that of q\k(w) in equation (29) be positive definite,which is easier to satisfy. Now, when the iterations in the lower level start, q\k(w) usuallyhas a positive definite covariance. However, after a few iterations, the covariance of q\k(w)often becomes non-positive definite. This instability seems to occur when approximatingcomplicated functions such as the partition function Z(w).

Figure 2: Flattening the approximation structure. The upper graph shows the two-levelapproximation structure of the methods described in the previous sections. Thelower graph shows the flattened single-level approximation structure.

To address the problem, we flatten the two-level approximation structure by expandingZ(w) in the upper level. Now we focus on q(w), our ultimate interest in training, withoutdirectly dealing with the difficult partition function Z(w). The flattened structure is shownat the bottom of the Figure 2. Specifically, the approximate posterior q(w) has the followingform

q(w) ∝ p0(w)∏

k∈E gk(w)∏k∈E fk(w)

. (36)

9

Equation (36) uses the approximation term fk(w) for each edge rather than using Z(w).It is also possible to interpret the flattened structure from the perspective of the two-levelapproximation structure. That is, each time we partially update Z(w) based on only oneindividual term approximation fk(w), and then refine q(w) before updating Z(w) againbased on another individual term approximation.

With the flattened structure, we update and incorporate gk(w) using standard EP. Also,we have the same updates for q(t) as described in the previous sections. To update andincorporate fk(w) in the denominator, we can use either power EP or transformed EP fortraining. Specifically, by inserting sk(w) = 1/fk(w) and nk = −1 into equations (13) to(15) and (21) to (23) respectively, we obtain the PEP and transformed PEP updates forfk(w). Setting mk = 0 in transformed PEP gives the transformed EP updates.

As a result of structure flattening, we have the following advantages over the previous twoapproaches. First, in our experiments, training can converge easily. Instead of iterativelyapproximating Z(w) in the lower level based on all the individual terms as before, a partialupdate based on only one individual term makes training much more stable. Second, wecan obtain a better “leave-one-out” posterior q\k(w), since we update q(w) more frequentlythan in the previous two cases. A more refined q(w) leads to a better q\k(w), which in turnguides the KL minimization to find a better new q(w). Finally, the flattened approximationstructure allows us to process approximation terms in any order. We found empiricallythat, instead of using a random order, it is better to process the denominator term fk(w)directly after processing the numerator term gk(w) that is associated with the same edgeas fk(w).

Using the flattened structure and pairing the processing of the corresponding numeratorand denominator terms, we can train BCRFs robustly. For example, on the tasks of ana-lyzing synthetic datasets in the experimental section, training with the two-level structurebreaks down by skipping Z(w) or fk(w) and fails to converge. In contrast, trainingwith the flattened structure converges successfully.

We can compute both the power EP and transformed EP updates in O(d2) time as shownin the following section, while a simple implementation of the two-level approximation wouldtake O(d3) time due to the inverse of the covariance matrix of Z(w).

3.4 Efficient low-rank matrix computation

This section presents low-rank matrix computation for the new approximate posterior dis-tributions, the new approximation terms, and the “leave-one-out” posterior distributionsfor the flattened approximation structure.

3.4.1 Computing moments

First, let us define φk(m,n,x) as shorthand of φk(ti = m, tj = n,x), where φk(ti, tj ,x)are feature vectors extracted at edge k = i, j with labels ti and tj on nodes i and j,

10

respectively. Then we have

Ak =

φk(1, 1,x) 0 . . . 00 φk(1, 2,x) 0 . . .0 . . . 0 φk(T, T,x)

(37)

y = ATk w (38)

fk(y) =∑ti,tj

Ψ(yti,tj )q\k(ti)q\k(tj), (39)

where yti,tj = wTφk(ti, tj ,x), and Ψ(·) is the probit function. Clearly, we have q(w) =proj

[fk(y)q\k(w)

]for both PEP and TEP. The calculation of the “leave-one-out” posteri-

ors, q\k(w), q\k(ti), and q\k(tj), will be described in Section 3.4.4. Note that Ak is a T 2

by LT 2 matrix, where L is the length of the feature vector φk. And y is a T 2 by 1 vector,which is shorter than w, whose length is d = LT 2, especially when we have a lot of featuresin φk. In other words, the exact term fk(y) only constrains the distribution q(w) in asmaller subspace. Therefore, it is sensible to use low-rank matrix computation to obtainmw and Vw, the mean and the variance of q(w), based on the “leave-one-out” posteriorq\k(w) ∼ N (mw

\k,Vw\k).

Appendix A details the derivations of the low-rank matrix computation for obtainingnew moments. Here, we simply give the updates:

mw = m\kw + V\k

w Akc (40)

Vw = V\kw −V\k

w AkDATk V\k

w (41)

c = (V\ky )−1

(my −m\k

y

)(42)

D = (V\ky )−1 − (V\k

y )−1(Gy −mymTy )(V\k

y )−1, (43)

where

my =∫

fk(y)yN (y | m\ky ,V\k

y ) dyZ

(44)

=

∑ti,tj

Zti,tjmyti,tjq\k(ti)q\k(tj)

Z(45)

Gy =∫

fk(y)yyTN (y | m\ky ,V\k

y ) dyZ

(46)

=

∑ti,tj

Zti,tjGyti,tjq\k(ti)q\k(tj)

Z(47)

Z =∑ti,tj

q\k(ti, tj)∫

Ψ(yti,tj )N (y | m\ky ,V\k

y ) dy (48)

=∑ti,tj

q\k(ti)q\k(tj)Zti,tj , (49)

11

where

Zti,tj =∫

Ψ(yi,j)N (y | m\ky ,V\k

y ) dy (50)

=∫

Ψ(eTk y)N (y | m\k

y ,V\ky ) dy (51)

= ε + (1− 2ε)Ψ(zti,tj ) (52)

zti,tj =eT

k m\ky√

eTk V\k

y ek + 1(53)

ρti,tj =1√

eTk V\k

y ek + 1

(1− 2εN (zti,tj | 0, 1))ε + (1− 2ε)Ψ(zti,tj )

(54)

myti,tj=

∫Ψ(eT

k y)yN (y | m\ky ,V\k

y ) dyZti,tj

(55)

= m\ky + V\k

y ρti,tjek (56)

Gyti,tj= V\k

y −V\ky ek

(ρti,tj (eTk myti,tj

+ ρti,tj )

eTk V\k

y ek + 1

)eT

k V\ky , (57)

where ek is a vector with all elements being zeros except its kth element being one.We update q(ti) and q(tj) as follows:

q(ti, tj) =Zti,tjq

\k(ti)q\k(tj)Z

(58)

q(ti) =∑tj

q(ti, tj), q(tj) =∑ti

q(ti, tj). (59)

3.4.2 Computing new approximation terms

Given the moment matching update, it is easy to compute the approximation term fk(w) ∝q(w)/q\k(w) in PEP and TEP. Since both q(w) and q\k(w) are Gaussians, fk(w) also hasthe form of a Gaussian (though its covariance may not be positive definite). Suppose Hk

and τ k are the natural parameters of the Gaussian fk(w); Hk is the inverse of its covariancematrix and τ k is the product of the inverse of its covariance matrix and its mean vector.Hk and τ k can be computed as follows:

Hk = (Vw)−1 − (V\kw )−1 = AkhkAT

k (60)

hk =(D−1 −AT

k V\kw Ak

)−1 (61)

τ k = (Vw)−1mw − (V\kw )−1m\k

w = Akµk (62)

µk = c + hkATk mw, (63)

where c and D are defined in equations (42) and (43). Instead of computing Hk and τ k,we only need to compute the projected low-rank matrix hk and the projected vector µk.This saves both computation time and memory. For example, if the range of t is 1, 2,

12

then Ak is a d by 4 matrix where d is the length of w. Since we often have many featuresextracted from observations, d tends to be a large number. In this case, Hk is a huge d byd matrix and τ k is a long d by 1 vector, while hk is only a 4 by 4 matrix and τ k is a 4 by1 vector.

To help the algorithm converge, we use a step size λ between 0 and 1 to update theapproximation term. Then the new approximation term fk(w)new has the parameters hk

and µk:

fk(w)new = fk(w)λfk(w)1−λ (64)

hk = λhk + (1− λ)holdk (65)

µk = λµk + (1− λ)µoldk , (66)

where holdk and µold

k are the parameters of the old approximation term fk(w).Similarly, we compute the approximation terms fk(ti) and fk(tj) for the discrete labels

as follows:

f(ti) = q(ti)/q\k(ti) f(tj) = q(tj)/q\k(tj). (67)

We use a step size ξ to damp the update for the discrete distribution q(t). Notice thatthe approximate distributions of the parameters and the labels, q(w) and q(t), depend oneach other. The new q(w) or q(t) relies on both the old q(w) and the old q(t). Especially,a crude approximation q(t) can directly affect the quality of the new mean parameters ofq(w). Therefore, we set the step size ξ for q(t) to be smaller than the step size λ for q(w),such that a more refined q(w) is obtained before updating q(t).

Given the step size ξ, the new approximation terms for labels, fk(ti)new and fk(tj)new,can be computed as follows:

fk(ti)new = f(ti)ξ(fk(ti))1−ξ fk(tj)new = f(tj)ξ(fk(tj))1−ξ, (68)

where fk(ti) and fk(tj) are the old approximation terms for edge k = i, j.

3.4.3 Including approximation terms into posterior

Given the new approximation term fk(w)new, PEP updates the posterior q(w) ∼ N (mw,Vw)as follows:

q(w) = q\k(w)(fk(w))2/fk(w)new (69)

ξk = 2holdk − hk (70)

ηk = 2µoldk − µk (71)

Vw = V\kw −V\k

w Ak(ξ−1k + AT

k V\kw Ak)−1AT

k V\kw (72)

mw = m\kw −V\k

w Ak(ξ−1k + AT

k V\kw Ak)−1AT

k m\kw + VwAkηk. (73)

Given fk(w)new, TEP updates the posterior q(w) ∼ N (mw,Vw) as follows:

q(w) = q\k(w)/fk(w)new (74)

Vw = V\kw −V\k

w Ak(−h−1k + ATV\k

w A)−1ATk V\k

w (75)

mw = m\kw −V\k

w Ak(−h−1k + ATV\k

w A)−1ATk m\k

w −VwAkµk. (76)

13

Also, given fk(ti)new and fk(tj)new, the new marginal distributions, q(ti) and q(tj), canbe computed as follows:

q(ti) = q\k(tj)fk(ti)new q(tj) = q\k(tj)fk(tj)new. (77)

3.4.4 Deleting old approximation terms

For PEP, the parameters of the “leave-one-out” posterior q\k(w) can be efficiently computedas follows:

q\k(w) = q(w)/fk(w) (78)

V\kw = (V−1

w −AkhkATk )−1 (79)

= Vw + VwAk(h−1k −AT

k VwAk)−1(VwAk)T (80)

m\kw = (V\k

w )−1(V−1w mw − uk) (81)

= mw + V\kw Ak(hkAT

k mw − µk). (82)

For TEP, q\k(w) can be efficiently computed as follows:

q\k(w) = q(w)fk(w) (83)

V\kw = (V−1

w + AkhkATk )−1 (84)

= Vw −VwAk(h−1k + AT

k VwAk)−1(VwAk)T (85)

m\kw = (V\k

w )−1(V−1w mw + uk) (86)

= mw + V\kw Ak(µk − hkAT

k mw). (87)

Similarly, q\k(t) can be computed as follows:

q\k(tj) = q(ti)/fk(ti) (88)

q\k(tj) = q(tj)/fk(tj). (89)

4. Inference on BCRFs

Unlike traditional classification problems, where we have a scalar output for each input, aBCRF jointly labels all the hidden vertices in an undirected graph. Given the posterior q(w),we present two ways for inference on BCRFs: Bayesian-point-estimate (BPE) inference andBayesian model averaging. While the BPE inference uses the posterior mean mw as themodel parameter vector for inference, Bayesian model averaging integrates over the posteriorq(w) instead of using a fixed parameter vector. The advantage of model averaging over theBPE approach is that model averaging makes full use of the data by employing not onlythe estimated mean of the parameters, but also the estimated uncertainty (variance). Thebenefit of model averaging, which combines predictions across all parameter settings, isthe elimination of the overfitting problem. Since exact model averaging is intractable, wepropose an efficient way to approximate it. The following sections describe in detail theBPE inference and approximate model averaging for BCRFs.

14

4.1 BPE inference

Given a new test graph x?, a BCRF trained on (x, t) can approximate the predictivedistribution as follows:

p(t? | x?, t,x) =∫

p(t? | x?,w)p(w | t,x) dw (90)

≈ p(t? | x?,mw) (91)

=1

Z(mw)

∏i,j∈E

gi,j(t?i , t?j ,x;mw). (92)

where mw is the mean of the approximate posterior q(w).Given mw, we are interested in finding the maximum marginal probability (MMP)

configuration of the test graph:

tMMPi = argmaxt?i

P (t?i | x?,mw), ∀i ∈ V. (93)

To obtain the MMP configuration, one can use the junction tree algorithm to get the exactmarginals of the labels if the tree width of the graph is not very large. If the tree width isrelatively large, one can use loopy belief propagation or tree-structured EP (Minka & Qi,2003) to approximate the marginals. After obtaining the exact or approximate marginals,then one can find the MMP configuration by choosing the label with the maximal marginalfor each node.

For applications where a single labeling error on a vertex of a graph can dramaticallychange the meaning of the whole graph, such as word recognition, we may want to find themaximum joint probability (MJP) configuration. While the maximum marginal configura-tion will choose the most probable individual labels, the MJP configuration finds a globallycompatible label assignment. Depending on the application, we can use MMP or MJPconfigurations accordingly. In our experiments below, we always find MMP configurations.

4.2 Approximate model averaging

Given a new test graph x?, a BCRF trained on (x, t) can approximate the predictivedistribution as follows:

p(t? | x?, t,x) =∫

p(t? | x?,w)p(w | t,x) dw (94)

≈∫

p(t? | x?,w)q(w) dw (95)

=∫

q(w)Z(w)

∏i,j∈E

gi,j(t?i , t?j ,x

?;w) dw, (96)

where q(w) is the approximation of the true posterior p(w | t,x). Since the exact integralfor model averaging is intractable, we need to approximate this integral.

We can approximate the predictive posterior term by term as in EP. But typical EPupdates will involve the updates over both the parameters and the labels. That is muchmore expensive than using a point estimate for inference, which involves only updates of the

15

labels. To reduce the computational complexity, we propose the following approach. It isbased on the simple fact that without any labels, the test graph does not offer informationabout the parameters in the conditional model. Mathematically, after integrating out thelabels, the posterior distribution q(w) remains the same:

q(w)new =∫

1Z(w)

∏i,j∈E

gi,j(t?i , t?j ,x;w)q(w) dt = q(w). (97)

Since q(w) is unchanged, we can update only q(t?) when incorporating one term gi,j(t?i , t?j ,x;w).

Specifically, given the posterior q(w) ∼ N (mw,Vw), we use the factorized approximationq(t?) =

∏i q(t

?i ) and update q(t?i ) and q(t?j ) as follows:

zt?i ,t?j=

φTk m\k

w√φT

k V\kw φk + 1

(98)

Zt?i ,t?j= ε + (1− 2ε)Ψ(zt?i ,t?j

) (99)

q(t?i , t?j ) =

Zt?i ,t?jq\k(t?i )q

\k(t?j )

Z(100)

q(t?i ) =∑tj

q(t?i , t?j ) (101)

q(t?j ) =∑ti

q(t?i , t?j ), (102)

where φk is the feature vector extracted at the kth edge. Note that the deletion and inclusionsteps for q(t?) are similar to those described in the previous section on BCRF training.

Compared to the BPE inference, the above algorithm is almost the same except for

an importance difference: approximate model averaging uses φTk m

\kw√

φTk V

\kw φk+1

in its potential

function, while the BPE inference simply uses φTk m\k

w . In other words, the approximatemodel averaging algorithm normalizes the potentials by the posterior variance projected onfeature vectors, while the BPE inference does not.

5. Experimental results

This section compares BCRFs with CRFs trained by maximum likelihood (ML) and maxi-mum a posteriori (MAP) methods on several synthetic datasets, a document labeling task,and an ink recognition task, demonstrating BCRFs’ superior test performance. ML- andMAP-trained CRFs include CRFs with probit potential functions (4) and with exponen-tial potential functions (3). We applied both PEP and TEP to train BCRFs with theflattened structure. To avoid divergence, we used a small step size. In the following para-graphs, BCRF-PEP and BCRF-TEP denote CRFs trained by PEP and TEP respectively,with model averaging for inference. BPE-CRF-PEP and BPE-CRF-TEP denote variants ofBCRFs, i.e., CRFs using the BPE inference. For ML- and MAP-trained CRFs, we appliedthe junction tree algorithm for inference. We used probit models as the default potential

16

functions by setting ε = 0 in equation (4), unless we explicitly stated that we used expo-nential potentials. To compare the test performance of these algorithms, we counted thetest errors on all vertices in the test graphs.

5.1 Classifying synthetic data

On synthesized datasets, we examined the test performance of BCRFs and MAP-trainedprobit CRFs for different sizes of training sets and different graphical structures. Thelabels of the vertices in the synthetic graphs are all binary. The parameter vector w has24 elements. The feature vectors φi,j are randomly sampled from one of four Gaussians.We can easily control the discriminability of the data by changing the variance of theGaussians. Based on the model parameter vector and the sampled feature vectors, we cancompute the joint probability of the labels as in equation (1) and randomly sample thelabels. For BCRFs, we used a step size of 0.8 for training. For MAP-trained CRFs, we useda quasi-Newton method with the BFGS approximation of Hessians (Sha & Pereira, 2003).

5.1.1 Different training sizes for loopy CRFs

Each graph has 3 vertices in a loop. In each trial, 10 loops were sampled for training and1000 loops for testing. The procedure was repeated for 10 trials. A Gaussian prior withmean 0 and diagonal variance 5I was used for both BCRF and MAP CRF training. Werepeated the same experiments by increasing the number of training graphs from 10 to 30to 100.

Algorithm Test error rates10 train. 30 train. 100 train.

ML-CRF 27.96 ? 22.80 ? 12.25 ?

MAP-CRF 16.36 ? 12.40 ? 9.60 ?

BPE-CRF-TEP 10.66 9.85 9.21BPE-CRF-PEP 10.67 9.76 9.14

BCRF-TEP 10.51 9.73 9.23BCRF-PEP 10.59 9.71 9.23

Table 1: Test error rates on synthetic datasets with different numbers of training loops. Theresults are averaged over 10 runs. Each run has 1000 test loops. The stars indicatethat the error rates are worse than those of BCRF-PEP at 98% significance.

The results are visualized in Figure 3 and summarized in Table 1. According to t-tests,which have stronger test power than the error bars in Figure 3, both types of BCRFsoutperform ML- and MAP-trained CRFs in all cases at 98% statistical significance level.PEP- and TEP-trained BCRFs achieve comparable test accuracy. Furthermore, approxi-mate model averaging improves the test performance over the BPE inference with 86% and91% statistical significance for the case of 10 and 30 training graphs, respectively. Whenmore training graphs are available, MAP-trained CRFs and BCRFs perform increasinglysimilarly, though still statistically differently. The reason is that the posterior is narrowerthan in the case of fewer training graphs, such that the posterior mode is closer to the pos-

17

10 30 50 100

10

15

20

25

30

Number of Training graphs

Tes

t err

or r

ate

(%)

ML−CRFMAP−CRFBCRF−PEPBCRF−TEP

Figure 3: Test error rates for ML- andMAP-trained CRFs and BCRFson synthetic datasets with dif-ferent numbers of training loops.The results are averaged over 10runs. Each run has 1000 testloops. Non-overlapping of errorbars, the standard errors scaledby 1.64, indicates 95% signifi-cance of the performance differ-ence.

10 30 50 100

22

24

26

28

30

32

34

36

Number of Training graphs

Tes

t err

or r

ate

(%)

ML−CRFMAP−CRFBCRF−PEPBCRF−TEP

Figure 4: Test error rates for ML- andMAP-trained CRFs and BCRFson synthetic datasets with differ-ent numbers of training chains.The results are averaged over10 runs. Each run has 1000test chains. Though the errorbars overlap a little bit, BCRFsstill outperform ML- and MAP-trained CRFs at 95% signifi-cance according to t-tests.

terior mean. In this case, approximate model averaging does not improve test performance,since the inference becomes more sensitive to the error introduced by the approximation inmodel averaging.

5.1.2 Different training sizes for chain-structured CRFs

We then changed the graphs to be chain-structured. Specifically, each graph has 3vertices in a chain. In each trial, 10 chains were sampled for training and 1000 chainsfor testing. The procedure was repeated for 10 trials. A Gaussian prior with mean 0and diagonal variance 5I was used for both BCRF and MAP CRF training. Then werepeated the same experiments by increasing the number of training graphs from 10 to 30to 100. The results are visualized in Figure 4 and summarized in Table 2. Again, BCRFsoutperform ML- and MAP-trained CRFs with high statistical significance. Also, modelaveraging significantly improves the test accuracy over the BPE inference, especially whenthe size of training sets is small.

18

Algorithm Test error rates10 train. 30 train. 100 train.

ML-CRF 32.54 ? 30.36 ? 23.09 ?

MAP-CRF 29.12 ? 24.39 ? 21.48BPE-CRF-TEP 25.74 ? 23.06 ? 21.42BPE-CRF-PEP 25.48 ? 22.52 ? 21.34

BCRF-TEP 25.01 22.53 ? 21.41 ?

BCRF-PEP 24.94 22.21 21.27

Table 2: Test error rates for ML- and MAP-trained CRFs and BCRFs on synthetic datasetswith different numbers of training chains. The results are averaged over 10 runs.Each run has 1000 test chains. The stars indicate that the error rates are worsethan those of BCRF-PEP at 98% significance.

5.2 Labeling FAQs

We compared BCRFs with MAP-trained probit and exponential CRFs on the frequentlyasked questions (FAQ) dataset, introduced by McCallum et al. (2000). The dataset consistsof 47 files, belonging to 7 Usenet newsgroup FAQs. Each file has multiple lines, which canbe the header (H), a question (Q), an answer (A), or the tail (T). Since identifying theheader and the tail is relatively easy, we simplify the task to label only the lines that arequestions or answers. To save time, we truncated all the FAQ files such that no file has morethan 500 lines. On average, the truncated files have 479 lines. The dataset was randomlysplit 10 times into 19 training and 28 test files. Each file was modeled by a chain-structuredCRF, whose vertices correspond to lines in the files. The feature vector for each edge ofa CRF is simply the concatenation of feature vectors extracted at the two neighboringvertices, which are the same set of 24 features as McCallum et al. (2000) used, for example,begins-with-number, begins-with-question-word, and indented-1-to-4.

The test performance is summarized in Table 3 and visualized in Figure 5. Accordingto t-tests, BCRFs outperform ML- and MAP-trained CRFs with probit or exponentialpotentials on the truncated FAQs dataset at 98% statistical significance level.

Finally, we compared PEP- and TEP-trained BCRFs. On one hand, Figures 6 and 7show that TEP-trained BCRFs converged steadily, while the change in BCRF parametersin PEP training oscillated. On the other hand, as shown in Figure 5 and Table 3, PEP-trained BCRFs achieved more accurate test performance than TEP-trained BCRFs, thoughTEP-trained BCRFs also outperformed MAP-trained CRFs.

5.3 Parsing handwritten organization charts

We compared BCRFs and MAP-trained CRFs on hand-drawn diagram recognition, a real-world application of computer vision. The goal is to distinguish containers from connectorsin drawings of organization charts. A typical organization chart is shown in Figure 8. CRFsenable joint classification of all ink on a page. The classification of one ink fragment caninfluence the classification of others, so that context is exploited.

We break the task into three steps:

19

BCRF−PEP

BCRF−TEP

MAP−EXP−CRF

MAP−CRF

0.5 1.0 1.5

Test error rate (%)

Figure 5: Test error rates of different algorithms on FAQ dataset. The results are averagedover 10 random splits. The error bars are the standard errors multiplied by 1.64.Both types of BCRFs outperform MAP-trained CRFs with probit and expontialpotentials at 95% statistical significance level, though the error bar of BCRF-TEPoverlaps a little bit with those of MAP-trained CRFs.

20 25 30 35 40 45 50 55 600.26

0.28

0.3

0.32

0.34

0.36

0.38

0.4

0.42

0.44

iterations

||wt−

wt−

1|| 1

Figure 6: Change in parameters along iter-ations of PEP. Though decreas-ing overall, the change in param-eters oscillates. This slows downthe convergence speed.

20 25 30 35 40 45 50 55 600.105

0.11

0.115

0.12

0.125

0.13

0.135

0.14

0.145

iterations

||wt−

wt−

1|| 1

Figure 7: Change in parameters along it-erations of TEP. The changein parameters monotonically de-creases until the training con-verges.

1. Subdivision of pen strokes into fragments,

2. Construction of a conditional random field on the fragments,

3. Training and inference on the network.

The input is electronic ink recorded as sampled locations of the pen, and collected intostrokes separated by pen-down and pen-up events. In the first step, strokes are dividedinto simpler components called fragments. Fragments should be small enough to belongto a single container or connector. In contrast, strokes can occasionally span more thanone shape, for example when a user draws a container and a connector without lifting the

20

Algorithm Test error ratesMAP-CRF 1.42 ?

MAP-Exp-CRF 1.36 ?

BPE-CRF-TEP 0.68BPE-CRF-PEP 0.72 ?

BCRF-TEP 0.67BCRF-PEP 0.54

Table 3: Test error rates on FAQ dataset. The stars indicate the error rates are worse thanthat of BCRF-PEP with model averaging at 98% significance. 10 random splits ofthe dataset were used. Each split has 19 sequences for training and 28 for testing.The average length of the truncated FAQ chains is 479 lines.

1

2

3

45

6

7

89

10

11

12

13

14 15

16

17

18

19

20

21

22

23

24

25

26

2728

29

30

31

32

33

34

35

36

37

38

39

Figure 8: An organization chart. Stroke fragments are numbered and their endpointsmarked. In the right middle region of the chart, fragments 18, 21, 22, and 29from both containers and connectors compose a box, which however is not acontainer. This region is locally ambiguous, but globally unambiguous.

pen. One could choose the fragments to be individual sampled dots of ink, however, thiswould be computationally expensive. We choose fragments to be groups of ink dots withina stroke that form straight line segments (within some tolerance) (Figure 8).

In the second step, we construct a conditional random field on the fragments. Each inkfragment is represented by a vertex in the graph (Figure 9). The vertex has an associatedlabel variable ti, which takes on the values 1 (container) or 2 (connector). Potential functionsquantify the relations between labels given the data. The features used in each potentialfunction include histograms of angles and lengths, as well as templates to detect importantconfigurations, e.g., T-junctions formed by two neighboring fragments.

Finally, we train the constructed CRFs and then predict labels of new organizationcharts using the parameter posteriors from the rained CRFs.

The data set consists of 23 organization charts. We randomly split them 10 times into 14training and 9 test graphs, and ran BCRFs, ML- and MAP-trained CRFs on each partition.We used the same spherical Gaussian prior p0(w) ∼

∏iN (wi | 0, 5) on the parameters w

21

Figure 9: A CRF superimposed on part of the chart from Figure 8. There is one vertex(shown as circle) per fragment, and edges indicate potentials between neighboringfragments.

Figure 10: Parsing the organization chart by BCRF-PEP. The trained BCRF-PEP labelsthe containers as bold lines and labels connectors as thin lines. Clearly, byexploiting contextual information in the graph, the BCRF-PEP accurately labelsthe graph even in the difficult region to the middle-right, where there is localambiguity.

for both BCRF and MAP CRF training. The same features were also used. Figure 10 showsthe parsed organization chart in Figure 8. The overall test performance is summarized inTable 4. Again, BCRFs outperforms the other approaches.

5.4 Comparing computational complexity

In general, the computational cost of ML and Bayesian training depends on many fac-tors. On the one hand, for ML and MAP training, the cost of the BFGS algorithm isO(d maxd, |E|) per iteration, where d is the length of w, and |E| is the total number ofedges in training graphs. The cost of BCRF training by PEP and TEP is O(|E|d2) per it-eration. Therefore BCRF training is about as mind, |E| times expensive as ML and MAP

22

Algorithm Test error ratesML-CRF 10.7± 1.21 ?

MAP-CRF 6.00± 0.64 ?

ML-Exp-CRF 10.1± 1.21 ?

MAP-Exp-CRF 5.20± 0.79 BPE-CRF-PEP 5.93± 0.59 ?

BCRF-PEP 4.39± 0.62

Table 4: Test error rates on organization charts. The results are averaged over 10 randompartitions. The stars indicate that the error rates are worse than that of BCRF-TEP at 98% significance. The diamond indicates that BCRF-PEP outperformsMAP-trained CRFs with exponential potentials at 85% significance.

training per iteration. On the other hand, BCRF training generally takes much fewer iter-ations to converge than BFGS training, using comparable stopping criteria. For example,even with one iteration of BCRF training, we can obtain a decent approximate posteriorof the parameters, while the estimate after the first BFGS iteration will be far from thetrue ML and MAP estimates. Moreover, in BFGS training there is an embedded inferenceproblem to obtain the needed statistics for optimization. This inference problem can berelatively expensive and cause a big hidden constant in O(d maxd, |E|), the BFGS costper iteration. For example, for loopy graphs where an iterative approximate inference algo-rithm is used, we have additional iterations embedded in each BFGS iteration. For BCRFtraining, however, with flattened approximation structures, we do not have embedded it-erations. On synthetic data where the number of edges is limited, BCRF training is atleast as efficient as BFGS training. For example, given the 10 training sets in Section 5.1.2,each of which has 30 chain-structured CRFs, PEP, TEP, and BFGS training on a Pentium 43.1GHz computer used 8.81, 5.58, and 21.16 seconds on average, respectively. On real-worlddata, many factors play together to determine the efficiency of a training method. On arandom split of the FAQ dataset where the total number of edges in graphs is more than8000, it took about 9 and 2 hours for BFGS to train probit and exponential CRFs, whileit took PEP and TEP training about 6 and 4 hours, respectively. On both synthetic andreal-world datasets, TEP were more efficient than PEP for BCRF training.

6. Conclusions

This paper has presented BCRFs, a new approach to training and inference on conditionalrandom fields. In training, BCRFs use the TEP method or the PEP method to approximatethe posterior distribution of the parameters. Also, BCRFs flatten approximation structuresto increase the algorithmic stability, efficiency, and prediction accuracy. In testing, BCRFsapproximate model averaging. On synthetic data, FAQ files, and handwritten ink recogni-tion, we compared BCRFs with ML- and MAP-trained CRFs. In almost all the experiments,BCRFs outperformed ML- and MAP-trained CRFs significantly. Also, approximate modelaveraging enhances the test performance over the BPE inference. Regarding the differencebetween PEP and TEP training, PEP-trained BCRFs achieve comparable or more accurate

23

test performance than TEP-trained BCRFs, while TEP training tends to have a faster con-vergence speed than PEP training. Moreover, we need a better theoretical understandingof TEP.

Compared to ML- and MAP-trained CRFs, BCRFs approximate model averaging overthe posterior distribution of the parameters, instead of using a MAP or ML point estimateof the parameter vector for inference. Furthermore, BCRF hyperparameters, e.g., the priorvariance for the parameters w, can be optimized in a principled way, such as by maximizingmodel evidence, with parameters integrated out. Similarly, we can use predictive automaticrelevance determination (Qi et al., 2004) to do feature selection with BCRFs and to obtainsparse kernelized BCRFs.

Finally, the techniques developed for BCRFs have promise for other machine learningproblems. One example is Bayesian learning in Markov random fields. If we set all featuresφk’s of a CRF to be 1, then the CRF will reduce to a Markov random field. Due to thisconnection, BCRF techniques can be useful for learning Markov random fields. Anotherexample is Bayesian multi-class discrimination. If we have only one edge in a BCRF, thenlabeling the two neighboring vertices amounts to classifying the edge into multiple classes.Therefore, we can directly use BCRF techniques for multi-class discrimination by treatingeach data point as an edge in a CRF with two vertices, which encode the label of the originaldata point.

Acknowledgment

Y. Qi was supported by the Things That Think consortium at the MIT Media laboratory beforehe moved to CSAIL. Thanks to Andrew McCallum for providing the features used in the FAQdataset and Fei Sha for offering the maximum likelihood training code for CRFs with exponentialpotentials.

24

Appendix A. Low-rank matrix computation for first and second ordermoments

Define y = ATk w, m\k

y = ATk m\k

w , and Vy = ATk V\k

w Ak. It follows that

∇m log Z(m\kw ,V\k

w ) = ∇m log∫

f(ATk w)N (w|m\k

w ,V\kw ) dw (103)

=1Z

∫∇mf(AT

k w)N (w|m\kw ,V\k

w ) dw (104)

=1Z

∫f(y)∇mN (y|m\k

y ,V\ky ) dy (105)

=1Z

∫f(y)(y −m\k

y )T(V\ky )−1N (y|m\k

y ,V\ky )∇mm\k

y dy (106)

=1Z

∫f(y)(y −m\k

y )TN (y|m\ky ,V\k

y )(V\ky )−1AT

k dy (107)

= cTATk (108)

where c = (V\ky )−1(my −m\k

y ), and my = 1Z

∫f(y)yN (y|m\k

y ,V\ky ) dy is the new mean of

y. Also, we have

∇v log Z(m\kw ,V\k

w ) =1Z

∫f(y)∇vN (y|m\k

y ,V\ky ) dy (109)

= −12Ak

1Z

∫f(y)N (y|m\k

y ,V\ky )

((V\k

y )−1−

(V\ky )−1(y −m\k

y )(y −m\ky )T(V\k

y )−1)AT

k dy (110)

= −12Ak

((V\k

y )−1 − (V\ky )−1(Gy − 2my(m\k

y )T+

m\ky (m\k

y )T)(V\ky )−1

)AT

k , (111)

where Gy = 1Z

∫f(y)yyTN (y|m\k

y ,V\ky ) dy is the new variance of y.

From (108) and (111), it follows that

(∇m∇Tm − 2∇v) log Z(m\k

w ,V\kw ) = AkDAT

k , (112)

where

D = (V\ky )−1 − (V\k

y )−1(Gy −mymTy )(V\k

y )−1. (113)

The new mean and variance of the parameters w are

mw = m\kw + V\k

w ∇Tm log Z(m\k

w ,V\kw )

)(114)

= m\kw + V\k

w Akc (115)

Vw = V\kw −V\k

w (∇m∇Tm − 2∇v) log Z(m\k

w ,V\kw )V\k

w (116)

= V\kw −V\k

w AkDATk V\k

w . (117)

25

References

Amari, S. (1985). Differential-geometrical methods in statistics. Springer-Verlag.

Herbrich, R., Graepel, T., & Campbell, C. (1999). Bayes point machine: Estimating theBayes point in kernel space. IJCAI Workshop SVMs (pp. 23–27).

Kumar, S., & Hebert, M. (2004). Discriminative fields for modeling spatial dependenciesin natural images. Advances in Neural Information Processing Systems 16.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilisticmodels for segmenting and labeling sequence data. Proc. 18th International Conf. onMachine Learning (pp. 282–289). Morgan Kaufmann, San Francisco, CA.

Lafferty, J., Zhu, X., & Liu, Y. (2004). Kernel conditional random fields: Representationand clique selection. Proc. International Conf. on Machine Learning.

McCallum, A. (2003). Efficiently inducing features of conditional random fields.Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence.

McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models forinformation extraction and segmentation. Proc. 17th International Conf. on MachineLearning (pp. 591–598). Morgan Kaufmann, San Francisco, CA.

Minka, T., & Qi, Y. (2003). Tree-structured approximations by expectation propagation.Advances in Neural Information Processing System 16. MIT Press.

Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference.Uncertainty in AI’01. http://www.stat.cmu.edu/~minka/papers/ep/.

Minka, T. P. (2004). Power EP. http://www.stat.cmu.edu/~minka/.

Minka, T. P., & Lafferty, J. (2002). Expectation propagation for the generative aspectmodel. Proc UAI.

Murray, I., & Ghahramani, Z. (2004). Bayesian learning in undirected graphical models:Approximate MCMC algorithms. UAI.

Qi, Y., Minka, T. P., Picard, R. W., & Ghahramani, Z. (2004). Predictive automaticrelevance determination by expectation propagation. Proceedings of Twenty-firstInternational Conference on Machine Learning.

Sha, F., & Pereira, F. (2003). Parsing with conditional random fields (Technical ReportMS-CIS-02-35). University of Pennsylvania.

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. In S. Thrun,L. Saul and B. Scholkopf (Eds.), Advances in neural information processing systems 16.

Wiegerinck, W., & Heskes, T. (2002). Fractional belief propagation. NIPS 15.

26

Bayesian Conditional Random Fields -...

Documents

Transcript of Bayesian Conditional Random Fields -...