# Scalable Variational Gaussian Process Classi cationproceedings.mlr.press/v38/hensman15.pdfScalable...

date post

05-Aug-2020Category

## Documents

view

12download

0

Embed Size (px)

### Transcript of Scalable Variational Gaussian Process Classi cationproceedings.mlr.press/v38/hensman15.pdfScalable...

Scalable Variational Gaussian Process Classification

James Hensman Alexander G. de G. Matthews Zoubin GhahramaniDepartment of Computer Science

The University of SheffieldDepartment of Engineering

The University of CambridgeDepartment of Engineering

The University of Cambridge

Abstract

Gaussian process classification is a popularmethod with a number of appealing proper-ties. We show how to scale the model withina variational inducing point framework, out-performing the state of the art on benchmarkdatasets. Importantly, the variational formu-lation can be exploited to allow classificationin problems with millions of data points, aswe demonstrate in experiments.

1 Introduction

Gaussian processes (GPs) provide priors over functionsthat can be used for many machine learning tasks. Inthe regression setting, when the likelihood is Gaussian,inference can be performed in closed-form using linearalgebra. When the likelihood is non-Gaussian, such asin GP classification, the posterior and marginal like-lihood must be approximated. Kuss and Rasmussen[2005] and Nickisch and Rasmussen [2008] provide ex-cellent comparisons of several approximate inferencemethods for GP classification. More recently, Opperand Archambeau [2009] and Khan et al. [2012] consid-ered algorithmic improvements to variational approx-imations in non-conjugate GP models.

The computational cost of inference in GPs is O(N3)in general, where N is the number of data. In theregression setting, there has been much interest inlow-rank or sparse approaches to reduce this computa-tional complexity: Quiñonero-Candela and Rasmussen[2005] provides a review. Many of these approximationschemes rely on the use of a series of inducing points,which can be difficult to select. Titsias [2009] sug-gests a variational approach (see section 2) which pro-vides an objective function for optimizing these points.

Appearing in Proceedings of the 18th International Con-ference on Artificial Intelligence and Statistics (AISTATS)2015, San Diego, CA, USA. JMLR: W&CP volume 38.Copyright 2015 by the authors.

This variational idea was extended by Hensman et al.[2013], who showed how the variational objective couldbe reformulated with additional parameters to enablestochastic optimization, which allows GPs to be fittedto millions of data.

Despite much interest in approximate GP classifica-tion and sparse approximations, there has been lit-tle overlap of the two. The approximate inferenceschemes which deal with the non-conjugacy of the like-lihood generally scale with O(N3), as they require fac-torization of the covariance matrix. Two approacheswhich have addressed these issues simultaneously arethe IVM [Lawrence et al., 2003] and Generalized FITC[Naish-Guzman and Holden, 2007] which both haveshortcomings as we shall discuss.

It is tempting to think that sparse GP classification issimply a case of combining a low-rank approximationto the covariance with one’s preferred non-conjugateapproximation, but as we shall show this does not nec-essarily lead to an effective method: it can be verydifficult to place the inducing input points, and scala-bility of the method is usually restricted by the com-plexity of matrix-matrix multiplications.

For these reasons, there is a strong case for a non-conjugate sparse GP scheme which provides a vari-ational bound on the marginal likelihood, combiningthe scalability of the stochastic optimization approachwith the ability to optimize the positions of the induc-ing inputs. Furthermore, a variational approach wouldallow for integration of the approximation within otherGP models such as GP regression networks [Wilsonet al., 2012], latent variable models [Lawrence, 2004]and deep GPs [Damianou and Lawrence, 2013], as thevariational objective could be optimized as part of sucha model without fear of overfitting.

The rest of the paper is arranged as follows. In sec-tion 2, we briefly cover some background materialand existing work. In section 3, we show how vari-ational approximations to the covariance matrix canbe used post-hoc to provide variational bounds withnon-Gaussian likelihoods. These approaches are notentirely satisfactory, and so in section 4 we provide a

351

Scalable Variational Gaussian Process Classification

variational approach which does not first approximatethe covariance matrix, but delivers a variational bounddirectly. In section 5 we compare our proposals withthe state of the art, as well as demonstrating empiri-cally that our preferred method is applicable to verylarge datasets through stochastic variational optimiza-tion. Section 6 concludes.

2 Background

Gaussian Process Classification Gaussian pro-cess priors provide rich nonparametric models of func-tions. To perform classification with this prior, theprocess is ‘squashed’ through a sigmoidal inverse-linkfunction, and a Bernoulli likelihood conditions thedata on the transformed function values. See Ras-mussen and Williams [2006] for a review.

We denote the binary class observations as y ={yn}Nn=1, and then collect the input data into a de-sign matrix X = {xn}Nn=1. We evaluate the covariancefunction at all pairs of input vectors to build the co-variance matrix Knn in the usual way, and arrive ata prior for the values of the GP function at the inputpoints: p(f) = N (f |0,Knn).We denote the probit inverse link function asφ(x) =

∫ x−∞N (a | 0, 1)da and the Bernoulli distribu-

tion B(yn |φ(fn)) = φ(fn)yn(1−φ(fn))1−yn . The jointdistribution of data and latent variables becomes

p(y, f) =N∏n=1

B(yn |φ(fn)) N (f |0,Knn) . (1)

The main object of interest is the posterior over func-tion values p(f |y), which must be approximated. Wealso require an approximation to the marginal likeli-hood p(y) in order to optimize (or marginalize) pa-rameters of the covariance function. An assortment ofapproximation schemes have been proposed (see Nick-isch and Rasmussen [2008] for a comparison), but theyall require O(N3) computation.

Sparse Gaussian Processes for Regression Thecomputational complexity of any Gaussian processmethod scales with O(N3) because of the need to in-vert the covariance matrix K. To reduce the com-putational complexity, many approximation schemeshave been proposed, though most focus on regressiontasks, see Quiñonero-Candela and Rasmussen [2005]for a review. Here we focus on inducing point meth-ods [Snelson and Ghahramani, 2005], where the latentvariables are augmented with additional input-outputpairs Z,u, known as ‘inducing inputs’ and ‘inducingvariables’.

The random variables u are points on the function in

exactly the same way as f , and so the joint distributioncan be written

p(f ,u) = N([

fu

] ∣∣∣0, [ Knn KnmK>nm Kmm

])(2)

where Kmm is formed by evaluating the covariancefunction at all pairs of inducing inputs points zm, zm′ ,and Knm is formed by evaluating the covariance func-tion across the data input points and inducing inputspoints similarly. Using the properties of a multivariatenormal distribution, the joint can be re-written as

p(f ,u) = p(f |u)p(u) (3)= N (f |KnmK−1mmu,Knn −Qnn)N (u |0, Kmm)

with Qnn = KnmK−1mmK

>nm. The joint distribution

now takes the form

p(y, f ,u) = p(y | f)p(f |u)p(u) . (4)To obtain computationally efficient inference, integra-tion over f is approximated. To obtain the popularFITC method (in the case of Gaussian likelihood), afactorization is enforced: p(y |u) ≈ ∏n p(yn |u). Toget a variational approximation, the following inequal-ity is used

log p(y |u) ≥ Ep(f |u) [log p(y | f)] , log p̃(y |u) . (5)Substituting this bound on the conditional into thestandard expression p(y) =

∫p(y|u)p(u)du gives a

tractable bound on the marginal likelihood for theGaussian case [Titsias, 2009]:

log p(y) ≥ logN (y |0,KnmK−1mmK>nm + σ2I)

− 12σ2

tr(Knn −Qnn) ,(6)

where σ2 is the variance of the Gaussian likelihoodterm. This bound on the marginal likelihood can thenbe used as an objective function in optimizing thecovariance function parameters as well as the induc-ing input points Z. The bound becomes tight whenthe inducing points are the data points: Z = X soKnm = Kmm = Knn and (6) becomes equal the truemarginal likelihood logN (y |0, Knn + σ2I).Computing this bound (6) and its derivatives costsO(NM2). A more computationally scalable boundcan be achieved by introducing additional variationalparameters [Hensman et al., 2013]. Noting that (6) im-plies an approximate posterior p̃(u |y), we introducea variational distribution q(u) = N (u |m,S) to ap-proximate this distribution, and applying a standardvariational bound, obtain:

log p(y) ≥ logN (y |KnmK−1mmm, σ2I)

− 12σ2

tr(KnmK−1mmSK

−1mmKmn) (7)

− 12σ2

tr(Knn −Qnn)−KL[q(u)||p(u)] .

352

James Hensman, Alexander G. de G. Matthews, Zoubin Ghahramani

This bound has a unique optimum in terms of the vari-ational parameters m,S, at which point it is tight tothe original sparse GP bound (6). The advantage ofthe representation in (7) is that it can be optimized ina stochastic [Hensman et al., 2013] or distributed [Daiet al., 2014, Gal et al., 2015] fashion.

Of course for the Bernoulli likelihood, the required in-tegrals for (6) and (7) are not tractable, but we willbuild on them both in subsequent sections to buildsparse GP classifiers.

Related work The informative vector machine(IVM) [Lawrence et al., 2003] is the first work to ap-proach sparse GP classification to our knowledge. Theidea is to combine assumed density filtering with a se-lection heuristic to pick points from the data X to actas inducing points Z: the inducing variables u are thena subset of the latent function variables f .

The IVM offers superior performance to support vectormachines [Lawrence et al., 2003], along with a proba-bilistic interpretation. However we might expect bet-ter performance by relaxing the condition that the in-ducing points be a sub-set of the data, as is the case forregression [Quiñonero-Candela and Rasmussen, 2005].

Subsequent work on sparse GP classification [Naish-Guzman and Holden, 2007] removed the restriction ofselecting Z to be a subset of the data X, and ostensiblyimproved over the assumed density filtering scheme byusing expectation propagation (EP) for inference.

Naish-Guzman and Holden [2007] noted that when us-ing the FITC approximation for a Gaussian likelihood,the equivalent prior (see also Quiñonero-Candela andRasmussen [2005]) is

p(f) ≈ N (f |0, Qnn + diag(Knn −Qnn)) . (8)

The Generalized FITC method combines this approx-imate prior with a Bernoulli likelihood and uses EPto approximate the posterior. The form of the priormeans that the linear algebra within the EP updates issimplified, and a round of updates costs O(NM2). EPis nested inside an optimization loop, where the covari-ance hyper-parameters and inducing inputs are opti-mized against the EP approximation to the marginallikelihood. Computing the marginal likelihood approx-imation and gradients costs O(NM2).The generalized FITC method works well in practise,often finding solutions which are as good as or bet-ter than the IVM, with fewer inducing inputs points[Naish-Guzman and Holden, 2007].

Discussion Despite performing significantly betterthan the IVM, GFITC does not satisfy our require-ments for a scalable GP classifier. There is no clear

way to distribute computation or use stochastic op-timization in GFITC: we find that it is limited to afew thousand data. Further, the positions of the in-ducing input points Z can be optimized against theapproximation to the marginal likelihood, but there isno guarantee that this will provide a good solution:indeed, our experiments in section 5 show that thisleads to suboptimal behaviour.

To obtain the ability to place inducing inputs as Tit-sias [2009] and to scale as Hensman et al. [2013], we de-sire a bound on the marginal likelihood against whichto optimize Z. In the next section, we attempt to buildsuch a variational approximation in the same fashionas FITC, by first using existing variational methodsto approximate the covariance, and then using furthervariational approximate methods to deal with non-conjugacy.

3 Two stage approaches

A straightforward approach to building sparse GPclassifiers is to separate the low-rank approximationto the covariance from the non-Gaussian likelihood.In other words, simply treat the approximation to thecovariance matrix as the prior, and then select an ap-proximate inference algorithm (e.g. from one of thosecompared by Nickisch and Rasmussen [2008]), andthen proceed with approximate inference, exploitingthe form of the approximate covariance where possi-ble for computation saving.

This is how the generalized FITC approximation wasderived. However, as described above we aim to con-struct variational approximations.

It’s possible to construct a variational approach in thismould by using the fact that the probit likelihood canbe written as a convolution of a unit Gaussian and astep function. Without modifying our original model,we can introduce a set of additional latent variables gwhich relate to the original latent variables f througha unit variance isotropic Gaussian, similar in spirit toGirolami and Rogers [2006]:

p(y, f ,g) =

N∏n=1

B(yn|θ(gn))N (g|f , I)N (f |0,K) (9)

where θ is a step-function inverse-link.The original model (1) is recovered bymarginalization of the additional latent vec-tor:

∫ ∏Nn=1 B(yn | θ(gn))N (g | f , I)dg =∏N

n=1 B(yn |φ(fn)).We can now proceed by using a variational sparse GPbound on g, followed by a further variational approx-imation to deal with the non-Gaussian likelihood.

353

Scalable Variational Gaussian Process Classification

3.1 Sparse mean field approach

Encouraged by the success of a (non-sparse) factoriz-ing approximation made by [Hensman et al., 2014], wecouple a variational bound on p(g) with a mean-fieldapproximation. Substituting the variational bound fora Gaussian sparse GP (6) with our augmented model(9) (where g in (9) replaces y in (6)), we arrive at abound on the joint distribution:

p(y, g) ≥N∏n=1

B(yn | θ(gn))N (g |0,Qnn + I)

exp{− 12 tr(Knn −Qnn)} .(10)

Assuming a factorizing distribution q(g) =∏n q(gn),

we obtain a lower bound on the marginal likelihood inthe usual variational way, and the optimal form of theapproximating distribution is a truncated Gaussian

q?(gn) = B(yn | θ(gn))N (gn | an, σ̃2n)/γn , (11)

where σ̃−2n is given by the nth diagonal element of

[Qnn + I]−1, γn are the required normalizers and an

are free variational parameters. Some cancellation inthe bound leads to a tractable expression:

log p(y) ≥N∑n=1

log γn − 12 log |Qnn + I|

− 12 tr([Qnn + I]−1Eq(g)[gg>])

+ 12

N∑n=1

{log σ̃2n +

Eq(gn)[(an − gn)2]σ̃2n

}− 12 tr(Knn −Qnn). (12)

All the components of this bound can be computedin maximum O(NM2) time. The expectations underthe factorizing variational distribution (and their gra-dients) are available in closed form.

Approximate inference can now proceed by optimizingthis bound with respect to the variational parametersa = {an}Nn=1, alongside the hyper-parameters of thecovariance function and the inducing points Z. Em-pirically, we find it useful to optimize the variationalparameters on an ‘inner loop’, which costs O(NM)per iteration. Computing the relevant gradients out-side this loop costs O(NM2).

Predictions To make a prediction for a new latentfunction value g? at a test input point x?, we wouldlike to marginalize across the variational distribution:

p(g? |y) ≈∫p(g? |g)q(g)dg . (13)

Since this is in general intractable, we approximate itby Monte Carlo. Since the approximate posterior q is

a factorized series of truncated normal distributions,it is straight-forward to sample from. Prediction of asingle test point costs O(N).Alternatively, we can employ a Gaussian approxima-tion to the posterior as suggested by Nickisch andRasmussen [2008]. In our sparse formulation, this re-sults in p(g |y) ≈ N (g |ΣK−1mmKmnEq(g)[g],Σ), withΣ = Kmm − Kmn[Qnn + I]−1Knm. Substituting into (13) results in a computational cost of O(M2) topredict a single test point.

3.2 A more scalable method?

The mean-field method in section 3.1 is somewhat un-satisfactory since the number of variational parametersscales with N . The variational parameters are also de-pendent on each other, and so application of stochasticoptimization or distributed computing is difficult. Forthe Gaussian likelihood case, Hensman et al. [2013]proposes a bound on log p(y) (7) which can be opti-mized effectively with O(M2) parameters. Perhaps itis possible to obtain a scalable algorithm by substitut-ing this bound for p(g) as in the above?

Substituting (7) into (9) (again replacing y with g)results in a tractable integral to obtain a bound onthe marginal likelihood [Hensman et al., 2013]:

log p(y) ≥∏n

B(yn |φ(k>nK−1mmm))

− 12 tr(KnmK−1mmSK−1mmKmn)− 12 tr(Knn −Qnn)−KL[q(u)||p(u)] ,

where k>n is the nth row of Knm. This bound can be

optimized in a stochastic or distributed way [Tolva-nen, 2014], but has been found to be less effective thanmight be expected (Owen Thomas, personal commu-nication). To understand why, consider the case wherethe inducing points are set to the data points Z = X,so that u = f . The bound reduces to

log p(y) ≥∏n

B(yn |φ(mn))

− 12 tr(S)−KL[q(f)||p(f)] .(14)

A straight-forward derivative reveals a unique opti-mum where S−1 = K−1nn + I, and m is the maximuma posteriori (MAP) point. This variational approxi-mation ‘decouples’ the latent function values f fromthe non-Gaussian likelihood throught g, and we areleft with an unsatisfactory posterior variance S whichdepends only on the unit-isotropic noise of g.

This approximation is reminiscent of the Laplace ap-proximation, which also places the mean of the pos-terior at the MAP point, and approximates the co-variance with S−1 = K−1nn + W, where W is the Hes-sian of the log likelihood evaluated at the MAP point.

354

James Hensman, Alexander G. de G. Matthews, Zoubin Ghahramani

The Laplace approximation is known to be relativelyineffective for classification [Nickisch and Rasmussen,2008], so it is no surprise that this variational approx-imation should be ineffective. From here we abandonthis bound: the next section sees the construction of avariational bound which is scalable and (empirically)effective.

4 A single variational bound

Here we obtain a bound on the marginal likelihoodwithout introducing the additional latent variables asabove, and without making factorizing assumptions.We first return to the bound (5) on the conditionalused to construct the variational bounds for the Gaus-sian case:

log p(y |u) ≥ Ep(f |u) [log p(y | f)] (15)

which is in general intractable for the non-conjugatecase. We nevertheless persist, recalling the standardvariational equation

log p(y) ≥ Eq(u) [log p(y |u)]−KL [q(u)||p(u)] . (16)

Substituting (15) into (16) results in a (further) boundon the marginal likelihood:

log p(y) ≥ Eq(u) [log p(y |u)]−KL [q(u)||p(u)]≥ Eq(u)

[Ep(f |u) [log p(y|f)]

]−KL [q(u)||p(u)]

= Eq(f) [log p(y | f)]−KL [q(u)||p(u)] (17)

where we have defined: q(f) :=∫p(f |u)q(u)du.

Consider the case where q(u) = N (u |m,S). Thisgives the following functional form for q(f):

q(f) = N (f |Am ,Knn + A(S−Kmm)A>) (18)

with A = KnmK−1mm. Since in the classification case

the likelihood factors as p(y | f) = ∏Ni=1 p(yi | fi), weonly require the marginals of q(f) in order to computethe expectations in (17). We are left with some one-dimensional integrals of the log-likelihood, which canbe computed by e.g. Gauss-Hermite quadrature, andthe bound log p(y) ≥

N∑n=1

Eq(fn) [log p(yn | fn)]−KL [q(u)||p(u)] . (19)

Our algorithm then consists of maximizing the pa-rameters of q(u) with respect to this bound on themarginal likelihood using gradient based optimization.To maintain positive-definiteness of S, we represent itusing a lower triangular form S = LL>, which allowsus to perform unconstrained optimization.

Computations and Scalability Computing theKL divergence of the bound (19) requires O(M3) com-putations. Since we expect the number of required in-ducing points M to be much smaller than the numberof data N , most of the work will be in computing theexpected likelihood terms. To compute the derivativesof these, we use the Gaussian identities made familiarto us by Opper and Archambeau [2009]:

∂

∂µEN (x|µ,σ2)

[f(x)

]= EN (x|µ,σ2)

[ ∂∂xf(x)

]∂

∂σ2EN (x|µ,σ2)

[f(x)

]= 12EN (x|µ,σ2)

[ ∂2∂x2

f(x)].

(20)

We can make use of these by substituting f forlog p(yn | fn) and µ, σ2 for the marginals of q(f) in (19).These derivatives also have to be computed by quadra-ture methods, after which derivatives with respect tom,L,Z and any covariance function parameters re-quires the application of straight-forward algebra.

We also have the option to optimize the objective ina distributed fashion due to the ease of parallelizingthe simple sum over N , or in a stochastic fashion byselecting mini-batches of the data at random as weshall show in the following. We note that our boundis similar to that proposed by Chai [2012] for multi-nomial probit regression, and also by [Seeger, 2003]for a subset-of-data approach: we propose scalabilityof our method through stochastic optimization of thevariational parameters and inducing input positions.

Predictions Our approximate posterior is given asq(f ,u) = p(f |u)q(u). To make predictions at a setof test points X? for the new latent function valuesf?, we substitute our approximate posterior into thestandard probabilistic rule:

p(f? |y) =∫p(f? | f ,u)p(f ,u |y)dfdu

≈∫p(f? | f ,u)p(f |u)q(u)dfdu

=

∫p(f? |u)q(u)du (21)

where the last line occurs due to the consistency rulesof the GP. The integral is tractable similarly to (18),and we can compute the mean and variance of a test-latent f? in O(M2), from which the distribution of thetest label y? is easily computed.

Limiting cases To relate this method to existingwork, consider two limiting cases of this bound. First,when the inducing variables are equal to the datapoints Z = X, and second, where the likelihood is re-placed with Gaussian noise.

355

Scalable Variational Gaussian Process Classification

KLSP

M=4 M=8 M=16 M=32 M=64 Full

MF

GFIT

C

Figure 1: The effect of increasing the number of inducing points for the banana dataset. Rows represent theKL method, the mean field method and Generalized FITC, whilst columns show increasing numbers of inducingpoints. In each pane, the colored points represent training data, the inducing inputs are black dots and thedecision boundaries are black lines. The rightmost column shows the result of the equivalent non-sparse methods

When Z = X, the approximate posterior reduces toq(f) = N (f |m, S), and the number of parametersrequired to represent the covariance can be reducedto 2N [Opper and Archambeau, 2009], and we haverecovered the full-Gaussian approximation (the ‘KL’method described by Nickisch and Rasmussen [2008]).

If the likelihood were Gaussian, the expectations inequation (19) would be computable in closed form, andafter a little re-arranging the result is as [Hensmanet al., 2013], equation (7). In this case the bound hasa unique solution for m and S, which recovers thevariational bound of Titsias [2009], equation (6). Inthe case where Z = X and the likelihood is Gaussian,exact inference is recovered.

5 Experiments

We have proposed two variational approximations forGP classification. The first in section 3 comprises amean-field approximation after making a variationalapproximation to the prior over an augmented latentvector. The second in section 4 proposes to minimizethe KL divergence using a Gaussian approximation ata set of inducing points. We henceforth refer to theseas the MF (mean-field) and KL methods respectively.

Increasing the number of inducing points Tocompare the methods with the state-of-the-art Gen-eralized FITC method, we first turn to the two-dimensional Banana dataset. For all three methods,we initialized the inducing points using k-means clus-

tering. For the generalized FITC method we used theimplementation provided by Rasmussen and Nickisch[2010]. For all the methods we used the L-BFGS-Boptimizer [Zhu et al., 1997].

With the expectation that increasing the number ofinducing points should improve all three methods, weapplied 4 to 64 inducing points, as shown in Figure1. The KL method pulls the inducing points positionstoward the decision boundary, and provides a near-optimal solution with 16 inducing points. The MFmethod is less able to adapt the inducing input posi-tions, but provides good solutions at the same numberof inducing points. The Generalized FITC methodappears to pull inducing points towards the decisionboundary, but is unable to make good use of 64 induc-ing points, moving some to the decision boundary andsome toward the origin.

Numerical comparison We compared the perfor-mance of the classifiers for a number of commonly usedclassification data sets. We took ten folds of the dataand report the median hold out negative log probabil-ity and 2-σ confidence intervals. For comparison weused the EP FITC implementation from the GPMLtoolbox [Rasmussen and Nickisch, 2010] which is gen-erally considered to be amongst the best implementa-tions of this algorithm. For the KL and sparse KLmethods we found that the optimization behaviourwas improved by freezing the kernel hyper parametersat the beginning of optimization and then unfreezingthem once a reasonable set of variational parameters

356

James Hensman, Alexander G. de G. Matthews, Zoubin Ghahramani

Table 1: comparison of the performance of our proposed methods and Generalized FITC on benchmark datasets.

Dataset N\P KL EP EPFitcM=8

EPFitcM=30%

KLSpM=8

KLSpM=30%

MFSpM=8

MFSpM=30%

thyroid 140\5 .11± .05 .11± .05 .15± .04 .13± .04 .13± .06 .09± .05 .22± .16 .15± .14heart 170\13 .46± .14 .43± .11 .42± .08 .44± .08 .42± .11 .47± .18 .41± .15 .43± .19twonorm 400\20 .17± .43 .08± .01 Large .08± .01 .08± .01 .09± .02 Large Largeringnorm 400\20 .21± .22 .18± .02 .34± .01 .20± .01 .41± .09 .15± .03 .46± .00 Largegerman 700\20 .51± .09 .48± .06 .49± .04 .50± .04 .49± .05 .51± .08 .52± .06 .51± .09waveform 400\21 .23± .09 .23± .02 .22± .01 .22± .01 .23± .01 .25± .04 .28± .01 Largecancer 192\9 .56± .09 .55± .08 .58± .06 .56± .07 .56± .07 .58± .13 .58± .11 .59± .11flare solar 108\9 .59± .02 .60± .02 .57± .02 .57± .02 .59± .02 .58± .02 .64± .05 .60± .04diabetes 468\8 .48± .04 .48± .03 .50± .02 .51± .02 .47± .02 .51± .04 Large .50± .04

has been attained. Table 1 shows the result of theexperiments. In the case where a classifier gives a ex-tremely confident wrong prediction for one or moretest points one can obtain a numerically high negativelog probability. These cases are denoted as ‘large’.

The table shows that mean field based methods of-ten give over confident predictions. The results showsimilar performance between the sparse KL and FITCmethods. The confidence intervals of the two methodseither overlap or sparse KL is better. As we shall see,Sparse KL runs faster in the timed experiments thatfollow and can be run on very large datasets usingstochastic gradient descent.

Time-performance trade-off Since all the algo-rithms used perform optimization there is a trade-offfor amount of time spent on optimization against theclassification performance achieved. The whole trade-off curve can be used to characterize the efficiency ofa given algorithm.

Naish-Guzman and Holden [2007] consider the imagedataset amongst their benchmarks. Of the datasetsthey consider it is one of the most demanding in termsof the number of inducing points that are required. Weran timed experiments on this dataset recording thewall clock time after each function call and computingthe hold out negative log probability of the associatedparameters at each step.

We profiled the MFSp and KLSp algorithms for a va-riety of inducing point numbers. Although GPML isimplemented in MATLAB rather than Python bothhave access to fast numerical linear algebra libraries.Optimization for the sparse KL and FITC methodswas carried out using the LBFGS-B Zhu et al. [1997]algorithm. The mean field optimization surface is chal-lenging numerically and we found that performancewas improved by using the scaled conjugate gradientsalgorithm. We found no significant qualitative differ-ence between the wall clock time and CPU times.

Figure 2 shows the results. The efficient frontier of thecomparison is defined by the algorithm that for any

given time achieves the lowest negative log probabil-ity. In this case the efficient frontier is occupied by thesparse KL method. Each of the algorithms is showingbetter performance as the number of inducing pointsincreases. The challenging optimization behaviour ofthe sparse mean-field algorithm can be seen by the un-predictable changes in hold out probability as the opti-mization proceeds. The supplementary material showsthat in terms of classification error rate this algorithmis much better behaved, suggesting that MFSp finds itdifficult to produce well calibrated predictions.

The supplementary material contains plots of the holdout classification accuracy for the image dataset. Italso contains similar plots for the banana dataset.

Stochastic optimization To demonstrate that theproposed KL method can be optimized effectively ina stochastic fashion, we first turn to the MNIST dataset. Selecting the odd digits and even digits to makea binary problem, results in a training set with 60000points. We used ADADELTA method [Zeiler, 2012]from the climin toolbox [Osendorfer et al., 2014]. Se-lecting a step-rate of 0.1, a mini-batch size of 10 with200 inducing points resulted in a hold-out accuracy of97.8%. The mean negative-log-probability of the test-ing set was 0.069. It is encouraging that we are able tofit highly nonlinear, high-dimensional decision bound-aries with good accuracy, on a dataset size that is outof the range of existing methods.

Stochastic optimization of our bound allows fitting ofGP classifiers to datasets that are larger than previ-ously possible. We downloaded the flight arrival anddeparture times for every commercial flight in the USAfrom January 2008 to April 2008. 1 This dataset con-tains information about 5.9 million flights, includingthe delay in reaching the destination. We build a clas-sifier which was to predict whether any flight was tosubject to delay. As a benchmark, we first fitted a lin-ear model using scikits-learn [Pedregosa et al., 2011]which in turns uses LIBLINEAR [Fan et al., 2008]. On

1Hensman et al. use this data to illustrate regression,here we simply classify whether the delay ≤ zero.

357

Scalable Variational Gaussian Process Classification

100 101 102 103 104

0.1

0.2

0.3

0.4

0.5

Time (seconds)

Holdou

tnegativelogprobab

ility

KLSp M=4KLSp M=50KLSp M=200MFSp M=4MFSp M=50MFSp M=200EPFitc M=4EPFitc M=50EPFitc M=200

Figure 2: Temporal performance of the different methods on the image dataset.

102 103 104 1050.320.340.360.380.4

error(%

)

102 103 104 1050.6

0.62

0.64

0.66

Time (seconds)

-log

p(y)

Figure 3: Performance of the airline delay dataset.The red horizontal line depicts performance of a linearclassifier, whilst the blue line shows performance of thestochastically optimized KLsparse method.

our randomly selected hold-out set of 100000 points,this achieved an error rate of 37%, the negative-logprobability of the held-out data was 0.642.

We built a Gaussian process kernel using the sum ofa Matern- 32 and a linear kernel. For each, we intro-duced parameters which allowed for the scaling of theinputs (sometimes called Automatic Relevance Deter-mination, ARD). Using a similar optimization schemeto the MNIST data above, our method was able to ex-ceed the performance of a linear model in a few min-utes, as shown in Figure 3. The kernel parameters atconvergence suggested that the problem is highly non-linear: the relative variance of the linear kernel wasnegligible. The optimized lengthscales for the Maternpart of the covariance suggested that the most usefulfeatures were the time of day and time of year.

6 Discussion

We have presented two novel variational bounds forperforming sparse GP classification. The first, likethe existing GFITC method, makes an approximationto the covariance matrix before introducing the non-conjugate likelihood. These approaches are somewhatunsatisfactory since in performing approximate infer-ence, we necessarily introduce additional parameters(variational means and variances, or the parametersof EP factors), which naturally scale linearly with N .

Our proposed KLSP bound outperforms the state-of-the art GFITC method on benchmark datasets, and iscapable of being optimized in a stochastic fashion aswe have shown, making GP classification applicable tobig data for the first time.

In future work, we note that this work opens the doorfor several other GP models: if the likelihood factorizesin N , then our method is applicable through Gauss-Hermite quadrature of the log likelihood. We also notethat it is possible to relax the restriction of q(u) toa Gaussian form, and mixture model approximationsfollow straightforwardly, allowing scalable extensionsof Nguyen and Bonilla [2014].

Acknowledgements JH was supported by aMRC fellowship, AM and ZG by EPSRC grantEP/I036575/1, and a Google Focussed Researchaward.

358

James Hensman, Alexander G. de G. Matthews, Zoubin Ghahramani

References

K. M. A. Chai. Variational multinomial logit gaussianprocess. The Journal of Machine Learning Research,13(1):1745–1808, 2012.

C. Cortes and N. D. Lawrence, editors. Advances inNeural Information Processing Systems, Cambridge,MA, 2014. MIT Press.

Z. Dai, A. Damianou, J. Hensman, and N. Lawrence.Gaussian process models with parallelization andgpu acceleration. arXiv, 2014.

A. Damianou and N. Lawrence. Deep gaussian pro-cesses. In Proceedings of the Sixteenth InternationalConference on Artificial Intelligence and Statistics,pages 207–215, 2013.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang,and C.-J. Lin. Liblinear: A library for large lin-ear classification. The Journal of Machine LearningResearch, 9:1871–1874, 2008.

Y. Gal, M. van der Wilk, and C. E. Rasmussen. Dis-tributed variational inference in sparse Gaussianprocess regression and latent variable models. InCortes and Lawrence [2014].

M. Girolami and S. Rogers. Variational bayesianmultinomial probit regression with gaussian processpriors. Neural Computation, 18(8):1790–1817, 2006.

J. Hensman, N. Fusi, and N. D. Lawrence. Gaus-sian processes for big data. In A. Nicholson andP. Smyth, editors, Uncertainty in Artificial Intelli-gence, volume 29. AUAI Press, 2013.

J. Hensman, M. Zwießele, and N. D. Lawrence. Tiltedvariational bayes. In S. Kaski and J. Corander, ed-itors, Proceedings of the Seventeenth InternationalConference on Artificial Intelligence and Statistics,volume 33, pages 356–364. JMLR W&CP, 2014.

E. Khan, S. Mohamed, and K. P. Murphy. Fastbayesian inference for non-conjugate gaussian pro-cess regression. In P. Bartlett, editor, Advancesin Neural Information Processing Systems, pages3140–3148, Cambridge, MA, 2012. MIT Press.

M. Kuss and C. E. Rasmussen. Assessing approximateinference for binary Gaussian process classification.The Journal of Machine Learning Research, 6:1679–1704, 2005.

N. Lawrence, M. Seeger, and R. Herbrich. Fast sparseGaussian process methods: The informative vectormachine. In Proceedings of the 16th Annual Con-ference on Neural Information Processing Systems,number EPFL-CONF-161319, pages 609–616, 2003.

N. D. Lawrence. Gaussian process latent variable mod-els for visualisation of high dimensional data. Ad-vances in neural information processing systems, 16:329–336, 2004.

A. Naish-Guzman and S. Holden. The generalized fitcapproximation. In Advances in Neural InformationProcessing Systems, pages 1057–1064, 2007.

T. Nguyen and E. Bonilla. Automated variational in-ference for Gaussian process models. In Cortes andLawrence [2014].

H. Nickisch and C. E. Rasmussen. Approximations forbinary Gaussian process classification. Journal ofMachine Learning Research, 9:2035–2078, 2008.

M. Opper and C. Archambeau. The variational Gaus-sian approximation revisited. Neural computation,21(3):786–792, 2009.

C. Osendorfer, J. Bayer, S. Diot-Girard,T. Rueckstiess, and S. Urban. Climin.http://github.com/BRML/climin, 2014. Ac-cessed: 2014-10-19.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,R. Weiss, V. Dubourg, et al. Scikit-learn: Machinelearning in python. The Journal of Machine Learn-ing Research, 12:2825–2830, 2011.

J. Quiñonero-Candela and C. E. Rasmussen. A uni-fying view of sparse approximate Gaussian processregression. The Journal of Machine Learning Re-search, 6:1939–1959, 2005.

C. E. Rasmussen and H. Nickisch. Gaussian processesfor machine learning (gpml) toolbox. The Journalof Machine Learning Research, 11:3011–3015, 2010.

C. E. Rasmussen and C. K. I. Williams. Gaussian pro-cesses for machine learning. MIT Press, Cambridge,MA, 2006.

M. Seeger. Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse ap-proximations. PhD thesis, University of Edinburgh,2003.

E. Snelson and Z. Ghahramani. Sparse Gaussian pro-cesses using pseudo-inputs. In Advances in NeuralInformation Processing Systems, pages 1257–1264,2005.

M. Titsias. Variational learning of inducing variablesin sparse Gaussian processes. In D. van Dyk andM. Welling, editors, Proceedings of the Twelfth In-ternational Workshop on Artificial Intelligence andStatistics, volume 5, pages 567–574, ClearwaterBeach, FL, 16-18 April 2009. JMLR W&CP 5.

V. Tolvanen. Gaussian processes with monotonicityconstraints for big data. Master’s thesis, Aalto Uni-veristy, June 2014.

A. G. Wilson, D. A. Knowles, and Z. Ghahramani.Gaussian process regression networks. In J. Lang-ford and J. Pineau, editors, Proceedings of the

359

Scalable Variational Gaussian Process Classification

29th International Conference on Machine Learning(ICML), Edinburgh, June 2012. Omnipress.

M. D. Zeiler. Adadelta: An adaptive learning ratemethod. arXiv preprint arXiv:1212.5701, 2012.

C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algo-rithm 778: L-bfgs-b: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans-actions on Mathematical Software (TOMS), 23(4):550–560, 1997.

360