# Gaussian Processes for Big Data (UAI 2013, auai.org/uai2013/prints/papers/244.pdf)

Gaussian Processes for Big Data

James Hensman

Dept. Computer Science, The University of Sheffield

Sheffield, UK

Nicolo Fusi

Dept. Computer Science, The University of Sheffield

Sheffield, UK

Neil D. Lawrence

Dept. Computer Science, The University of Sheffield

Sheffield, UK

Abstract

We introduce stochastic variational inference for Gaussian process models. This enables the application of Gaussian process (GP) models to data sets containing millions of data points. We show how GPs can be variationally decomposed to depend on a set of globally relevant inducing variables which factorize the model in the necessary manner to perform variational inference. Our approach is readily extended to models with non-Gaussian likelihoods and latent variable models based around Gaussian processes. We demonstrate the approach on a simple toy problem and two real world data sets.

1 Introduction

Gaussian processes [GPs, Rasmussen and Williams, 2006] are perhaps the dominant approach for inference on functions. They underpin a range of algorithms for regression, classification and unsupervised learning. Unfortunately, when applying a Gaussian process to a data set of size $n$, exact inference has complexity $O(n^3)$ with storage demands of $O(n^2)$. This hinders the application of these models in many domains: in particular, large spatiotemporal data sets, video, large social network data (e.g. from Facebook), population scale medical data sets, and models that correlate across multiple outputs or tasks (for these models complexity is $O(n^3p^3)$ and storage is $O(n^2p^2)$, where $p$ is the number of outputs or tasks). Collectively we can think of these applications as belonging to the domain of big data.

Author footnote: Also at Sheffield Institute for Translational Neuroscience, SITraN.

Traditionally in Gaussian processes a large data set is one that contains over a few thousand data points. Even to accommodate these data sets, various approximate techniques are required. One approach is to partition the data set into separate groups [e.g. Snelson and Ghahramani, 2007, Urtasun and Darrell, 2008]. An alternative is to build a low rank approximation to the covariance matrix based around inducing variables [see e.g. Csato and Opper, 2002, Seeger et al., 2003, Quinonero Candela and Rasmussen, 2005, Titsias, 2009]. These approaches lead to a computational complexity of $O(nm^2)$ and storage demands of $O(nm)$, where $m$ is a user selected parameter governing the number of inducing variables. However, even these reduced storage demands are prohibitive for big data, where $n$ can be many millions or billions. For parametric models, stochastic gradient descent is often applied to resolve this storage issue, but in the GP domain, it hasn't been clear how this should be performed. In this paper we show how recent advances in variational inference [Hensman et al., 2012, Hoffman et al., 2012] can be combined with the idea of inducing variables to develop a practical algorithm for fitting GPs using stochastic variational inference (SVI).
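To make these storage demands concrete, the following back-of-the-envelope sketch counts the bytes needed for a dense $n \times n$ covariance matrix versus an $n \times m$ low rank factor. The 8-byte (double precision) entry size and the particular values of `n` and `m` are illustrative assumptions, not figures from the paper.

```python
def dense_matrix_bytes(rows, cols, bytes_per_entry=8):
    """Bytes needed to store a dense rows x cols matrix (assumes float64 entries)."""
    return rows * cols * bytes_per_entry

n = 1_000_000   # hypothetical number of data points
m = 1_000       # hypothetical number of inducing variables

full_cov = dense_matrix_bytes(n, n)   # O(n^2): exact GP covariance K_nn
low_rank = dense_matrix_bytes(n, m)   # O(nm): cross-covariance K_nm of a sparse GP

print(f"K_nn: {full_cov / 1e12:.1f} TB")   # prints "K_nn: 8.0 TB"
print(f"K_nm: {low_rank / 1e9:.1f} GB")    # prints "K_nm: 8.0 GB"
```

Under these assumptions the low rank factor fits on a single machine at a million points, but it grows linearly in $n$ and so becomes impractical at billions of points.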

2 Sparse GPs Revisited

We start with a succinct rederivation of the variational approach to inducing variables of Titsias [2009]. This allows us to introduce notation and derive expressions which allow for the formulation of an SVI algorithm.

Consider a data vector $\mathbf{y}$, where each entry $y_i$ is a noisy observation of the function $f(\mathbf{x}_i)$, for all the points $\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^{n}$. (Footnote 1: Our derivation trivially extends to multiple independent output dimensions, but we omit them here for clarity.) We consider the noise to be independent Gaussian with precision $\beta$. Introducing a Gaussian process prior over $f(\cdot)$, let the vector $\mathbf{f}$ contain values of the function at the points $\mathbf{X}$. We shall also introduce a set of inducing variables: let the vector $\mathbf{u}$ contain values of the function $f$ at the points $\mathbf{Z} = \{\mathbf{z}_i\}_{i=1}^{m}$, which live in the same space as $\mathbf{X}$. Using standard Gaussian process methodologies, we can write

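$$p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}\left(\mathbf{y} \mid \mathbf{f}, \beta^{-1}\mathbf{I}\right),$$

$$p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\left(\mathbf{f} \mid \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u}, \widetilde{\mathbf{K}}\right),$$

$$p(\mathbf{u}) = \mathcal{N}\left(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{mm}\right),$$

where $\mathbf{K}_{mm}$ is the covariance function evaluated between all the inducing points, $\mathbf{K}_{nm}$ is the covariance function between all inducing points and training points, and we have defined $\widetilde{\mathbf{K}} = \mathbf{K}_{nn} - \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}$.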
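As a minimal sketch of these quantities, the NumPy snippet below builds $\mathbf{K}_{mm}$, $\mathbf{K}_{nm}$ and $\widetilde{\mathbf{K}}$ and forms the mean of $p(\mathbf{f} \mid \mathbf{u})$. The squared exponential kernel, the synthetic inputs and the small `jitter` added to $\mathbf{K}_{mm}$ for numerical stability are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance between rows of A (a x d) and B (b x d)."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

rng = np.random.default_rng(0)
n, m = 50, 8
X = rng.uniform(0.0, 5.0, size=(n, 1))      # training inputs (synthetic)
Z = np.linspace(0.0, 5.0, m)[:, None]       # inducing inputs, same space as X

jitter = 1e-8                               # numerical stabiliser (assumption)
K_mm = rbf_kernel(Z, Z) + jitter * np.eye(m)
K_nm = rbf_kernel(X, Z)
K_nn = rbf_kernel(X, X)

# K_nm K_mm^{-1}, computed with a linear solve rather than an explicit inverse
A = np.linalg.solve(K_mm, K_nm.T).T
K_tilde = K_nn - A @ K_nm.T                 # covariance of p(f | u)

# Draw u ~ p(u) = N(0, K_mm), then form the mean of p(f | u)
u = np.linalg.cholesky(K_mm) @ rng.standard_normal(m)
f_mean = A @ u
```

The diagonal of `K_tilde` is the marginal variance of each $f_i$ given $\mathbf{u}$; it shrinks towards zero for training inputs that lie close to an inducing input.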

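We first apply Jensen's inequality on the conditional probability $p(\mathbf{y} \mid \mathbf{u})$:

$$\log p(\mathbf{y} \mid \mathbf{u}) = \log \left\langle p(\mathbf{y} \mid \mathbf{f}) \right\rangle_{p(\mathbf{f} \mid \mathbf{u})} \ge \left\langle \log p(\mathbf{y} \mid \mathbf{f}) \right\rangle_{p(\mathbf{f} \mid \mathbf{u})} \triangleq \mathcal{L}_1, \tag{1}$$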

where $\langle \cdot \rangle_{p(x)}$ denotes an expectation under $p(x)$. For Gaussian noise taking the expectation inside the log is tractable, but it results in an expression containing $\mathbf{K}_{nn}^{-1}$, which has a computational complexity of $O(n^3)$. Bringing the expectation outside the log gives a lower bound, $\mathcal{L}_1$, which can be computed with complexity $O(m^3)$. Further, when $p(\mathbf{y} \mid \mathbf{f})$ factorises across the data,

$$p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} p(y_i \mid f_i),$$

then this lower bound can be shown to be separable across $\mathbf{y}$, giving

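$$\exp(\mathcal{L}_1) = \prod_{i=1}^{n} \mathcal{N}\left(y_i \mid \mu_i, \beta^{-1}\right) \exp\left(-\tfrac{1}{2}\beta\tilde{k}_{i,i}\right) \tag{2}$$

where $\boldsymbol{\mu} = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{u}$ and $\tilde{k}_{i,i}$ is the $i$th diagonal element of $\widetilde{\mathbf{K}}$. Note that the difference between our bound and the original log likelihood is given by the Kullback-Leibler (KL) divergence between the posterior over the mapping function given the data and the inducing variables and the posterior of the mapping function given the inducing variables only,

$$\mathrm{KL}\left(p(\mathbf{f} \mid \mathbf{u}) \,\|\, p(\mathbf{f} \mid \mathbf{u}, \mathbf{y})\right).$$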

This KL divergence is minimized when there are $m = n$ inducing variables and they are placed at the training data locations. This means that $\mathbf{u} = \mathbf{f}$ and $\mathbf{K}_{mm} = \mathbf{K}_{nm} = \mathbf{K}_{nn}$, meaning that $\widetilde{\mathbf{K}} = \mathbf{0}$. In this case we recover $\exp(\mathcal{L}_1) = p(\mathbf{y} \mid \mathbf{f})$ and the bound becomes an equality because $p(\mathbf{f} \mid \mathbf{u})$ is degenerate. However, since $m = n$, there would be no computational or storage advantage from the representation. When $m < n$ the bound can be maximised with respect to $\mathbf{Z}$ (which are variational parameters). This minimises the KL divergence and ensures that $\mathbf{Z}$ are distributed amongst the training data $\mathbf{X}$ such that all $\tilde{k}_{i,i}$ are small. In practice this means that the expectations in (1) are only taken across a narrow domain ($\tilde{k}_{i,i}$ is the marginal variance of $p(f_i \mid \mathbf{u})$), keeping Jensen's bound tight.
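The tightness argument can be checked numerically. The sketch below evaluates $\mathcal{L}_1$ from equation (2) alongside the exact $\log p(\mathbf{y} \mid \mathbf{u}) = \log \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \widetilde{\mathbf{K}} + \beta^{-1}\mathbf{I})$, once with $m < n$ and once with $\mathbf{Z} = \mathbf{X}$. The kernel, the toy data, the value of $\beta$ and the jitter term are assumptions chosen for illustration.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    """Squared exponential covariance for 1-D inputs (unit variance)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def log_gaussian(y, mean, cov):
    """Log density of a multivariate normal N(y | mean, cov)."""
    r = y - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(cov, r))

def jensen_bound_and_exact(X, Z, y, u, beta, jitter=1e-8):
    """Return (L1, log p(y | u)) for Gaussian noise with precision beta."""
    K_mm = rbf(Z, Z) + jitter * np.eye(len(Z))
    K_nm = rbf(X, Z)
    A = np.linalg.solve(K_mm, K_nm.T).T            # K_nm K_mm^{-1}
    mu = A @ u
    K_tilde = rbf(X, X) - A @ K_nm.T
    # Equation (2): univariate Gaussian log densities minus the trace penalty
    L1 = np.sum(0.5 * np.log(beta / (2 * np.pi))
                - 0.5 * beta * (y - mu) ** 2
                - 0.5 * beta * np.diag(K_tilde))
    exact = log_gaussian(y, mu, K_tilde + np.eye(len(X)) / beta)
    return L1, exact

rng = np.random.default_rng(1)
X = np.linspace(0.0, 5.0, 8)
beta = 100.0
y = np.sin(X) + rng.standard_normal(8) / np.sqrt(beta)

Z_few = np.array([0.5, 2.5, 4.5])                  # m < n
L1_few, exact_few = jensen_bound_and_exact(X, Z_few, y, np.sin(Z_few), beta)
L1_all, exact_all = jensen_bound_and_exact(X, X, y, np.sin(X), beta)   # Z = X

gap_few = exact_few - L1_few    # non-negative by Jensen's inequality
gap_all = exact_all - L1_all    # close to zero: the bound is tight when Z = X
```

With only three inducing inputs several $\tilde{k}_{i,i}$ are large and the gap is substantial; with $\mathbf{Z} = \mathbf{X}$ it vanishes up to the jitter.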

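Before deriving the expressions for stochastic variational inference using $\mathcal{L}_1$, we recover the bound of Titsias [2009] by marginalising the inducing variables,

$$\log p(\mathbf{y} \mid \mathbf{X}) = \log \int p(\mathbf{y} \mid \mathbf{u})\, p(\mathbf{u})\, \mathrm{d}\mathbf{u} \ge \log \int \exp\{\mathcal{L}_1\}\, p(\mathbf{u})\, \mathrm{d}\mathbf{u} \triangleq \mathcal{L}_2, \tag{3}$$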

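which with some linear algebraic manipulation leads to

$$\mathcal{L}_2 = \log \mathcal{N}\left(\mathbf{y} \mid \mathbf{0}, \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn} + \beta^{-1}\mathbf{I}\right) - \tfrac{1}{2}\beta\,\mathrm{tr}(\widetilde{\mathbf{K}}),$$

matching the result of Titsias, with the implicit approximating distribution $q(\mathbf{u})$ having precision

$$\boldsymbol{\Lambda} = \beta\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1} + \mathbf{K}_{mm}^{-1}$$

and mean

$$\hat{\mathbf{u}} = \beta\boldsymbol{\Lambda}^{-1}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}\mathbf{y}.$$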
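A small numerical sketch of the collapsed bound follows, comparing $\mathcal{L}_2$ with the exact log marginal likelihood $\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K}_{nn} + \beta^{-1}\mathbf{I})$ and computing the implicit $q(\mathbf{u})$. The kernel, the synthetic data and the jitter are assumptions, and the explicit matrix inverse is used only because $m$ is tiny here.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared exponential covariance for 1-D inputs (unit variance)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def log_gaussian(y, cov):
    """Log density of a zero-mean multivariate normal N(y | 0, cov)."""
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

rng = np.random.default_rng(2)
n, m, beta = 40, 6, 25.0
X = np.sort(rng.uniform(0.0, 5.0, n))
Z = np.linspace(0.0, 5.0, m)
y = np.sin(X) + rng.standard_normal(n) / np.sqrt(beta)

K_mm = rbf(Z, Z) + 1e-8 * np.eye(m)          # jitter is an assumption
K_nm = rbf(X, Z)
K_nn = rbf(X, X)
K_mm_inv = np.linalg.inv(K_mm)               # acceptable for small m in a sketch
Q_nn = K_nm @ K_mm_inv @ K_nm.T              # low rank approximation to K_nn
trace_K_tilde = np.trace(K_nn - Q_nn)

# Collapsed bound L2 and the exact log marginal likelihood it lower-bounds
L2 = log_gaussian(y, Q_nn + np.eye(n) / beta) - 0.5 * beta * trace_K_tilde
log_marginal = log_gaussian(y, K_nn + np.eye(n) / beta)

# Implicit optimal q(u): precision Lambda and mean u_hat
Lambda = beta * K_mm_inv @ K_nm.T @ K_nm @ K_mm_inv + K_mm_inv
u_hat = beta * np.linalg.solve(Lambda, K_mm_inv @ K_nm.T @ y)
```

Note that evaluating `L2` this way still touches all $n$ points at once, which is the dependence the next section removes.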

3 SVI for GPs

One of the novelties of the Titsias bound was that, rather than explicitly representing a variational distribution for $q(\mathbf{u})$, these variables are collapsed [Hensman et al., 2012]. However, for stochastic variational inference to work on Gaussian processes, it turns out we need to maintain an explicit representation of these inducing variables.

Stochastic variational inference (SVI) allows variational inference for very large data sets, but it can only be applied to probabilistic models which have a set of global variables, and which factorise in the observations and latent variables as in Figure 1(a). Gaussian processes do not have global variables and exhibit no such factorisation (Figure 1(b)). By introducing inducing variables $\mathbf{u}$, we have an appropriate model for SVI (Figure 1(c)). Unfortunately, marginalising $\mathbf{u}$ re-introduces dependencies between the observations, and eliminates the global parameters. In the following, we derive a lower bound on $\mathcal{L}_2$ which includes an explicit variational distribution $q(\mathbf{u})$, enabling SVI. We then derive the required natural gradients and discuss how latent variables might be used.

Because there are a fixed number of inducing variables (specified by the user at algorithm design time) we can perform stochastic variational inference, greatly increasing the size of data sets to which we can apply Gaussian processes.

(a) Requirements for SVI (b) Gaussian process regression (c) Variational GP regression

Figure 1: Graphical models showing (a) the required form for a probabilistic model for SVI (reproduced from [Hoffman et al., 2012]), with global variables $\mathbf{g}$ and latent variables $\mathbf{z}$; (b) the graphical model corresponding to Gaussian process regression, where connectivity between the values of the function $f_i$ is denoted by a loop around the plate; (c) the graphical model corresponding to the sparse GP model, with inducing variables $\mathbf{u}$ working as global variables, and the term $\mathcal{L}_1$ acting as $\log p(y_i \mid \mathbf{u}, \mathbf{x}_i)$. Marginalisation of $\mathbf{u}$ leads to the variational DTC formulation, introducing dependencies between the observations.

3.1 Global Variables

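To apply stochastic variational inference to a Gaussian process model, we must have a set of global variables. The variables $\mathbf{u}$ will perform this role, and we introduce a variational distribution $q(\mathbf{u})$, and use it to lower bound the quantity $p(\mathbf{y} \mid \mathbf{X})$:

$$\log p(\mathbf{y} \mid \mathbf{X}) \ge \left\langle \mathcal{L}_1 + \log p(\mathbf{u}) - \log q(\mathbf{u}) \right\rangle_{q(\mathbf{u})} \triangleq \mathcal{L}_3.$$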

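From the above we know that the optimal distribution is Gaussian, and we parametrise it as $q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}, \mathbf{S})$. The bound $\mathcal{L}_3$ becomes

$$\mathcal{L}_3 = \sum_{i=1}^{n} \left\{ \log \mathcal{N}\left(y_i \mid \mathbf{k}_i^\top\mathbf{K}_{mm}^{-1}\mathbf{m}, \beta^{-1}\right) - \tfrac{1}{2}\beta\tilde{k}_{i,i} - \tfrac{1}{2}\mathrm{tr}\left(\mathbf{S}\boldsymbol{\Lambda}_i\right) \right\} - \mathrm{KL}\left(q(\mathbf{u}) \,\|\, p(\mathbf{u})\right) \tag{4}$$

with $\mathbf{k}_i$ being a vector of the $i$th column of $\mathbf{K}_{mn}$ and $\boldsymbol{\Lambda}_i = \beta\mathbf{K}_{mm}^{-1}\mathbf{k}_i\mathbf{k}_i^\top\mathbf{K}_{mm}^{-1}$. The gradients of $\mathcal{L}_3$ with respect to the parameters of $q(\mathbf{u})$ are

$$\frac{\partial \mathcal{L}_3}{\partial \mathbf{m}} = \beta\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}\mathbf{y} - \boldsymbol{\Lambda}\mathbf{m}, \qquad \frac{\partial \mathcal{L}_3}{\partial \mathbf{S}} = \tfrac{1}{2}\mathbf{S}^{-1} - \tfrac{1}{2}\boldsymbol{\Lambda}. \tag{5}$$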

Setting the derivatives to zero recovers the optimal solution found in the previous section, namely $\mathbf{S} = \boldsymbol{\Lambda}^{-1}$, $\mathbf{m} = \hat{\mathbf{u}}$. It follows that $\mathcal{L}_2 \ge \mathcal{L}_3$, with equality at this unique maximum.

The key property of $\mathcal{L}_3$ is that it can be written as a sum of $n$ terms, each corresponding to one input-output pair $\{\mathbf{x}_i, y_i\}$: we have induced the necessary factorisation to perform stochastic gradient methods on the distribution $q(\mathbf{u})$.
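The relationship between the two bounds can be sketched numerically. The hypothetical helper `bound_L3` below evaluates equation (4) term by term; the script then checks that it matches the collapsed bound at $\mathbf{S} = \boldsymbol{\Lambda}^{-1}$, $\mathbf{m} = \hat{\mathbf{u}}$ and is lower elsewhere. The kernel, data and jitter are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared exponential covariance for 1-D inputs (unit variance)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def bound_L3(y, K_nm, K_mm, k_nn_diag, beta, m_q, S_q):
    """Evaluate equation (4) for q(u) = N(m_q, S_q)."""
    M = len(m_q)
    A = np.linalg.solve(K_mm, K_nm.T).T                  # row i is k_i^T K_mm^{-1}
    k_tilde = k_nn_diag - np.sum(A * K_nm, axis=1)
    per_point = (0.5 * np.log(beta / (2 * np.pi))
                 - 0.5 * beta * (y - A @ m_q) ** 2       # log N(y_i | k_i^T K_mm^{-1} m, 1/beta)
                 - 0.5 * beta * k_tilde
                 - 0.5 * beta * np.sum((A @ S_q) * A, axis=1))   # 0.5 tr(S Lambda_i)
    _, logdet_K = np.linalg.slogdet(K_mm)
    _, logdet_S = np.linalg.slogdet(S_q)
    kl = 0.5 * (np.trace(np.linalg.solve(K_mm, S_q))
                + m_q @ np.linalg.solve(K_mm, m_q) - M + logdet_K - logdet_S)
    return np.sum(per_point) - kl                        # sum over data minus KL(q || p)

rng = np.random.default_rng(3)
n, M, beta = 40, 6, 25.0
X = np.sort(rng.uniform(0.0, 5.0, n))
Z = np.linspace(0.0, 5.0, M)
y = np.sin(X) + rng.standard_normal(n) / np.sqrt(beta)
K_mm = rbf(Z, Z) + 1e-8 * np.eye(M)
K_nm = rbf(X, Z)
k_nn_diag = np.ones(n)                                   # unit-variance kernel

# Optimal q(u) from the collapsed derivation
K_inv = np.linalg.inv(K_mm)
Lambda = beta * K_inv @ K_nm.T @ K_nm @ K_inv + K_inv
S_opt = np.linalg.inv(Lambda)
m_opt = beta * S_opt @ K_inv @ K_nm.T @ y

L3_opt = bound_L3(y, K_nm, K_mm, k_nn_diag, beta, m_opt, S_opt)
L3_prior = bound_L3(y, K_nm, K_mm, k_nn_diag, beta, np.zeros(M), K_mm)   # q(u) = p(u)

# Collapsed bound L2 for comparison
Q_nn = K_nm @ K_inv @ K_nm.T
cov = Q_nn + np.eye(n) / beta
_, logdet = np.linalg.slogdet(cov)
L2 = (-0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))
      - 0.5 * beta * np.sum(k_nn_diag - np.diag(Q_nn)))
```

Because `per_point` is a vector with one entry per observation, summing it over a random subset and rescaling by $n$ over the subset size gives an unbiased estimate of the data term.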

3.2 Natural Gradients

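Stochastic variational inference works by taking steps in the direction of the approximate natural gradient $\tilde{g}(\boldsymbol{\theta})$, which is given by the usual gradient re-scaled by the inverse Fisher information: $\tilde{g}(\boldsymbol{\theta}) = G(\boldsymbol{\theta})^{-1}\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}}$. To work with the natural gradients of the distribution $q(\mathbf{u})$, we first recall the canonical and expectation parameters

$$\boldsymbol{\theta}_1 = \mathbf{S}^{-1}\mathbf{m}, \qquad \boldsymbol{\theta}_2 = -\tfrac{1}{2}\mathbf{S}^{-1}$$

and

$$\boldsymbol{\eta}_1 = \mathbf{m}, \qquad \boldsymbol{\eta}_2 = \mathbf{m}\mathbf{m}^\top + \mathbf{S}.$$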

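In the exponential family, properties of the Fisher information reveal the following simplification of the natural gradient [Hensman et al., 2012],

$$\tilde{g}(\boldsymbol{\theta}) = G(\boldsymbol{\theta})^{-1}\frac{\partial \mathcal{L}_3}{\partial \boldsymbol{\theta}} = \frac{\partial \mathcal{L}_3}{\partial \boldsymbol{\eta}}. \tag{6}$$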
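For readers who want to see the two parameterisations side by side, here is a minimal sketch of the conversions between the mean and covariance of $q(\mathbf{u})$, the canonical parameters and the expectation parameters. The function names are hypothetical and the test distribution is random.

```python
import numpy as np

def to_canonical(m, S):
    """(m, S) -> (theta1, theta2) = (S^{-1} m, -0.5 S^{-1})."""
    S_inv = np.linalg.inv(S)
    return S_inv @ m, -0.5 * S_inv

def from_canonical(theta1, theta2):
    """(theta1, theta2) -> (m, S)."""
    S = np.linalg.inv(-2.0 * theta2)
    return S @ theta1, S

def to_expectation(m, S):
    """(m, S) -> (eta1, eta2) = (m, m m^T + S)."""
    return m, np.outer(m, m) + S

def from_expectation(eta1, eta2):
    """(eta1, eta2) -> (m, S)."""
    return eta1, eta2 - np.outer(eta1, eta1)

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
S = B @ B.T + 4.0 * np.eye(4)        # a random positive definite covariance
m = rng.standard_normal(4)

m_from_theta, S_from_theta = from_canonical(*to_canonical(m, S))
m_from_eta, S_from_eta = from_expectation(*to_expectation(m, S))
```

Both round trips return the original mean and covariance, so an optimiser is free to store $q(\mathbf{u})$ in whichever form makes its update simplest.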

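A step of length $\ell$ in the natural gradient direction, using $\boldsymbol{\theta}_{(t+1)} = \boldsymbol{\theta}_{(t)} + \ell\,\frac{\mathrm{d}\mathcal{L}_3}{\mathrm{d}\boldsymbol{\eta}}$, yields

$$\boldsymbol{\theta}_{2(t+1)} = -\tfrac{1}{2}\mathbf{S}_{(t+1)}^{-1} = -\tfrac{1}{2}\mathbf{S}_{(t)}^{-1} + \ell\left(-\tfrac{1}{2}\boldsymbol{\Lambda} + \tfrac{1}{2}\mathbf{S}_{(t)}^{-1}\right),$$

$$\boldsymbol{\theta}_{1(t+1)} = \mathbf{S}_{(t+1)}^{-1}\mathbf{m}_{(t+1)} = \mathbf{S}_{(t)}^{-1}\mathbf{m}_{(t)} + \ell\left(\beta\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}\mathbf{y} - \mathbf{S}_{(t)}^{-1}\mathbf{m}_{(t)}\right),$$

and taking a step of unit length then recovers the same solution as above by either (3) or (5). This confirms the result discussed in Hensman et al. [2012], Hoffman et al. [2012], that taking this unit step is the same as performing a VB update. We can now obtain stochastic approximations to the natural gradient by considering the data either individually or in mini-batches.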

We note the convenient result that the natural gradient for $\boldsymbol{\theta}_2$ is positive definite (note $\boldsymbol{\Lambda} = \mathbf{K}_{mm}^{-1} + \sum_i \boldsymbol{\Lambda}_i$). This means that taking a step in that direction always leads to a positive definite matrix, and our implementation need not parameterise $\mathbf{S}$ in any way so as to ensure positive-definiteness, cf. standard gradient approaches on covariance matrices.
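The update on the canonical parameters can be sketched as below. The rescaling of mini-batch sums by $n$ over the batch size is a standard way to form an unbiased estimate and is an assumption here, as are the function name, kernel, data, batch size and step length.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared exponential covariance for 1-D inputs (unit variance)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def natural_gradient_step(theta1, theta2, K_mm_inv, K_bm, y_b, beta, n_total, step):
    """One step on the canonical parameters of q(u) using a mini-batch (K_bm, y_b)."""
    scale = n_total / len(y_b)          # rescale batch sums to the full data set
    Lambda_hat = K_mm_inv + scale * beta * K_mm_inv @ K_bm.T @ K_bm @ K_mm_inv
    target1 = scale * beta * K_mm_inv @ K_bm.T @ y_b
    theta1_new = theta1 + step * (target1 - theta1)
    theta2_new = theta2 + step * (-0.5 * Lambda_hat - theta2)
    return theta1_new, theta2_new

rng = np.random.default_rng(5)
n, M, beta = 200, 6, 25.0
X = rng.uniform(0.0, 5.0, n)
Z = np.linspace(0.0, 5.0, M)
y = np.sin(X) + rng.standard_normal(n) / np.sqrt(beta)
K_mm_inv = np.linalg.inv(rbf(Z, Z) + 1e-8 * np.eye(M))
K_nm = rbf(X, Z)

# Start from q(u) = N(0, I)
theta1, theta2 = np.zeros(M), -0.5 * np.eye(M)

# (i) Full data with a unit step: lands on the optimal q(u) in one update
t1_full, t2_full = natural_gradient_step(theta1, theta2, K_mm_inv, K_nm, y, beta, n, 1.0)
S_full = np.linalg.inv(-2.0 * t2_full)
m_full = S_full @ t1_full

# (ii) Stochastic steps on mini-batches of 20 points with a smaller step length
for _ in range(200):
    idx = rng.choice(n, size=20, replace=False)
    theta1, theta2 = natural_gradient_step(
        theta1, theta2, K_mm_inv, K_nm[idx], y[idx], beta, n, 0.05)
precision_svi = -2.0 * theta2           # stays positive definite for 0 <= step <= 1
m_svi = np.linalg.solve(precision_svi, theta1)
```

Each update makes the new precision a convex combination of the old precision and a positive definite estimate of $\boldsymbol{\Lambda}$, which is why no constrained parameterisation of $\mathbf{S}$ is needed for step lengths between 0 and 1.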

To optimise the kernel hyper-parameters and noise precision $\beta$, we take derivatives of the bound $\mathcal{L}_3$ and perform standard stochastic gradient descent alongside the variational parameters. An illustration is presented in Figure 2.

3.3 Latent Variables

The above derivations enable online learning for Gaussian process regression using SVI. Several GP based models involve inference of $\mathbf{X}$, such as the GP latent variable model [Lawrence, 2005, Titsias and Lawrence, 2010] and its extensions [e.g. Damianou et al., 2011, 2012].

To perform stochastic variational inference with latent variables, we require a factorisation as illustrated by Figure 1(a): this factorisation is provided by (4). To get a model like the Bayesian GPLVM, we need a lower bound on $\log p(\mathbf{y})$. In Titsias and Lawrence [2010] this was achieved through approximate marginalisation of $\mathcal{L}_2$, w.r.t. $\mathbf{X}$, which leads to an expression depending only on the parameters of $q(\mathbf{X})$. However this formulation scales poorly, and the variables of the optimisation are closely connected due to the marginalisation of $\mathbf{u}$. The above enables a lower bound to which SVI is immediately applicable:

$$\log p(\mathbf{y}) = \log \int p(\mathbf{y} \mid \mathbf{X})\, p(\mathbf{X})\, \mathrm{d}\mathbf{X} \ge \int q(\mathbf{X}) \left\{ \mathcal{L}_3 + \log p(\mathbf{X}) - \log q(\mathbf{X}) \right\} \mathrm{d}\mathbf{X}.$$

It is straightforward to introduce $p$ output dimensions for the data $\mathbf{Y}$, and following Titsias and Lawrence [2010], we use a factorising normal distribution $q(\mathbf{X}) = \prod_{i=1}^{n} q(\mathbf{x}_i)$. The relevant expectations of $\mathcal{L}_3$ are tractable for various choices of covariance function.

To perform SVI in this model, we now alternate between selecting a minibatch of data and optimising the relevant variables of $q(\mathbf{X})$ with $q(\mathbf{u})$ fixed, and updating $q(\mathbf{u})$ using the approximate natural gradient. We note that the form of (4) means that each of the latent variable distributions may be updated individually, enabling parallelisation across the minibatch.

3.4 Non-Gaussian likelihoods

Another advantage of the factorisation of (4) is that it enables a simple routine for inference with non-Gaussian likelihoods. The usual procedure for fitting GPs with non-Gaussian likelihoods is to approximate the likelihood using either a local variational lower bound [Gibbs and MacKay, 2000], or by expectation propagation [Kuss and Rasmussen, 2005]. These approximations to the likelihood are required because of the connections between the variables $\mathbf{f}$.

In $\mathcal{L}_3$, the bound factorises in such a way that some non-Gaussian likelihoods may be marginalised exactly, given the existing approximations. To see this, consider that we are presented not with the vector $\mathbf{y}$, but with a binary vector $\mathbf{t}$ with $t_i \in \{0, 1\}$, and the likelihood $p(\mathbf{t} \mid \mathbf{y}) = \prod_{i=1}^{n} \sigma(y_i)^{t_i}(1 - \sigma(y_i))^{(1 - t_i)}$, as in the case of classification. We can bound the marginal likelihood using $p(\mathbf{t} \mid \mathbf{X}) \ge \int p(\mathbf{t} \mid \mathbf{y}) \exp\{\mathcal{L}_3\}\, \mathrm{d}\mathbf{y}$, which involves $n$ independent one dimensional integrals due to the factorising nature of $\mathcal{L}_3$. For the probit likelihood each of these integrals is tractable.

This kind of approximation, where the likelihood is integrated exactly, is amenable to SVI in the same manner as the regression case above through computation of the natural gradient.
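To illustrate why the probit case is tractable, the sketch below uses the standard Gaussian identity $\int \Phi(y)\,\mathcal{N}(y \mid \mu, s^2)\,\mathrm{d}y = \Phi\left(\mu / \sqrt{1 + s^2}\right)$ and checks it against Gauss-Hermite quadrature. It assumes $\sigma(\cdot)$ is the standard normal CDF $\Phi$; `mu` and `var` are placeholders for whatever per-point Gaussian the bound supplies, and the function names are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard normal CDF Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_integral_closed_form(t, mu, var):
    """Integral of Phi(y)^t (1 - Phi(y))^(1 - t) N(y | mu, var) dy for t in {0, 1}."""
    sign = 1.0 if t == 1 else -1.0
    return std_normal_cdf(sign * mu / math.sqrt(1.0 + var))

def probit_integral_quadrature(t, mu, var, degree=80):
    """The same integral by Gauss-Hermite quadrature, as an independent check."""
    nodes, weights = np.polynomial.hermite.hermgauss(degree)
    ys = mu + math.sqrt(2.0 * var) * nodes          # change of variables for N(mu, var)
    phi = np.array([std_normal_cdf(v) for v in ys])
    lik = phi if t == 1 else 1.0 - phi
    return float(np.sum(weights * lik) / math.sqrt(math.pi))

mu, var = 0.3, 0.64                                 # illustrative values
closed_pos = probit_integral_closed_form(1, mu, var)
closed_neg = probit_integral_closed_form(0, mu, var)
quad_pos = probit_integral_quadrature(1, mu, var)
quad_neg = probit_integral_quadrature(0, mu, var)
```

The two class probabilities sum to one, and because each integral is one dimensional a likelihood without a closed form could fall back on the quadrature route.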

4 Experiments

4.1 Toy Data

To demonstrate our algorithm we begin with two simple toy datasets based on sinusoidal functions. In the first experiment we show how the approximation converges towards the true solution as mini-batches are included. Figure 2 shows the nature of our approximation: the variational approximation to the inducing function variables is shown.

The second toy problem (Figure 3) illustrates the convergence of the algorithm on a two dimensional problem, again based on sinusoids. Here, we start with a random initialisation for $q(\mathbf{u})$, and the model converges after 2000 iterations. We found empirically that holding the covariance parameters fixed for the first epoch results in more reliable convergence, as can be seen in Figure 4.

4.2 UK Apartment Price Data

Our first large scale Gaussian process models the changing cost of apartments in the UK. We downloaded the monthly price paid data for the period February to October 2012 from http://data.gov.uk/dataset/land-registry-monthly-price-paid-data/, which covers England and Wales, and filtered for apartments. This resulted in a data set with 75,000 entries, which we cross referenced against a postcode database to get latitude and longitude, on which we regressed the normalised logarithm of the apartment prices.

Figure 2: Stochastic variational inference on a trivial GP regression problem. Each pane shows the posterior of the GP after a batch of data, marked as solid points. Previously seen (and discarded) data are marked as empty points; the distribution $q(\mathbf{u})$ is represented by vertical errorbars.

(a) Initial condition (b) Condition at 2000 iterations

Figure 3: A two dimensional toy demo, showing the initial condition and final condition of the model. Data are marked as colored points, and the model's prediction is shown as (similarly colored) contour lines. The positions of the inducing variables are marked as empty circles.

Figure 4: Convergence of the SVIGP algorithm on the two dimensional toy data.

Randomly selecting 10,000 data as a test set, we built a GP as described with a covariance function $k(\cdot, \cdot)$ consisting of four parts: two squared exponential covariances, initialised with different length scales, were used to account for national and regional variations in property prices; a constant (or bias) term allowed for non-zero mean data; and a noise variance accounted for variation that could not be modelled using simply latitude and longitude.

We selected 800 inducing input sites using a k-means algorithm, and optimised the parameters of the covariance function alongside the variational parameters. We performed some manual tuning of the learning rates: empirically we found that the step length should be much higher for the variational parameters of $q(\mathbf{u})$ than for the values of the covariance function parameters. We used 0.01 and $1 \times 10^{-5}$. Also, we included a momentum term for the covariance function parameters (set to 0.9). We tried including momentum terms for the variational parameters, but we found this hindered performance. A large mini-batch size (1000) reduced the stochasticity of the gradient computations. We judged that the algorithm had converged after 750 iterations, as the stochastic estimate of the lower bound on the marginal likelihood failed to increase further.
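The text does not say which k-means variant was used, so the following is only a sketch of one common choice, plain Lloyd's algorithm, for picking inducing input sites from two dimensional coordinates. The function name, iteration count, number of centres and the synthetic coordinates are assumptions for illustration.

```python
import numpy as np

def kmeans_inducing_inputs(X, num_inducing, num_iters=20, seed=0):
    """Plain Lloyd's algorithm; returns num_inducing cluster centres as rows."""
    rng = np.random.default_rng(seed)
    # Initialise the centres at randomly chosen data points
    centres = X[rng.choice(len(X), size=num_inducing, replace=False)].copy()
    for _ in range(num_iters):
        # Assign each point to its nearest centre
        sq_dist = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(sq_dist, axis=1)
        # Move each centre to the mean of its points (keep it if the cluster is empty)
        for k in range(num_inducing):
            members = X[labels == k]
            if len(members) > 0:
                centres[k] = members.mean(axis=0)
    return centres

rng = np.random.default_rng(6)
# Synthetic (latitude, longitude) pairs standing in for the real postcode data
coords = rng.uniform([50.0, -5.0], [55.0, 2.0], size=(2000, 2))
Z = kmeans_inducing_inputs(coords, num_inducing=30)
```

Placing the inducing inputs at cluster centres concentrates them where the data are dense, which is in line with the earlier observation that the bound is tightest when the inducing inputs sit amongst the training data.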

For comparison to our model, we constructed a series of GPs on subsets of the training data. Splitting the data into sets of 500, 800, 1000 and 1200, we fitted a GP with the same covariance function as our stochastic GP. Parameters of the covariance function were optimised using type-II maximum likelihood for each batch. Table 1 reports the mean squared error in our model's prediction of the held out prices, as well as the same for the random sub-set approach (along with two standard deviations of the inter-sub-set variability).

Figure 5: Variability of apartment price (logarithmically!) throughout England and Wales.

Table 1: Mean squared errors in predicting the log-apartment prices across England and Wales by latitude and longitude

| Method | Mean squared error |
|---|---|
| SVIGP | 0.426 |
| Random sub-set (N=500) | 0.522 +/- 0.018 |
| Random sub-set (N=800) | 0.510 +/- 0.015 |
| Random sub-set (N=1000) | 0.503 +/- 0.011 |
| Random sub-set (N=1200) | 0.502 +/- 1.012 |

4.3 Airline Delays

The second large scale dataset we considered consists of flight arrival and departure times for every commercial flight in the USA from January 2008 to April 2008. This dataset contains extensive information about almost 2 million flights, including the delay (in minutes) in reaching the destination. The average delay of a flight in the first 4 months of 2008 was 30 minutes. Of course, much better estimates can be given by exploiting the enormous wealth of data available, but rich models are often overlooked in these cases due to the sheer size of the dataset. We randomly selected 800,000 datapoints (see footnote 2), using a random subset of 700,000 samples to train the model and 100,000 to test it. We chose to include in our model 8 of the many variables available for this dataset: the age of the aircraft (number of years since deployment), distance that needs to be covered, airtime, departure time, arrival time, day of the week, day of the month and month.

Figure 6: Posterior variance of apartment prices.

We built a Gaussian process with a squared exponential covariance function with a bias and noise term. In order to discard irrelevant input dimensions, we allowed a separate lengthscale for each input. For our experiments, we used $m = 1000$ inducing inputs and a mini-batch size of 5000. The learning rate for the variational parameters of $q(\mathbf{u})$ was set to 0.01, while the learning rate for the covariance function parameters was set to $1 \times 10^{-5}$. We also used a momentum term of 0.9 for the covariance parameters.
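A covariance of this kind can be sketched in a few lines of NumPy. The function names and all parameter values below are illustrative assumptions, not the settings or code used in the experiments.

```python
import numpy as np

def ard_se_covariance(A, B, lengthscales, variance=1.0, bias=0.0):
    """Squared exponential covariance with one lengthscale per input, plus a bias."""
    A_scaled = A / lengthscales
    B_scaled = B / lengthscales
    sq_dist = np.sum((A_scaled[:, None, :] - B_scaled[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist) + bias

def training_covariance(X, lengthscales, variance, bias, noise_variance):
    """Covariance of the noisy observations: kernel plus white noise on the diagonal."""
    K = ard_se_covariance(X, X, lengthscales, variance, bias)
    return K + noise_variance * np.eye(len(X))

rng = np.random.default_rng(7)
X = rng.standard_normal((5, 8))          # 8 input features, as in the text
lengthscales = np.ones(8)
lengthscales[3] = 1e6                    # a very long lengthscale switches feature 3 off

K = training_covariance(X, lengthscales, variance=2.0, bias=0.5, noise_variance=0.1)

# Changing only the switched-off feature leaves the covariance essentially unchanged
X_perturbed = X.copy()
X_perturbed[:, 3] += rng.standard_normal(5)
K_perturbed = training_covariance(X_perturbed, lengthscales, 2.0, 0.5, 0.1)
```

A large lengthscale (small inverse lengthscale) makes the covariance insensitive to that input, which is why the learned inverse lengthscales can be read as relevance scores.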

For the purpose of comparison, we fitted several GPs with an identical covariance function on subsets of the data. We split the data into sets of 800, 1000 and 1200 samples and optimised the parameters using type-II maximum likelihood. We repeated this procedure 10 times.

The left pane of Figure 7 shows the root mean squared error (RMSE) obtained by fitting GPs on subsets of the data. The right pane of Figure 7 shows the RMSE obtained by fitting 10 SVI GPs as a function of the iteration. The individual runs are shown in light gray, while the blue line shows the average RMSE across runs.

Footnote 2: Subsampling wasn't technically necessary, but we didn't want to overburden the memory of a shared compute node just before a submission deadline.

Figure 8: Root mean square errors for models with different numbers of inducing variables.

Figure 9: Automatic relevance determination parameters for the features used for predicting flight delays.

One of the main advantages of the approach presented here is that the computational complexity is independent of the number of samples $n$. This allowed us to use a much larger number of inducing inputs than has traditionally been possible. Conventional sparse GPs have a computational complexity of $O(nm^2)$, so for large $n$ the typical upper bound for $m$ is between 50 and 100. The impact on the prediction performance is quite significant, as highlighted in Figure 8, where we fit several SVI GPs using different numbers of inducing inputs.
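As a rough worked example of this scaling argument, the snippet below compares the leading-order operation counts $nm^2$ and $m^3$ (the figure quoted in the Discussion) for the training set size and number of inducing inputs used in this section. Constants and the cost of processing each mini-batch are ignored, so this is an order-of-magnitude sketch only.

```python
def sparse_gp_cost(n, m):
    """Leading-order operation count O(n m^2) for a conventional sparse GP."""
    return n * m**2

def svi_gp_cost(m):
    """Leading-order operation count O(m^3) per update of the SVI bound."""
    return m**3

n, m = 700_000, 1000        # training set size and inducing inputs from this section
ratio = sparse_gp_cost(n, m) / svi_gp_cost(m)
print(ratio)                # prints 700.0, i.e. n / m
```

The ratio is simply $n / m$, so the saving grows linearly with the size of the data set for a fixed number of inducing inputs.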

Looking at the inverse lengthscales in Figure 9, it's possible to get a better idea of the relevance of the different features available in this dataset. The most relevant variable turned out to be the time of departure of the flight, closely followed by the distance that needs to be covered. Distance and airtime should in theory be correlated, but they have very different relevances. This can be intuitively explained by considering that on longer flights it's easier to make up for delays at departure.

Figure 7: Root mean squared errors in predicting flight delays using information about the flight.

5 Discussion

We have presented a method for inference in Gaussian process models using stochastic variational inference. These expressions allow for the transfer of a multitude of Gaussian process techniques to big data.

We note several interesting results. First, our derivation discusses the bound on $p(\mathbf{y} \mid \mathbf{u})$ in detail, showing that it becomes tight when $\mathbf{Z} = \mathbf{X}$.

Also, we have that there is a unique solution for the parameters of $q(\mathbf{u})$ such that the bound associated with the standard variational sparse GP [Titsias, 2009] is recovered.

Further, since the complexity of our model is now $O(m^3)$ rather than $O(nm^2)$, we are free to increase $m$ to much greater values than the sparse GP representation. The effect of this is that we can have much richer models: for a squared exponential covariance function, we have far more basis-functions with which to model the data. In our UK apartment price example, we had no difficulty setting $m$ to 800, much higher than experience tells us is feasible with the sparse GP.

The ability to increase the number of inducing variables and the applicability to unlimited data make our method suitable for multiple output GPs [Alvarez and Lawrence, 2011]. We have also briefly discussed how this framework fits with other Gaussian process based models such as the GPLVM and GP classification. We leave the details of these implementations to future work.

In all our experiments our algorithm was run on a single CPU using the GPy Gaussian process toolkit (https://github.com/SheffieldML/GPy).

References

Mauricio A. Alvarez and Neil D. Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12:1425-1466, May 2011.

Lehel Csato and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641-668, 2002.

Andreas Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational Gaussian process dynamical systems. In Peter Bartlett, Fernando Peirrera, Chris Williams, and John Lafferty, editors, Advances in Neural Information Processing Systems, volume 24, Cambridge, MA, 2011. MIT Press.

Andreas Damianou, Carl Henrik Ek, Michalis K. Titsias, and Neil D. Lawrence. Manifold relevance determination. In John Langford and Joelle Pineau, editors, Proceedings of the International Conference in Machine Learning, volume 29, San Francisco, CA, 2012. Morgan Kauffman. To appear.

Mark N. Gibbs and David J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458-1464, 2000.

James Hensman, Magnus Rattray, and Neil D. Lawrence. Fast variational inference in the exponential family. NIPS 2012, 2012.

Matthew Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. arXiv preprint arXiv:1206.7051, 2012.

Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679-1704, 2005.

Neil D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 11 2005.

Joaquin Quinonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 0-262-18253-X.

Matthias Seeger, Christopher K. I. Williams, and Neil D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Christopher M. Bishop and Brendan J. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, 3-6 Jan 2003.

Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In Marina Meila and Xiaotong Shen, editors, Proceedings of the Eleventh International Workshop on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 21-24 March 2007. Omnipress.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In David van Dyk and Max Welling, editors, Proceedings of the Twelfth International Workshop on Artificial Intelligence and Statistics, volume 5, pages 567-574, Clearwater Beach, FL, 16-18 April 2009. JMLR W&CP 5.

Michalis K. Titsias and Neil D. Lawrence. Bayesian Gaussian process latent variable model. In Yee Whye Teh and D. Michael Titterington, editors, Proceedings of the Thirteenth International Workshop on Artificial Intelligence and Statistics, volume 9, pages 844-851, Chia Laguna Resort, Sardinia, Italy, 13-16 May 2010. JMLR W&CP 9.

Raquel Urtasun and Trevor Darrell. Local probabilistic regression for activity-independent human pose inference. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, 2008.