
MSc Stochastics and Financial Mathematics

Master Thesis

Computational Aspects of Gaussian Process Regression

Author: Camilla Tioli
Supervisor: dr. B. J. K. Kleijn (UvA)
Daily supervisor: dr. W. Bergsma (LSE)
Examination date: August 29, 2018

Korteweg-de Vries Institute for Mathematics

London School of Economics and Political Science


Abstract

Gaussian Process Regression (GPR) has proven to be an extremely useful methodology in classification, machine learning, geostatistics, genetic analysis, and so on. Estimation of GPR hyperparameters can be done using empirical Bayes methods, that is, by maximizing the marginal likelihood function. A major potential problem, which appears to have received relatively little attention in the literature, is the possible existence of multiple local maxima of the marginal likelihood, as a function of the hyperparameters, even for relatively simple problems. In fact, due to the non-convexity of the function, the optimization algorithm may not converge to the global maximum. This project will investigate the circumstances in which multiple local maxima can occur and strategies for finding the global maximum. Analyzing a first simple case, the work will be based on mathematical and numerical analysis of the marginal likelihood function and simulations. The results reveal that the estimated probability of having multiple local maxima, even for a very simple case, is positive.

Title: Computational Aspects of Gaussian Process Regression
Author: Camilla Tioli, [email protected], 11407484
Supervisor: dr. B. J. K. Kleijn (UvA)
Daily supervisor: dr. W. Bergsma (LSE)
Second Examiner: dr. S. G. Cox
Examination date: August 29, 2018

Korteweg-de Vries Institute for Mathematics
University of Amsterdam
Science Park 105-107, 1098 XG Amsterdam
http://kdvi.uva.nl

London School of Economics and Political Science
Houghton Street, London, WC2A 2AE, UK
www.lse.ac.uk/


Acknowledgements

To me, this thesis represents the final stage of my academic path, which started six years ago and is most likely ending with the Master of Science degree in Stochastics and Financial Mathematics at the University of Amsterdam (UvA). This work was conducted during a research period as a visiting student at the London School of Economics and Political Science, Department of Statistics.
First of all, I would like to thank both my supervisors, Bas Kleijn (UvA) and Wicher Bergsma (LSE). I specifically want to thank them for the helpful and useful suggestions they made, for reviewing my work and, most of all, for stimulating and pushing me to think when I was lost or got stuck while conducting this research. As it was my very first work and attempt at research, this thesis has been a big challenge for me.
Secondly, I would like to thank the LSE and the UvA for giving me the valuable opportunity to conduct my research and write my thesis in another university and environment, and Sonja Cox, for her time and for being my second reader.
Finally, I am grateful to my family, who have always supported me during my studies throughout all these years and made it possible for me to study abroad; to my best friend Federica, my person and my rock, who has never stopped encouraging me to pursue my goals; to my friend Isabella, who made Amsterdam my second home; to my most faithful friends and fellow students of this master, Caroline, Steven and Giovanni, without whom this master would not have been possible; and to all my old and new friends who made this experience unique.


Contents

Introduction

1. Gaussian Process Regression
   1.1. Gaussian processes
   1.2. Gaussian process regression model
   1.3. Prediction
   1.4. Covariance functions

2. Estimating GPR hyperparameters
   2.1. Two different approaches
        2.1.1. The fully Bayes approach
        2.1.2. The empirical Bayes approach
   2.2. Existence of multiple local maxima
   2.3. Asymptotic behaviour of the empirical Bayes methods

3. Regression with a centered Brownian motion prior
   3.1. Main results
        3.1.1. The case n = 2
   3.2. Proofs

4. Experimental Results
   4.1. Methods
        4.1.1. The line search method
        4.1.2. The trust region method
   4.2. Setting the problem
   4.3. Results: number of local maxima and location of the global maximum
   4.4. Comments and problems

5. Conclusion

A. Appendix A
   A.1. Bayes's rule
   A.2. Eigendecomposition of a matrix
   A.3. Matrix Derivatives
   A.4. Gaussian Identities

B. Appendix B
   B.1. The Linear Conjugate Gradient Method
   B.2. The Nonlinear Conjugate Gradient Method

Popular summary

Bibliography


Introduction

Over the last few decades, Gaussian process regression (GPR) has proven to be an extremely useful methodology in classification, machine learning, geostatistics, genetics and so on, due to many interesting properties, such as e.g. "ease of obtaining and expressing uncertainty in predictions and the ability to capture a wide variety of behaviour through a simple parametrization" [13].

Consider a dataset D of n observations, D = {(x_i, y_i) : i = 1, . . . , n}. Let us denote by X the vector of input values and by Y the vector of outputs or target values, which can be either continuous (in the regression case) or discrete (in the classification case). We wish to make predictions for new inputs X∗ that are not in the training data. The usual regression model is given by

yi = f(xi) + εi, for i = 1, . . . , n, (0.1)

where εi is the additive noise or error, which we assume to be Gaussian distributed; namely, for every i = 1, . . . , n the errors are independent and identically normally distributed,

εi ∼ N(0, σ_n^2).

We now need a prediction function f which can make predictions for all possible input values. A wide variety of methods have been proposed to deal with this problem. One approach consists of restricting our attention to a specific class of functions, say, for example, only polynomial functions of the inputs. However, an immediate issue comes along: we are deciding upon the richness of the class of functions. If we for instance choose to use the class of polynomial functions and the target function is not well modelled by this class, then the predictions will be poor. Hence, another approach consists of giving prior probability to predictive functions, where higher probabilities are given to functions considered to be more likely than others, for instance the class of all differentiable functions. Yet again a problem remains, i.e. there exists a much larger class of possible functions. This is where Gaussian processes come in [1]. Gaussian processes are "widely used as building blocks for prior distributions of unknown functional parameters" [11]. Thus, in our case, they are used for the prior distribution of the predictive function f, which can be seen as a parameter of the GPR model.
A stochastic process indexed by a set T is a collection of random variables W = (W(t) : t ∈ T) defined on a common probability space [27]. Then, a Gaussian process W is a stochastic process such that the finite-dimensional distributions of (W(t_1), W(t_2), . . . , W(t_m)), for all t_1, . . . , t_m ∈ T and m ∈ Z_+, are multivariate normally distributed [27]. In particular, we have that the finite-dimensional distributions

6

Page 7: Computational Aspects of Gaussian Process Regression · Computational Aspects of Gaussian Process Regression Author: Supervisor: Camilla Tioli dr. B. J. K. Kleijn (UvA) Examination

are completely specified by the mean function m : T → R and the covariance function k : T × T → R, given by

m(t) = E[W(t)]   and   k(s, t) = cov(W(s), W(t)).

Furthermore, if we consider the Gaussian process W, we can define the sample path t ↦ W(t) of W, which is a random function, and the law of W is then a prior distribution on the space of functions f : T → R. Thus, we see that a stochastic process governs the properties of functions [1].
If we go back to the regression model (0.1) and the prediction function f, since we can think of a function as a very long vector, such that each entry is the function value at a particular input x, i.e. f(x), it turns out that a Gaussian process is exactly what we need. Then the joint distribution of {f(x) : x ∈ X} is given by the law

f(x) ∼ N(m(x), k(x, x′))

for every x, x′ ∈ X, where X is the input space, and m(x) and k(x, x′) are the mean function and covariance function of the process, respectively. Hence, we put a prior distribution on the functional parameter f. A crucial property is that the joint distribution of any finite set of function values is still Gaussian. Indeed, we have a nonparametric regression model, but we always look at f at a finite number of input values. Moreover, Gaussian priors are conjugate in the GPR model, i.e. the posterior distribution is still Gaussian. In particular, the distribution of the vector of target values Y given X and f will be Gaussian.
Indeed, one of the main attractions of the Gaussian process framework is that it combines a consistent view with computational tractability [1]. For these reasons and others, such as the flexibility and good performance of Gaussian processes, and what we have already said at the beginning, Gaussian process regression has become an effective method for non-linear regression problems [1, 13].
Moreover, Neal [18] showed that many Bayesian regression models based on neural networks converge to Gaussian processes in the limit of an infinite network, and so Gaussian processes have been used as a replacement for supervised neural networks in non-linear regression and classification [13, 17]. In addition, empirical studies showed that Gaussian process regression has better predictive performance than other nonparametric models (Rasmussen 1996 [19]) such as the Support Vector Machine (SVM) [24, 25].
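To make the "distribution over functions" viewpoint concrete, the following minimal Python sketch (not part of the thesis; the squared exponential kernel, the input grid and all numerical values are illustrative assumptions) draws function values at a finite set of inputs from a zero-mean Gaussian process prior, which amounts to nothing more than sampling from a multivariate normal with covariance matrix K_ij = k(x_i, x_j).

```python
import numpy as np

def se_kernel(x, xp, length_scale=1.0):
    """Squared exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / length_scale ** 2)

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 1.0, 50)          # finite set of input points
K = se_kernel(x_grid, x_grid)               # covariance matrix K_ij = k(x_i, x_j)
K += 1e-10 * np.eye(len(x_grid))            # small jitter for numerical stability

# A GP prior restricted to x_grid is a multivariate normal N(0, K);
# each draw is one "sample path" of the prior evaluated on the grid.
samples = rng.multivariate_normal(np.zeros(len(x_grid)), K, size=3)
print(samples.shape)                        # (3, 50): three prior function draws
```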

Gaussian process regression models rely on an appropriate selection of the kernel, namely on the covariance function and the hyperparameters on which it depends. The covariance function encodes the assumptions we make about the function we wish to learn and it defines the similarity or nearness between datapoints [1]. Thus, the choice of the kernel has a big impact on the performance, and hence the predictive value, of the GPR model. Once the kernel, which controls the characteristics of the GPR model, has been chosen, the unknown hyperparameters on which it relies need to be determined. The common approach is to estimate them from the training data. Two methods can be adopted: the fully Bayes method which, through a hierarchical specification of models, gives a prior

7

Page 8: Computational Aspects of Gaussian Process Regression · Computational Aspects of Gaussian Process Regression Author: Supervisor: Camilla Tioli dr. B. J. K. Kleijn (UvA) Examination

distribution to the hyperparameters (often called a hyperprior), and the empirical Bayes (EB) method, which estimates the hyperparameters directly from the data. The first approach typically involves the implementation of Markov Chain Monte Carlo (MCMC) methods, which can perform Gaussian process regression without the need to estimate the hyperparameters [14, 17]. However, due to the high computational cost of MCMC methods, the hyperparameters are often estimated by means of maximum marginal likelihood [13, 18] (the EB approach), which considers the marginal likelihood as a function of the hyperparameters and estimates them by maximizing it.
Unfortunately, the marginal likelihood is not a convex function of the hyperparameters. A potential issue (which forms the focus of this thesis but has received relatively little attention in the literature) is the existence of multiple local maxima. In fact, multiple local optima can exist [16] and the optimized hyperparameters may not be the global maxima [15, 17].

The main focus of this thesis will be the analytical and numerical study of the marginal likelihood of a specific Gaussian process regression model. We will focus our attention on a very simple case. Let us consider the regression model given at the beginning (0.1). We are going to put on f a prior given by the distribution of a centered Brownian motion with scale parameter α, and we are going to assume that ε ∼ N(0, e^β). Therefore, the research topic will concern properties of the marginal likelihood as a function of two hyperparameters (the error-variance parameter β and the scale parameter α of the Gaussian process). These are estimated by means of empirical Bayes methods, i.e. by maximizing the marginal likelihood. Even in this very simple and restricted case, a major issue is the existence of multiple local maxima, and thus the difficulty of finding the global maximum. Therefore, we investigate the theoretical properties of the marginal likelihood and, using the software Wolfram Mathematica 11.2, we run simulations to investigate the issues that may be theoretically intractable. We then find interesting results, such as the asymptotic behaviour of the marginal likelihood, and properties of its ridges, where it is most likely to find the local maxima. Moreover, simulations allow us to compute the estimated probability of having one or more local maxima and to locate the global maximum. Finally, and importantly, we find a good automatic and systematic method for finding a number of (and possibly all) the local maxima, through the implementation of the Conjugate Gradient optimisation algorithm. This study could help researchers find the global maximum of the likelihood in order to retrieve the most useful information from the data.

This thesis is organized as follows. In the first chapter a theoretical framework for Gaussian process regression is presented, necessary for the next chapters. Basic definitions are given, such as the definition of a Gaussian process, and the main ingredients of this thesis are described. The Gaussian process regression model is then characterized by its covariance function, which depends on the hyperparameters.

The main purpose of chapter 2 is to introduce the problem of estimating the kernel hyperparameters. Two approaches are presented and explained, the fully Bayes and

8

Page 9: Computational Aspects of Gaussian Process Regression · Computational Aspects of Gaussian Process Regression Author: Supervisor: Camilla Tioli dr. B. J. K. Kleijn (UvA) Examination

empirical Bayes methods. Then the main problem and focus of this thesis is introduced: the existence of multiple local maxima of the marginal likelihood as a function of the GPR hyperparameters. Finally, we look at some asymptotic properties of the empirical Bayes methods.

In the third chapter we present the GPR model that we are going to study. There will be a first introductory part, where some things already seen in the first chapter are recalled. The real aim of this chapter is to theoretically study the marginal likelihood as a function of the hyperparameters, deriving for instance the asymptotic behaviour of the function, in order to understand where the local maxima of the function are and how many local maxima the function can have. The main results are explained in two propositions, and the proofs follow.

Chapter 4 covers the numerical study of the marginal likelihood. A brief explanation of the numerical methods implemented by the software Mathematica is given. Then, using the main results of the previous chapters, and in particular of chapter 3, a systematic method for finding the multiple local maxima of the marginal likelihood is given. Simulations are run in order to estimate the probability of having one or more local maxima, and to find the location of the global maximum. Finally, the experimental results are presented, and related problems are discussed.

We finish with the conclusion, where we briefly summarize what we saw in the previous chapters, the main results we accomplished and the main problems we faced, and lastly we look at possible developments and open problems for further research.

9

Page 10: Computational Aspects of Gaussian Process Regression · Computational Aspects of Gaussian Process Regression Author: Supervisor: Camilla Tioli dr. B. J. K. Kleijn (UvA) Examination

1. Gaussian Process Regression

In this chapter we set up a theoretical framework for Gaussian process regression, necessary for the following chapters. The main concepts and ingredients will be presented. We explain what a Gaussian process is and give the special regression model. Moreover, we are going to see how the Gaussian process regression model can be characterized and its main purpose, making predictions. Finally, we will briefly introduce the main goal of this thesis.

1.1. Gaussian processes

Since Gaussian processes are the main ingredient of the regression model, it is important to spend a section on them, explaining what a Gaussian process consists of. Hence, first of all we start with the definition of a stochastic process [10] and its finite-dimensional distributions [8].

Definition 1.1. Let (Ω, F, P) be a probability space. A d-dimensional stochastic process indexed by I ⊂ [0, ∞) is a family of random variables X_t : Ω → R^d, t ∈ I. We write X = (X_t)_{t∈I}. I is called the index set and R^d the state space of the process.

Definition 1.2. Let (X_t)_{t∈T} be a real-valued stochastic process. The distributions of the finite-dimensional vectors of the form (X_{t_1}, X_{t_2}, . . . , X_{t_n}), for t_1 < t_2 < · · · < t_n, are called the finite-dimensional distributions (fdd's) of the process.

A Gaussian process is a type of stochastic process with specific properties, and it is defined as follows.

Definition 1.3. A real-valued stochastic process is called Gaussian if all its fdd's are multivariate normal distributions.

In particular, the definition of a Gaussian process implies a consistency requirement; in other words, the distribution of a subset of variables is marginal to the distribution of a larger set of variables [1]. In fact, Kolmogorov's existence theorem guarantees that a suitably consistent collection of finite-dimensional distributions will define a stochastic process. Thus, let us give the definition of consistent fdd's and Kolmogorov's existence theorem, both taken from [8].

Definition 1.4. Let T ⊂ R and let (E, E) be a measurable space. For all n ∈ Z_+ and all t_1 < t_2 < · · · < t_n, t_i ∈ T, i = 1, . . . , n, let µ_{t_1,...,t_n} be a probability measure on (E^n, E^n). This collection of measures is called consistent if it has the property that

$$\mu_{t_1,\dots,t_{i-1},t_{i+1},\dots,t_n}(A_1\times\dots\times A_{i-1}\times A_{i+1}\times\dots\times A_n) = \mu_{t_1,\dots,t_n}(A_1\times\dots\times A_{i-1}\times E\times A_{i+1}\times\dots\times A_n),$$

for all A_1, . . . , A_{i−1}, A_{i+1}, . . . , A_n ∈ E.

Theorem 1.5 (Kolmogorov's existence theorem). Suppose that E is a Polish space and E its Borel σ-algebra. Let T ⊂ R and for all n ∈ Z_+ and all t_1 < t_2 < · · · < t_n, t_i ∈ T, i = 1, . . . , n, let µ_{t_1,...,t_n} be a probability measure on (E^n, E^n). If the measures µ_{t_1,...,t_n} form a consistent system, then there exists a probability measure P on E^T, such that the canonical (or co-ordinate variable) process (X_t)_{t∈T} on (Ω = E^T, F = E^T, P), defined by

X(ω) = ω,   X_t(ω) = ω_t,

has fdd's µ_{t_1,...,t_n}.

A detailed proof can be found e.g. in Billingsley 1995 [26].

Let us now suppose that we have a Gaussian process X = (X_t)_{t∈T}; then we can define m(t) = E X_t, t ∈ T, as the mean function of the process, and k(s, t) = cov(X_s, X_t), (s, t) ∈ T × T, as the covariance function of the process. A Gaussian process is therefore completely specified by its mean function and covariance function, and we write X ∼ GP(m(t), k(s, t)). In general, a Gaussian process X with E X_t = 0 for all t ∈ T is called centered.
In particular, if a Gaussian process (X_t)_{t∈T} is such that the mean function is m(t) = 0 and the covariance function is k(s, t) = s ∧ t, then X is a Brownian motion. Thus, let us give another way to characterize this special stochastic process, which we will see again in chapter 3.

Definition 1.6. The stochastic process W = (W_t)_{t∈T}, with T = [0, τ], is called a (standard) Brownian motion or Wiener process, if

i) W_0 = 0 a.s.;

ii) W is a stochastic process with stationary, independent increments, namely W_t − W_s ∼ W_{t−s} − W_0 for s ≤ t ≤ τ;

iii) W_t − W_s ∼ N(0, t − s);

iv) almost all sample paths are continuous.
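The properties of Definition 1.6 can be illustrated with a short simulation. The Python sketch below (an illustration under assumed step sizes and random seed, not part of the thesis) builds approximate Brownian paths from independent Gaussian increments and checks empirically that cov(W_s, W_t) ≈ s ∧ t.

```python
import numpy as np

rng = np.random.default_rng(1)
tau, n_steps = 1.0, 1000
dt = tau / n_steps
t = np.linspace(0.0, tau, n_steps + 1)

# One path: W_0 = 0 and independent increments W_t - W_s ~ N(0, t - s).
increments = rng.normal(0.0, np.sqrt(dt), size=n_steps)
W = np.concatenate(([0.0], np.cumsum(increments)))
print(W[0], W[-1])                            # starts at 0; W_tau is N(0, tau)

# Empirical check of the covariance k(s, t) = s ∧ t, averaged over many paths.
paths = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(5000, n_steps)), axis=1)
i, j = 300, 700                               # W at times t[i + 1] and t[j + 1]
print(np.mean(paths[:, i] * paths[:, j]), min(t[i + 1], t[j + 1]))
```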

1.2. Gaussian process regression model

Now that we have introduced what a Gaussian process is, we can start talking about the regression model. There are many ways to interpret Gaussian process regression models. One is the weight-space view, or parametric model, which is more familiar and accessible to many; the other is the function-space view, or non-parametric model, which is the one we are going to follow. Indeed, we can see a Gaussian process as a distribution over functions, and the inference takes place directly in the function space. We are going to do this in a Bayesian framework [1].


Let us suppose we have a training set D = {(x_i, y_i) : i = 1, . . . , n}, where x_i are the input values and y_i the output or target values, such that they follow the regression model given by

yi = f(xi) + εi. (1.1)

We have assumed that the target values y_i differ from the function values f(x_i) by a factor ε_i, the additive noise. Furthermore, we assume that this noise follows an independent, identically distributed Gaussian distribution with mean zero and variance σ_n^2,

εi ∼ N(0, σ_n^2). (1.2)

One of the main purposes is making inference about the relationship between inputs and targets, namely the conditional distribution of the targets given the inputs. We denote by X = (x_1, . . . , x_n) and Y = (y_1, . . . , y_n) the vectors of input and target values, respectively. By the noise assumption (1.2) and the model (1.1), the following holds,

$$p(Y|X, f) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\Big(-\frac{(y_i - f(x_i))^2}{2\sigma_n^2}\Big) \qquad (1.3)$$

$$= \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma_n^2}\,|Y - f|^2\Big) = \mathcal{N}(f,\, \sigma_n^2 I), \qquad (1.4)$$

where |z| denotes the Euclidean length of the vector z and I is the identity matrix. Finally, we assume that for every x, x′ ∈ X,

f(x) ∼ GP(m(x), k(x, x′)), (1.5)

which is the so-called prior distribution of f. We see that this time the Gaussian process is given by the function values at each input x ∈ X, i.e. f(x) for x ∈ X. Hence, the stochastic process is no longer indexed by time, as in definition 1.1: instead of the index set T, we now have X, the set of possible inputs. For notational convenience we denote f := f(X), f_i := f(x_i), and in (1.5) we suppose m(x) to be zero for every x ∈ X, even if this is not necessary. Note that using a zero mean function represents a lack of prior knowledge; it does not mean that we expect the predictive function f to be distributed evenly around zero, i.e. taking positive and negative values over equal parts of its range.
We can then compute the posterior distribution of f, given the target values Y and the input values X, which, by Bayes's rule (A.4), writing only the terms of the likelihood (1.4) and the prior (1.5) which depend on f, and completing the square, is given by

$$p(f|X, Y) \propto \exp\Big(-\frac{1}{2\sigma_n^2}(Y - f)^T(Y - f)\Big)\exp\Big(-\frac{1}{2}\,f^T K^{-1} f\Big) \propto \exp\Big(-\frac{1}{2}(f - \bar f)^T\Big(\frac{1}{\sigma_n^2} I + K^{-1}\Big)(f - \bar f)\Big),$$


where K is the n × n covariance matrix of f, whose (i, j)-element is given by K_{ij} = k(x_i, x_j), and $\bar f = \frac{1}{\sigma_n^2}\big(\frac{1}{\sigma_n^2} I + K^{-1}\big)^{-1} Y$. We recognize the form of the posterior distribution as Gaussian with mean $\bar f$ and covariance matrix A^{-1},

$$p(f|X, Y) = \mathcal{N}\Big(\bar f = \frac{1}{\sigma_n^2} A^{-1} Y,\; A^{-1}\Big), \qquad (1.6)$$

where $A = \frac{1}{\sigma_n^2} I + K^{-1}$.
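As a concrete illustration of (1.6), the following Python sketch computes the posterior mean and covariance at the training inputs by forming A = σ_n^{-2} I + K^{-1} directly, which is fine for the small n used here (for larger problems one would instead solve systems in K + σ_n^2 I via a Cholesky factorisation). The squared exponential kernel, the simulated data and the noise level are assumptions made purely for the example.

```python
import numpy as np

def se_kernel(x, xp, length_scale=0.3):
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / length_scale ** 2)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 7))                    # inputs X
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=7)     # targets Y
sigma2_n = 0.1 ** 2                                      # noise variance

K = se_kernel(x, x) + 1e-10 * np.eye(len(x))             # prior covariance of f at X

# Posterior (1.6): p(f | X, Y) = N(fbar, A^{-1}) with A = I / sigma_n^2 + K^{-1}.
A = np.eye(len(x)) / sigma2_n + np.linalg.inv(K)
A_inv = np.linalg.inv(A)
f_bar = A_inv @ y / sigma2_n

print(f_bar)             # posterior mean of f at the training inputs
print(np.diag(A_inv))    # posterior variances at the training inputs
```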

1.3. Prediction

Even if prediction is not the main purpose of this thesis, for completeness it is important to discuss it briefly.

Again, let us suppose we have a dataset D of n observations, D = {(x_i, y_i) : i = 1, . . . , n}. Given the training data, let us suppose we have new input values X∗, such that X∗ ∉ D, and we wish to make predictions for X∗. Let us denote f(X∗) := f∗. To predict the function values f∗, the joint distribution of the observed target values Y and the predictive targets f∗ is given by

$$\begin{bmatrix} Y \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0,\; \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right).$$

Indeed, this follows from (1.5), where we assumed m(x) = 0 for every x ∈ X, and from the fact that Y ∼ N(0, K(X, X) + σ_n^2 I), since Y is the sum of two multivariate normal random vectors, f and ε. From this we can derive the conditional distribution of f∗, the key predictive equation for Gaussian process regression, given by (see A.15)

$$f_*\,|\,X, Y, X_* \sim \mathcal{N}\big(\bar f_*,\; \mathrm{cov}(f_*)\big), \qquad (1.7)$$

where

$$\bar f_* = \mathbb{E}[f_*|X, Y, X_*] = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} Y, \qquad (1.8)$$

$$\mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*). \qquad (1.9)$$

We see that (1.8), the predictive mean, is a linear combination of the observed output values Y, which is why it is sometimes called a linear predictor. By contrast, (1.9), the predictive variance, does not depend on Y but only on the input values X and X∗. Moreover, we can notice that (1.9) is the difference between K(X∗, X∗), which is simply the prior covariance, and K(X∗, X)[K(X, X) + σ_n^2 I]^{-1} K(X, X∗), representing the information the observations give us about the function [1].
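A minimal Python sketch of the predictive equations (1.8) and (1.9) is given below. It is not part of the thesis: the squared exponential kernel, the simulated data and the noise variance are illustrative assumptions, and a Cholesky factorisation is used so that [K(X, X) + σ_n^2 I]^{-1} is never formed explicitly.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def se_kernel(a, b, length_scale=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0.0, 1.0, 7))                    # training inputs
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=7)     # training targets
X_star = np.linspace(0.0, 1.0, 5)                        # test inputs
sigma2_n = 0.1 ** 2

Ky = se_kernel(X, X) + sigma2_n * np.eye(len(X))         # K(X, X) + sigma_n^2 I
Ks = se_kernel(X_star, X)                                # K(X_*, X)
Kss = se_kernel(X_star, X_star)                          # K(X_*, X_*)

c = cho_factor(Ky, lower=True)                           # Cholesky of K(X, X) + sigma_n^2 I
f_star_mean = Ks @ cho_solve(c, Y)                       # predictive mean, equation (1.8)
f_star_cov = Kss - Ks @ cho_solve(c, Ks.T)               # predictive covariance, equation (1.9)

print(f_star_mean)
print(np.diag(f_star_cov))                               # predictive variances
```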

1.4. Covariance functions

The covariance function (or kernel), k(·, ·), plays a crucial role in Gaussian process regression. So far we did not specify any particular covariance function or matrix. However,


from equations (1.8) and (1.9) we can see how crucial the choice of k is in both the predictive mean and variance. Indeed, according to [1], it encodes the assumptions we make about the function we wish to learn and it defines the similarity or nearness between datapoints. In fact, close datapoints X are more likely to have the same or close target points Y, and so training points close to a test point x∗ become informative about f∗, the prediction at that point. In Gaussian process regression it is the covariance function that describes and defines this closeness between the training points. As a result the choice of the covariance function has an important impact on the prediction.

Kernel covariance functions are defined as k : X × X → R. However, not any arbitrary function of two input points x, x′ ∈ X can serve as a covariance function.
First of all, a function of two arguments, mapping a pair of inputs x, x′ ∈ X into R, is called a kernel. In particular, a real kernel is said to be symmetric if k(x, x′) = k(x′, x). Then, given the set of inputs X = {x_1, . . . , x_n}, we can define the matrix whose entries are given by K_{ij} = k(x_i, x_j). If k is a covariance function we call the matrix K a covariance matrix, and it is clear that in this case, by definition, k is symmetric. Moreover, K is called positive semidefinite (PSD) if v^T K v ≥ 0 for every vector v ∈ R^n, and if a matrix is symmetric then it is PSD if and only if all its eigenvalues are non-negative. Since K is a covariance matrix, it must be positive semidefinite. In the same way, we say that the kernel k is positive semidefinite if

$$\int\!\!\int k(x, x')\,f(x)\,f(x')\,d\mu(x)\,d\mu(x') \ge 0,$$

for all f ∈ L^2(X, µ), where µ denotes a measure.

We are now going to list some of the most important and most used kernel functions, taken from [1]. However, in Chapter 3 we will not use any of them, since we will restrict our attention to a simpler case.

The squared exponential. The squared exponential kernel is defined as

$$k_{SE}(x, x') = \exp\Big(-\frac{(x - x')^2}{2l^2}\Big), \qquad (1.10)$$

with parameter l defining the characteristic length-scale. This covariance function is infinitely differentiable, which means that the Gaussian process with this covariance function is very smooth. Moreover, it is the most widely used kernel within the kernel machine field [1].

The Matérn class. The Matérn class of covariance functions is given by

$$k_{\mathrm{Matern}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu\,(x - x')^2}}{l}\right)^{\nu} K_\nu\left(\frac{\sqrt{2\nu\,(x - x')^2}}{l}\right), \qquad (1.11)$$

with positive parameters ν and l, where K_ν is a modified Bessel function [1].


The γ-exponential. The γ-exponential kernel is given by

$$k(x, x') = \exp\big(-\left(|x - x'|/l\right)^{\gamma}\big) \quad \text{for } 0 < \gamma \le 2, \qquad (1.12)$$

with l again a positive parameter.

The rational quadratic. The rational quadratic covariance function is defined as

$$k_{RQ}(x, x') = \left(1 + \frac{(x - x')^2}{2\alpha l^2}\right)^{-\alpha}, \qquad (1.13)$$

with α, l > 0.

Periodic. The periodic covariance function is especially used to model functions which show a periodic behaviour, and it is given by

$$k_{PER}(x, x') = \exp\left(-\frac{2\sin^2\big(\pi(x - x')/p\big)}{l^2}\right), \qquad (1.14)$$

where p is the period of the function and l has the same meaning as for the squared exponential (SE) kernel.

There are several other covariance functions, like linear, polynomial, dot-product and of course constant ones, but they are not of interest in this thesis.
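For concreteness, a few of the kernels above can be written down directly as functions of a pair of inputs. The Python sketch below is only an illustration (the parameter values are arbitrary assumptions, and the Matérn class is omitted because it requires the modified Bessel function K_ν, available e.g. as scipy.special.kv).

```python
import numpy as np

def k_se(x, xp, l=1.0):
    """Squared exponential kernel (1.10)."""
    return np.exp(-((x - xp) ** 2) / (2.0 * l ** 2))

def k_rq(x, xp, alpha=2.0, l=1.0):
    """Rational quadratic kernel (1.13)."""
    return (1.0 + (x - xp) ** 2 / (2.0 * alpha * l ** 2)) ** (-alpha)

def k_per(x, xp, p=1.0, l=1.0):
    """Periodic kernel (1.14)."""
    return np.exp(-2.0 * np.sin(np.pi * (x - xp) / p) ** 2 / l ** 2)

# Covariance matrix K_ij = k(x_i, x_j) for a small set of inputs.
x = np.linspace(0.0, 2.0, 5)
K = k_se(x[:, None], x[None, :], l=0.5)
print(np.round(K, 3))
```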

Instead, we can look at the free parameters of the covariance function, e.g. the characteristic length-scale l in (1.10). In general we call them hyperparameters, since they are parameters of the prior distribution (rather than the model distribution); this will become clearer later (chapter 2). Such hyperparameters can control "the amount of noise in a regression model, the scale of variation in a regression function, the degree to which various input variables are relevant, and the magnitudes of different additive components of a model" [14].
The number of Gaussian process regression (GPR) hyperparameters can vary from two or three for a very simple regression model up to several dozen or more for a model with many inputs and, in particular, they can influence the predictive value of the model [13]. There is plenty of scope for variation even inside only one family of covariance functions. However, rather than exploring the effect of varying the GPR hyperparameters and seeing how these affect the predictability of the model, we are interested in the estimation of the GPR hyperparameters. Indeed, in the next chapter we will consider different methods for determining the hyperparameters from the training data.


2. Estimating GPR hyperparameters

In chapter 1 we have seen how to do regression using Gaussian processes with a given covariance function. In particular, we have considered some of the most important and most used kernels, many of which depend on several hyperparameters whose values need to be determined. Since covariance functions control properties of the prediction function, such as smoothness, rate of variability and periodicity, we see how important it is to select and properly determine the parameters.

In this chapter we are going to introduce two different approaches to estimation of the GPR hyperparameters from the training data: the fully Bayes and empirical Bayes methods. In particular, in chapters 3 and 4, we will use the second approach, that is, maximizing the marginal likelihood function. Thus, we will explain why one is preferred over the other. Moreover, we will describe one of the major possible problems that one can encounter, the existence of multiple local maxima, the main focus of this thesis. Finally, we will present some asymptotic properties of the empirical Bayes methods.

2.1. Two different approaches

In Gaussian process regression models the parameters need to be estimated from the training data. We can distinguish two different approaches: the fully Bayes approach, which uses a hierarchical specification of priors, giving prior distributions to the hyperparameters, and the empirical Bayes (EB) approach, which estimates the prior from the data [11]. The EB method uses maximum likelihood for estimating the hyperparameters [18], namely the marginal likelihood as a function of the hyperparameters is maximized using optimization algorithms, such as the Newton or Conjugate Gradient algorithm. In the fully Bayes methods, predictions are generally made by averaging over the posterior distribution of the hyperparameters, implementing e.g. Markov Chain Monte Carlo (MCMC) methods, which can perform without the need to estimate the hyperparameters [14, 17]; alternatively, due to the high computational cost of MCMC methods, the GPR hyperparameters are estimated by means of maximum marginal likelihood [13].

However, "common practice suggests that a fully Bayesian approach may not be worthwhile" [15]. There are several reasons why sampling kernel hyperparameters is not yet as popular as the empirical Bayes approach. Some of them (taken from [15]) include:

• Historically, tuning (kernel) hyperparameters has not received much attention, even though they can significantly affect the performance and predictive value of the GPR model. Therefore, the literature is relatively "poor" regarding this topic.


This is partly due to the fact that the more well-known SVM framework only provides heuristics for kernel hyperparameter tuning, and these are not really studied and discussed in papers.

• Sampling methods are typically not as computationally efficient to implement as optimization. Moreover, it is not clear which sampling method should be implemented for kernel hyperparameters, even if MCMC and slice sampling are two common methods.

• The marginal likelihood as a function of the kernel hyperparameters is often peaked for the kernels that are most widely used, such as the squared exponential (SE) kernel. Rasmussen (1996) [19], through some experiments using the SE kernel, found no strong empirical advantages of sampling over marginal likelihood optimization.

However, there are areas, like Bayesian Optimization (see e.g. [28]), where sampling kernel hyperparameters has become really popular. Nonetheless, given also the previous reasons, we decided to adopt the empirical Bayes approach in our study, as will be seen later in chapters 3 and 4.

Let us now look at the two approaches in more detail. The discussion will be general but focused on the specific treatment of Gaussian process models for regression. Here we refer to the GPR hyperparameters as the covariance function parameters; however, it is possible to incorporate and treat in the same way the parameters of the noise distribution (e.g. σ_n^2 in the regression model (1.1), where the additive noise has normal distribution, ε ∼ N(0, σ_n^2)).

2.1.1. The fully Bayes approach

As we said before, in the fully Bayes approach it is common to use a hierarchical specification of priors. At the lowest level is the parameter f (the prediction function can be seen as a parameter), and at the second level are the hyperparameters θ which control the distribution of f.
Inference takes place one level at a time, on the posterior distribution. At the bottom level, by Bayes's rule (A.4) the posterior distribution is given by

$$p(f|Y, X, \theta) = \frac{p(Y|X, f)\,p(f|\theta)}{p(Y|X, \theta)}, \qquad (2.1)$$

where p(f|θ) is the prior distribution of f and p(Y|X, f) the likelihood function. The prior, as a probability distribution, represents the knowledge of the parameter f that we have before seeing the data. The posterior combines information from the prior and the data (through the likelihood). The probability density p(Y|X, θ) is the normalizing constant; it does not depend on the parameter f, and it is given by

$$p(Y|X, \theta) = \int p(Y|X, f)\,dp(f|\theta). \qquad (2.2)$$


At the second level, in the same way, the posterior distribution of the hyperparameters θ is given by

$$p(\theta|Y, X) = \frac{p(Y|X, \theta)\,p(\theta)}{p(Y|X)}, \qquad (2.3)$$

where p(θ) is the prior distribution of the hyperparameters, often called the hyperprior. The normalizing constant is given by

$$p(Y|X) = \int p(Y|X, \theta)\,p(\theta)\,d\theta. \qquad (2.4)$$

We see that the use of a fully Bayes approach involves the evaluation of several integrals. Depending on the model, some of them may not be analytically tractable and therefore they could require analytical approximations or MCMC methods. Moreover, the behaviour of Gaussian process regression models is extremely sensitive to the choice of the kernel, and therefore it is important to choose it properly and determine the hyperparameters. By contrast, the prior distributions of the hyperparameters have overall little impact on the performance of GPR models, which implies that simple priors, like e.g. the uniform distribution on an appropriate interval, may be sufficient in GPR modelling in terms of predictability [13].

2.1.2. The empirical Bayes approach

Bayesian analysis, in addition to the likelihood function, depends on prior distributions for the model parameters. This prior, which in our case is the prior of the predictive function f, can be parametric or nonparametric, depending on other parameters which may themselves be drawn from a second-level prior. This sequence of parameters and prior distributions constitutes a hierarchical model. The hierarchy stops at some point, assuming all the remaining prior parameters known. The empirical Bayes approach, rather than assuming this, uses the observed data to estimate these final stage parameters, proceeding as if the prior distributions were known (for more details see [2]).

In particular, in Gaussian process regression, once a prior distribution for f has been chosen, and so its covariance function, the hyperparameters θ of this covariance function will not be given any prior distribution. The same will be done for the parameter of the additive noise ε, i.e. σ_n^2: it will not be given any prior distribution. The estimation of the hyperparameters then proceeds as follows. The common approach is to estimate them by means of maximum marginal likelihood, namely to find the maximum of the marginal likelihood as a function of the hyperparameters θ. Recall the model (1.1) and the assumptions we made in chapter 1,

yi = f(xi) + εi, for every i = 1, . . . , n, (2.5)

where the additive noise ε is such that

εi ∼ N(0, σ_n^2), (2.6)


and the joint distribution of {f(x) : x ∈ X} is given by the law

f(x) ∼ N(m(x), k(x, x′)), (2.7)

for every x, x′ ∈ X. Hence, if we then look at the vectors ε and f, and suppose that m(x) = 0 for every x ∈ X, we have

ε ∼ N(0, σ_n^2 I)   and   f ∼ N(0, K), (2.8)

where I is the n × n identity matrix and K the covariance matrix of the Gaussian process f, such that each matrix entry is given by K_{ij} = k(x_i, x_j), for every x_i, x_j ∈ X. Therefore, from (2.5) and (2.8), the distribution of the outputs Y is given by

p(Y|X, θ) ∼ N(0, Σ_θ), (2.9)

where Σ_θ is the covariance matrix depending on the unknown hyperparameters θ and given by Σ_θ = K + σ_n^2 I. Thus, the log marginal likelihood is given by

$$\log L(\theta) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_\theta| - \frac{1}{2}\, Y^T \Sigma_\theta^{-1} Y, \qquad (2.10)$$

where we denote by L(·) the marginal likelihood as a function of the hyperparameters θ. We can see three different terms in (2.10). The only one which depends on the observed targets Y is −(1/2) Y^T Σ_θ^{-1} Y; −(1/2) log|Σ_θ| is the complexity penalty, depending only on the covariance function Σ_θ and the inputs X; and finally −(n/2) log 2π is a normalization constant [1]. We then estimate the hyperparameters by setting the derivatives of (2.10) with respect to θ equal to zero. Using (A.11) and (A.12), we get

$$\frac{\partial}{\partial\theta_i}\log L(\theta) = \frac{1}{2}\, Y^T K^{-1}\frac{\partial K}{\partial\theta_i} K^{-1} Y - \frac{1}{2}\,\mathrm{Tr}\Big(K^{-1}\frac{\partial K}{\partial\theta_i}\Big). \qquad (2.11)$$

The complexity of computing the marginal likelihood (2.10) is mainly due to the inverse matrix K^{-1}. Standard methods for inverting a positive definite symmetric matrix such as K require time O(n^3) for an n × n matrix. Once K^{-1} is known, the computation of the derivatives in (2.11) requires time O(n^2) per hyperparameter (see [1]).
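The computation described above can be sketched in a few lines of Python. The code below is an illustration, not part of the thesis: it assumes an SE kernel with hyperparameters θ = (log length-scale, log noise variance), evaluates the negative log marginal likelihood (2.10) with a Cholesky factorisation (the O(n^3) step), and returns the gradient via (2.11), where the role of K is played here by Σ_θ = K + σ_n^2 I.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def neg_log_marglik_and_grad(theta, x, y):
    """(2.10) and (2.11) for an SE kernel plus Gaussian noise; theta on log scale."""
    log_l, log_s2 = theta
    l, s2 = np.exp(log_l), np.exp(log_s2)
    n = len(x)
    sq = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * sq / l ** 2)                     # SE kernel matrix
    Sigma = K + s2 * np.eye(n)                         # Sigma_theta = K + sigma_n^2 I

    c = cho_factor(Sigma, lower=True)                  # O(n^3) Cholesky factorisation
    alpha = cho_solve(c, y)                            # Sigma^{-1} Y
    logdet = 2.0 * np.sum(np.log(np.diag(c[0])))       # log |Sigma_theta|
    nll = 0.5 * (n * np.log(2 * np.pi) + logdet + y @ alpha)

    Sigma_inv = cho_solve(c, np.eye(n))
    # Derivatives of Sigma_theta with respect to the log-hyperparameters.
    dSigma = {"log_l": K * sq / l ** 2, "log_s2": s2 * np.eye(n)}
    grad = []
    for key in ("log_l", "log_s2"):
        D = dSigma[key]
        # (2.11): 0.5 Y^T S^{-1} D S^{-1} Y - 0.5 Tr(S^{-1} D), negated for the nll.
        grad.append(-(0.5 * alpha @ D @ alpha - 0.5 * np.trace(Sigma_inv @ D)))
    return nll, np.array(grad)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=20)
print(neg_log_marglik_and_grad(np.log([0.3, 0.01]), x, y))
```

A pair like this can be handed directly to a gradient-based optimiser, e.g. scipy.optimize.minimize(neg_log_marglik_and_grad, theta0, args=(x, y), jac=True).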

If the marginal likelihood is not analytically tractable, then estimating the hyperparameters can become difficult. Perhaps the most successful approach has been to "approximate the marginal likelihood by the use of e.g. Laplace, EP or variational approximations and then optimize the approximate marginal likelihood with respect to the hyperparameters" [15]. Another way could be the implementation of the Expectation Maximization (EM) algorithm, which is an ascent method for maximizing the likelihood, but it is only guaranteed to converge to a stationary point of the likelihood function (see [2, 20]).


2.2. Existence of multiple local maxima

The marginal likelihood can be multimodal as a function of the hyperparameters. In particular, the choice of the covariance function can influence the convexity of the marginal likelihood as well. Thus, there are cases where the likelihood is not convex with respect to the hyperparameters and therefore the optimisation algorithm may converge to a local optimum, whereas the global one could provide better results [16]. In particular, the optimized hyperparameters may not be the global optima [15, 17]. In fact there is no guarantee that the likelihood function does not suffer from multiple local optima.
A possible way to tackle this issue is to use multiple starting points randomly selected from a specific prior distribution and, after convergence, choose the maximum point with the highest marginal likelihood value as the estimate [13], or, for example in data with periodic structure, to initialize all the parameters which were part of the previous kernel to their previous values [16]. Anyway, care should be taken not to end up with a bad local maximum.
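The restart strategy just described is easy to sketch. The Python code below is an illustration under assumed data, kernel and starting-point distribution (it is not the Mathematica procedure used later in this thesis): it runs a local optimiser of the negative log marginal likelihood from several random initial points and keeps the solution with the largest marginal likelihood.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 15))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=15)
sq = (x[:, None] - x[None, :]) ** 2

def nll(theta):
    """Negative log marginal likelihood (2.10) for an SE kernel; theta on log scale."""
    l, s2 = np.exp(theta)
    Sigma = np.exp(-0.5 * sq / l ** 2) + s2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (len(x) * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(Sigma, y))

# Restart a local optimiser from several random starting points and keep the
# run that attains the smallest nll, i.e. the largest marginal likelihood.
starts = rng.uniform(low=[-3.0, -7.0], high=[2.0, 1.0], size=(10, 2))
runs = [minimize(nll, s, method="L-BFGS-B") for s in starts]
best = min(runs, key=lambda r: r.fun)
print(np.exp(best.x), best.fun)   # fitted (length-scale, noise variance) and nll
```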

We have thus seen that, in general, when estimating the GPR hyperparameters through optimisation of the marginal likelihood, the existence of local optima represents a possible major problem which has actually received very little attention in the literature. What we are going to do next is to consider a special case of the Gaussian process regression model, where a covariance function which depends on only one parameter is taken. We will show that even in this very simple case the problem of local optima can occur, and our final goal is to find a systematic method to find all the local optima of the marginal likelihood, and in particular the global one.

Finally, figure 2.1, taken from [1], shows an example of the marginal likelihood as a function of the hyperparameters with two local maxima. Every local maximum corresponds to a particular interpretation of the data [1]. One optimum corresponds to a relatively complicated model with low noise, whereas the other corresponds to a much simpler model with more noise. Panel (a) shows the marginal likelihood as a function of the hyperparameters l, the characteristic length-scale, and σ_n^2, the variance of the additive noise ε, in the GPR model with the SE kernel given by (1.10), for a dataset of 7 observations. We see that there are two local optima, indicated by '+'. The global optimum corresponds to low noise and a short length-scale, while the local optimum has high noise and a long length-scale. In (b) and (c) the inferred underlying functions (with 95% confidence bands) are shown for each of the two solutions [1]. We can think of the first figure in (b) as an overfitting model: indeed we only have 7 observations and 2 hyperparameters, and the residuals are almost zero, so the model is also explaining the noise. In (c), by contrast, there seems to be misspecification: the residuals are large, and the model does not explain enough.


Figure 2.1.


2.3. Asymptotic behaviour of the empirical Bayes methods

After having described the empirical Bayes and fully Bayes approaches, a natural question that comes along is whether nonparametric Bayesian procedures "work". In our case, we are now interested in looking at the asymptotic behaviour of the empirical Bayes methods. The discussion will be general, but the adaptation to our case easily follows.

Let us suppose we have observations X^n that are distributed according to P^n_θ, conditionally on some parameter θ ∈ Θ (in the GPR model the parameter would be the predictive function f). Let Π(·|λ), λ ∈ Λ, be a family of prior distributions on Θ, with hyperparameter λ (in the GPR model the family of prior distributions is given by the Gaussian processes and λ are the GPR hyperparameters). We adopt the "frequentist" view, namely we assume that the observations are generated according to a true parameter, say θ_0, and we want to see whether, as the number of observations increases indefinitely, the posterior distribution is able to recover the parameter. We then need two definitions, i.e. posterior consistency and rate of contraction (both taken from [11]).

Definition 2.1. Let (Θ, d) be a semimetric space. The posterior distribution Π(·|X^n, λ) is said to be (weakly) consistent at θ_0 ∈ Θ if Π(θ : d(θ, θ_0) > ε | X^n) → 0 in P_{θ_0}-probability, as n → ∞, for every ε > 0. The posterior is said to be strongly consistent at θ_0 ∈ Θ if the convergence is in the almost sure sense.

Definition 2.2. Let (Θ, d) be a semimetric space. The posterior distribution Π(·|X^n, λ) is said to contract at rate ε_n → 0 at θ_0 ∈ Θ if Π(θ : d(θ, θ_0) > M_n ε_n | X^n) → 0 in P_{θ_0}-probability, for every M_n → ∞ as n → ∞.

This last definition roughly means that the posterior distribution concentrates around θ_0 on balls of radius of the order ε_n, and the contraction rate can be seen as a natural refinement of consistency [11].
If the model is parametric, it has been shown that the maximum marginal likelihood estimator (MMLE) converges asymptotically to some oracle value λ_0 which maximizes in λ the prior density evaluated at the true value θ_0 of the parameter, π(θ_0|λ_0) = sup{π(θ_0|λ) : λ ∈ Λ}, where the density is with respect to Lebesgue measure [30]. However, it is not possible to extend this to nonparametric models, since, for instance, the prior distributions Π(·|λ), λ ∈ Λ, typically are not absolutely continuous with respect to a fixed measure [29]. In [22] a special case has been studied. They focus on a particular case of the Gaussian white noise model and they consider an infinite-dimensional Gaussian prior with fixed regularity parameter α and scaling hyperparameter τ. They find that for a given base prior and a given true regularity level, there exists an optimal scaling rate. However, EB methods fail to follow the true value if the regularity of the true parameter exceeds an even lower bound. On the other hand, it turns out that the EB approach "works adequately if the base prior does not undersmooth the true parameter" [22]. More general studies are conducted in [29]. They study the asymptotic behaviour of the MMLE and they derive posterior contraction rates. Previously, [31] provided sufficient conditions for deriving general EB posterior contraction rates when it is known


that the MMLE belongs to a chosen subset Λ_0 of Λ. Their result was obtained by basically controlling

$$\sup_{\lambda\in\Lambda_0}\Pi\big(d(\theta, \theta_0) > \varepsilon_n \,\big|\, X^n, \lambda\big).$$

Then in [29], two further important results for the asymptotic properties of the EB methods are given:

• they characterize Λ_0 as

$$\Lambda_0 = \{\lambda : \varepsilon_n(\lambda) \le M_n\,\varepsilon_{n,0}\}$$

for any sequence M_n going to ∞, with ε_{n,0} = inf{ε_n(λ) : λ ∈ Λ} and ε_n satisfying

$$\Pi\big(\|\theta - \theta_0\| \le K\varepsilon_n(\lambda)\,\big|\,\lambda\big) = e^{-n\varepsilon_n^2(\lambda)},$$

with (Θ, ‖·‖) a Banach space and for some large enough constant K;

• they reveal the exact posterior contraction rates for every θ_0 ∈ Θ. Namely, they prove that the concentration rate of the MMLE empirical Bayes posterior distribution is of order O(M_n ε_{n,0}), and they show that the posterior contraction rate is bounded from below by δ_n ε_{n,0}, for arbitrary δ_n = o(1).

Finally, and interestingly, comparing the fully Bayes approach that we have presented at the beginning of this chapter and the EB approach, they showed that empirical Bayes and fully Bayes methods behave similarly. Indeed, the fully Bayes posterior has the same upper and lower bounds on the contraction rate for every θ_0 ∈ Θ.

Let us now go back to the Gaussian process regression model we discussed in chapter 1 and the problem of the existence of multiple local maxima. Here the family of prior distributions is given by the Gaussian processes with a specific covariance function and the hyperparameters are now denoted by θ instead of λ. Then we see that, under certain conditions, for instance θ ∈ Λ_0 for specific M_n and ε_{n,0}, as the number of observations n grows indefinitely, the posterior distribution will concentrate closely around the global maximum, with contraction rate of the order O(M_n ε_{n,0}) and bounded from below by δ_n ε_{n,0}, for arbitrary δ_n = o(1). In particular, as in the case of figure 2.1, if the marginal likelihood, as a function of the GPR hyperparameters, admits two maxima (one local and one global maximum), then as n goes to ∞ there will still exist two local maxima, but the posterior distribution will concentrate around the global maximum; and, as we saw, the local maximum leads to a misspecified model while the global maximum leads to an overfitted model.


3. Regression with a centered Brownian motion prior

So far we have introduced the general Gaussian process regression model, how it can be characterized and how it depends on the hyperparameters. In particular, we have seen that the Gaussian process predictor f(x) relies on the covariance function, which encodes our assumptions about the function we wish to learn and which depends on free parameters, called GPR hyperparameters, that we want to estimate. We then looked at different approaches with which we can estimate them, and finally we introduced one of the major possible issues, the main focus of this thesis: the existence of multiple local maxima in the EB maximization problem.

We now restrict our attention to a specific model, with a specific covariance function, and we are going to study this in more detail. Indeed, the kernel we chose is exactly the kernel of a centered Brownian motion. The reason for this choice is to make the Gaussian process regression model as simple as possible (this kernel does not depend on any other parameters). The chapter will then be focused on the theoretical, and so analytical, study of the posterior density as a function of the hyperparameters. There is going to be a first introductory part, where many things from chapter 1 will be repeated, followed by the main results and their proofs.

3.1. Main results

Let us consider the following model,

Y = µ+ f(X) + ε, (3.1)

where X is the input vector, µ is a constant, f is the function value and Y is the vector of the observed target values. We have assumed that the observed values Y differ from the function values f(X) by a constant µ and additive noise ε, and we will further assume that this noise follows an independent, identically distributed Gaussian distribution with zero mean and variance e^β,

εi ∼ N(0, e^β), for i = 1, . . . , n. (3.2)

Moreover, we need a restriction on f, namely E f(x) = 0 for every x ∈ X. Hence, we assume that f has a Brownian motion prior distribution, with scale parameter α, given by

f ∼ N(0, e^α H), (3.3)


where f := f(X) and H is the covariance kernel matrix of a centered Brownian motion.

The noise assumption (3.2) together with the model (3.1) gives rise to the probability density function, which factorizes over the cases in the training set (because of the independence assumption) to give

$$p(Y|X, f) = \prod_{i=1}^{n} p(y_i|x_i, f(x_i)) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi e^\beta}}\exp\Big(-\frac{(y_i - (f(x_i) + \mu))^2}{2e^\beta}\Big) \qquad (3.4)$$

$$= \frac{1}{(2\pi e^\beta)^{n/2}}\exp\Big(-\frac{1}{2e^\beta}\,|Y - (f(X) + \mu)|^2\Big) = \mathcal{N}(f(X) + \mu,\; e^\beta I). \qquad (3.5)$$

Furthermore, by Bayes's rule (A.4) we have

$$p(f|X, Y) = \frac{p(Y|f, X)\,p(f)}{p(Y|X)}, \qquad (3.6)$$

where the normalizing constant is independent of f and given by

$$p(Y|X) = \int p(Y|X, f)\,dp(f).$$

In particular we have

$$p(f|X, Y) \propto p(Y|f, X)\,p(f). \qquad (3.7)$$

Hence, writing only the terms which depend on f, and completing the square, we obtain from (3.7)

$$p(f|X, Y) \propto \exp\Big(-\frac{1}{2e^\beta}(Y - (\mu + f))^T(Y - (\mu + f))\Big)\exp\Big(-\frac{1}{2e^\alpha}\,f^T H^{-1} f\Big) \propto \exp\Big(-\frac{1}{2}(f - \bar f)^T\Big(\frac{1}{e^\beta} I + \frac{1}{e^\alpha} H^{-1}\Big)(f - \bar f)\Big),$$

where $\bar f = e^{-\beta}\big(e^{-\beta} I + e^{-\alpha} H^{-1}\big)^{-1}(Y - \mu)$, and we recognize the form of the posterior distribution as Gaussian with mean $\bar f$ and covariance matrix A^{-1},

$$p(f|X, Y) = \mathcal{N}\Big(\bar f = \frac{1}{e^\beta} A^{-1}(Y - \mu),\; A^{-1}\Big), \qquad (3.8)$$

where A = e^{-β} I + e^{-α} H^{-1}.
We see that (3.8) depends on the parameters α, β and µ. We can estimate them by maximizing the marginal log-likelihood function, namely with their maximum likelihood estimators (MLE). While for µ this is quite easy (the MLE is given by Ȳ), for the other two parameters it gets more difficult.

Suppose we have a dataset of n observations, D = {(x_i, y_i) : i = 1, . . . , n}. We consider again the following model

yi = µ+ f(xi) + εi, (3.9)

with the same assumptions on f and ε as before. Thus, we have that

(y1, ..., yn) ∼ N (µ,Kh,α,β) (3.10)

has a multivariate normal distribution with mean µ and covariance matrix K_{h,α,β} = e^α H + e^β I, where H is the covariance kernel matrix of a centered Brownian motion, I is the identity matrix and α, β are scalar parameters. Let us replace µ by its MLE µ̂, namely µ̂ = Ȳ = (1/n) ∑_{i=1}^n y_i. We then consider the profile likelihood

L(α, β) := L(α, β, µ̂)

as a function of α and β; the main goal is to find the local maxima of the profile log-likelihood, log L(α, β).

If we take as kernel h of the centered Brownian motion

h(x, x′) = −(1/2) ( ‖x − x′‖ − (1/n) ∑_j ‖x − x_j‖ − (1/n) ∑_i ‖x_i − x′‖ + (1/n²) ∑_i ∑_j ‖x_i − x_j‖ ),   (3.11)

we get the plot of the profile log-likelihood log L(α, β) shown in figure 3.1.

Figure 3.1.: Random log-likelihood for n=5

As we can see, figure 3.1 shows that the function has two ridges, and the tops appear to have straight-line asymptotes. For our main purpose it is therefore interesting to obtain their equations. Note that we use the kernel (3.11) throughout chapters 3 and 4; a small numerical sketch that builds the corresponding kernel matrix and reproduces a plot like figure 3.1 is given below.
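The following Mathematica sketch is illustrative only (it is not the thesis code): it simulates data as in chapter 4, forms the kernel matrix H from (3.11), and plots the profile log-likelihood given in (3.12) below; the helper names h, Hmat and logL are assumptions.

n = 5;
x = (Range[n] - 0.5)/n;                      (* design points used in chapter 4 *)
y = RandomReal[{0, 1}, n];                   (* simulated responses *)
ystar = y - Mean[y];                         (* centered observations y* *)
h[s_, t_] := -(1/2) (Abs[s - t] - Mean[Abs[s - #] & /@ x] -
    Mean[Abs[# - t] & /@ x] + Mean[Flatten@Outer[Abs[#1 - #2] &, x, x]]);
Hmat = Table[h[x[[i]], x[[j]]], {i, n}, {j, n}];   (* kernel matrix of (3.11) *)
logL[alpha_?NumericQ, beta_?NumericQ] :=
  Module[{K = Exp[alpha] Hmat + Exp[beta] IdentityMatrix[n]},
   -(n/2) Log[2 Pi] - (1/2) Log[Det[K]] - (1/2) ystar . LinearSolve[K, ystar]];
Plot3D[logL[alpha, beta], {alpha, -10, 10}, {beta, -10, 10}]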

Let us consider again the model (3.9) and suppose we have n observations. Hence, we get

L(α, β) = (1/√((2π)^n |K_{h,α,β}|)) exp( −(1/2) (Y − Ȳ)^T K_{h,α,β}^{-1} (Y − Ȳ) )   (3.12)


and, using equations (A.6), (A.8) and (A.10),

log L(α, β) = −(n/2) log 2π − (1/2) log|K_{h,α,β}| − (1/2) y*^T K_{h,α,β}^{-1} y*   (3.13)

            = −(n/2) log 2π − (1/2) ∑_i log(e^α λ_i + e^β) − (1/2) ∑_i (a_i^T y*)² / (e^α λ_i + e^β),   (3.14)

where again K_{h,α,β} = e^α H + e^β I, Y = (y_1, . . . , y_n), y* is the vector y* = (y_1 − Ȳ, . . . , y_n − Ȳ), and λ_i and a_i are the eigenvalues and eigenvectors, respectively, of H (see B). As said before, the main goal is to find, if they exist, the local maxima and the global maximum of (3.13). Thus, the first thing to check is the existence of stationary points. Let us consider the general case n > 2 (the case n = 2 is treated later). By (A.11) and (A.12), we get

∂_α log L(α, β) = −(1/2) ∂_α ( log|K_{h,α,β}| ) − (1/2) ∂_α ( y*^T K_{h,α,β}^{-1} y* )

                = −(1/2) Tr( K_{h,α,β}^{-1} ∂_α K_{h,α,β} ) − (1/2) y*^T ( −K_{h,α,β}^{-1} (∂_α K_{h,α,β}) K_{h,α,β}^{-1} ) y*

and

∂_β log L(α, β) = −(1/2) ∂_β ( log|K_{h,α,β}| ) − (1/2) ∂_β ( y*^T K_{h,α,β}^{-1} y* )

                = −(1/2) Tr( K_{h,α,β}^{-1} ∂_β K_{h,α,β} ) − (1/2) y*^T ( −K_{h,α,β}^{-1} (∂_β K_{h,α,β}) K_{h,α,β}^{-1} ) y*.

Furthermore, we have

∂_α K_{h,α,β} = ∂_α (e^α H + e^β I) = e^α H = ∑_i e^α λ_i a_i a_i^T,

and in particular

K_{h,α,β}^{-1} ∂_α K_{h,α,β} = ∑_i ( e^α λ_i / (e^α λ_i + e^β) ) a_i a_i^T.

Thus,

Tr( K_{h,α,β}^{-1} ∂_α K_{h,α,β} ) = ∑_i ( e^α λ_i / (e^α λ_i + e^β) )   (3.15)

and

−K_{h,α,β}^{-1} (∂_α K_{h,α,β}) K_{h,α,β}^{-1} = −∑_i ( e^α λ_i / (e^α λ_i + e^β)² ) a_i a_i^T.   (3.16)


Therefore, from (3.15) and (3.16) we get

∂_α log L(α, β) = −(1/2) Tr( (e^α H + e^β I)^{-1} e^α H ) − (1/2) y*^T ( −(e^α H + e^β I)^{-1} e^α H (e^α H + e^β I)^{-1} ) y*

                = −(e^α/2) Tr( (e^α H + e^β I)^{-1} H ) − (e^α/2) y*^T ( −(e^α H + e^β I)^{-1} H (e^α H + e^β I)^{-1} ) y*;

hence, in particular,

∂_α log L(α, β) = −(1/2) ∑_i ( e^α λ_i / (e^α λ_i + e^β) ) + (1/2) y*^T ∑_i ( e^α λ_i / (e^α λ_i + e^β)² ) a_i a_i^T y*.   (3.17)

Similarly for β, we get

∂_β log L(α, β) = −(1/2) Tr( (e^α H + e^β I)^{-1} e^β I ) − (1/2) y*^T ( −(e^α H + e^β I)^{-1} e^β I (e^α H + e^β I)^{-1} ) y*

                = −(e^β/2) Tr( (e^α H + e^β I)^{-1} ) − (e^β/2) y*^T ( −(e^α H + e^β I)^{-1} (e^α H + e^β I)^{-1} ) y*,

and in particular

∂_β log L(α, β) = −(1/2) ∑_i ( e^β / (e^α λ_i + e^β) ) + (1/2) y*^T ∑_i ( e^β / (e^α λ_i + e^β)² ) a_i a_i^T y*.   (3.18)

If we now set both (3.17) and (3.18) equal to zero, we get for α and β, respectively,

∑_i e^α λ_i / (e^α λ_i + e^β) = ∑_i ( e^α λ_i / (e^α λ_i + e^β)² ) (a_i^T y*)²,   (3.19)

∑_i 1 / (e^α λ_i + e^β) = ∑_i (a_i^T y*)² / (e^α λ_i + e^β)².   (3.20)

Investigating whether (3.19) and (3.20) can hold becomes complex, especially for large n, and establishing analytically whether there are stationary points does not seem possible. The second derivatives are even more complicated, as is checking the negative/positive definiteness of the Hessian matrix. Therefore, to understand the behaviour of (3.13) better, we investigate its asymptotic behaviour. We then give the following proposition.


Proposition 1. Let us consider the regression model given by (3.9), and in particular the profile log-likelihood function (3.13). Then

i) for n = 2, with probability one the function (3.13) goes to −∞ in all directions, except for

   α → −∞, β fixed:   log L(α, β) → −(n/2) log 2π − (n/2) β − (1/2) ∑_{i=1}^n (a_i^T y*)² / e^β,

   β → −∞, α fixed:   log L(α, β) → +∞,

   and, if we look at log L(uγ + c, vγ + d) for u, v, c, d, γ ∈ R with u, v, c, d fixed,

   if (n−1)u + v < 0:   log L(uγ + c, vγ + d) → ∞,
   if (n−1)u + v = 0:   log L(uγ + c, vγ + d) → 0;

ii) for n > 2, with probability one the function (3.13) goes to −∞ in all directions, except for

   as α → −∞, β fixed:   log L(α, β) → −(n/2) log 2π − (n/2) β − (1/2) ∑_i (a_i^T y*)² / e^β,   (3.21)

   as β → −∞, α fixed:   log L(α, β) → −(n/2) log 2π − (n/2) α − (1/2) ∑_i log λ_i − (1/2) ∑_i (a_i^T y*)² / (e^α λ_i).   (3.22)

Moreover, (3.13) admits two asymptotes, whose equations are given by

{(α, β*) : α ∈ R},   (3.23)

{(α*, β) : β ∈ R},   (3.24)

where β* and α* are given by

β* = log( (1/n) ∑_{i=1}^n (a_i^T y*)² ),   (3.25)

α* = log( (1/n) ∑_{i=1}^n (a_i^T y*)² / λ_i ).   (3.26)

What we still want to investigate is whether, and how many, local maxima the function (3.13) has. From the above proposition we see that, with probability one, when n > 2 the function (3.13) goes to −∞ in almost all directions; hence, since it is continuous, it has at least one local maximum, probably on a ridge. One interesting case is n = 6: plotting the function (figure 3.2) we see that it admits at least one stationary point around the cusp. Therefore, combining proposition 1 with this observation, with probability greater than zero the function has at least two local maxima.

The case n = 2 is different: the function can reach ∞. Let us now have a look at this special case. Here things are easier and it is possible to compute directly what we did above for the general case.


Figure 3.2.: Random log-likelihood with n = 6

3.1.1. The case n=2

Let us consider the case with n = 2. Hence for i = 1, 2 we have the following

(y1, y2) ∼ N2(µ,Kh,α,β). (3.27)

Again, let us consider the profile likelihood L(α, β) := L(α, β, µ̂), where µ̂ is the MLE of µ, namely µ̂ = Ȳ = (1/2) ∑_{i=1}^2 y_i. The profile log-likelihood, as a function of α and β, is

log L(α, β) = −log 2π − (1/2) log|K_{h,α,β}| − (1/2) (y_1 − Ȳ, y_2 − Ȳ) K_{h,α,β}^{-1} (y_1 − Ȳ, y_2 − Ȳ)^T,   (3.28)

and a plot of it is shown in figure 3.3.

Figure 3.3.: Random log-likelihood for n = 2

In this case it is easy to directly compute the profile log-likelihood, indeed we have

log L(α, β) = −(1/2) ( log( e^β (e^α + 4e^β) ) + 2 log π ) − (y_1* − y_2*)² / (e^α + 4e^β),   (3.29)

where y_i* = y_i − Ȳ for i = 1, 2, and analytically the function is easier to study. Let us give the following proposition.


Proposition 2. Let

y_i = µ + f(x_i) + ε_i,

for i = 1, 2, where µ is a constant, f is a realization of a centered Brownian motion with scale parameter e^α and ε_i ∼ N(0, e^β). The function log L(α, β) given by (3.29) admits no stationary point. Furthermore, we have

i) (3.29) has two ridges, whose equations are given by

(α, β_α, log L(α, β_α)),

(α_β, β, log L(α_β, β)),

where

β_α = log( (1/16) ( −3e^α ± √(e^{2α} − 12e^α (y_1* − y_2*)² + 4(y_1* − y_2*)^4) + 2(y_1* − y_2*)² ) ),

α_β = log( 2(y_1* − y_2*)² − 4e^β ),

with y_i* = y_i − Ȳ for i = 1, 2.

ii) The above ridges have asymptotes, whose equations are given by

β = log( (y_1* − y_2*)² / 4 ), for the ridge β_α;

β = log( (y_1* − y_2*)² / 2 ) and α = log( 2(y_1* − y_2*)² ), for the ridge α_β.

iii) (3.29) is concave for the values of α and β such that

eα + 4eβ > 2(y∗1 − y∗2)2. (3.30)


Figure 3.4.: Random log-likelihood for n = 2; the blue region is where the function is concave, α, β ∈ [−10, 10]
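Before turning to the proofs, the absence of stationary points claimed in proposition 2 can also be checked mechanically. The following Mathematica sketch (an illustration, not part of the thesis code) writes the two stationarity conditions obtained from the derivatives of (3.29) in the variables u = e^α, v = e^β and a fixed example value d of (y_1* − y_2*)², and asks Reduce for a common solution.

d = 1/4;   (* an example value of (y1* - y2*)^2; any positive value behaves the same *)
(* numerators of the alpha- and beta-derivatives of (3.29), with u = e^alpha, v = e^beta *)
Reduce[-u (u + 4 v - 2 d) == 0 &&
       8 v d - (u + 4 v) (u + 8 v) == 0 &&
       u > 0 && v > 0, {u, v}]
(* returns False: the two gradient components never vanish simultaneously *)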

3.2. Proofs

Proof of proposition 1. We first show ii); i) will then follow. We have (see B)

log L(α, β) = −(n/2) log 2π − (1/2) ∑_i log(e^α λ_i + e^β) − (1/2) ∑_i (a_i^T y*)² / (e^α λ_i + e^β).

The expression is problematic only when e^α λ_i + e^β ≤ 0, as the logarithm is then not defined. Since the λ_i, i = 1, . . . , n, are eigenvalues of the covariance kernel matrix of a centered Brownian motion, they are all nonnegative, and hence e^α λ_i + e^β > 0. We have

• as α→∞, β fixed, logL(α, β)→ −∞;

• as α → −∞, β fixed,

  log L(α, β) → −(n/2) log 2π − (n/2) β − (1/2) ∑_i (a_i^T y*)² / e^β =: l* > −∞;   (3.31)

• as β →∞, α fixed, logL(α, β)→ −∞;

• as β → −∞, α fixed,

  log L(α, β) → −(n/2) log 2π − (n/2) α − (1/2) ∑_i log λ_i − (1/2) ∑_i (a_i^T y*)² / (e^α λ_i) =: l** > −∞.   (3.32)


Now let us have a look at log L(uγ + c, vγ + d), where γ, u, v, c, d ∈ R and u, v, c, d are fixed. Hence

log L(uγ + c, vγ + d) = −(n/2) log 2π − (1/2) ∑_i log(e^{uγ+c} λ_i + e^{vγ+d}) − (1/2) ∑_i (a_i^T y*)² / (e^{uγ+c} λ_i + e^{vγ+d}).

Therefore,

• suppose u, v > 0 fixed; then

  lim_{γ→∞} log L(uγ + c, vγ + d) = −∞,   lim_{γ→−∞} log L(uγ + c, vγ + d) = −∞;

• suppose u > 0 and v < 0 fixed; then

  lim_{γ→∞} log L(uγ + c, vγ + d) = −∞,   lim_{γ→−∞} log L(uγ + c, vγ + d) = −∞.

Thus, we see that log L(α, β) goes to −∞ in all directions except (3.31) and (3.32). Now, let us suppose that there exists an i such that λ_i = 0; without loss of generality we can assume that λ_1 = 0. Then

log L(α, β) = −(n/2) log 2π − (1/2) ∑_{i≠1} log(e^α λ_i + e^β) − β/2 − (1/2) ∑_{i≠1} (a_i^T y*)² / (e^α λ_i + e^β) − (1/2) (a_1^T y*)² / e^β.

Thus,

• as α→∞, β fixed, logL(α, β)→ −∞;

• as α → −∞, β fixed,  log L(α, β) → −(n/2) log 2π − (n/2) β − (1/2) ∑_i (a_i^T y*)² / e^β > −∞;

• as β →∞, α fixed, logL(α, β)→ −∞;

• as β → −∞, α fixed,

  if (a_1^T y*)² = 0:  log L(α, β) → +∞,
  if (a_1^T y*)² ≠ 0:  log L(α, β) → −∞.

Again, if we look at logL(uγ + c, vγ + d), we get

• if u, v > 0 as γ →∞ logL(uγ + c, vγ + d)→ −∞;

• if u, v < 0 as γ →∞ logL(uγ + c, vγ + d)→ −∞;

• if u < 0, v > 0 as γ →∞ logL(uγ + c, vγ + d)→ −∞;


• if u > 0, v < 0 and (a_1^T y*)² ≠ 0, as γ → ∞, log L(uγ + c, vγ + d) → −∞;

• if u > 0, v < 0 and (a_1^T y*)² = 0, as γ → ∞,

  if (n−1)u + v > 0:  log L(uγ + c, vγ + d) → −∞,
  if (n−1)u + v < 0:  log L(uγ + c, vγ + d) → ∞,
  if (n−1)u + v = 0:  log L(uγ + c, vγ + d) → 0.

Therefore, we see what changes in the asymptotic behaviour of log L(α, β) if at least one eigenvalue of H is zero. In particular, if n > 2 then P(a_i^T y* = 0) = 0 for every i = 1, . . . , n, since a_i^T y* has a continuous distribution. Thus for n > 2 the function goes to −∞ in all directions except (3.31) and (3.32), yielding ii). For n = 2, on the other hand, we see from equation (3.29) that with probability greater than zero there exists an i such that a_i^T y* = 0, yielding i).

To complete the proof we only need to find the equations of the asymptotes. Looking at (3.31) and (3.32), we can maximize them with respect to β and α, respectively. We get

∂_β l* = −n/2 + (1/2) ∑_{i=1}^n (a_i^T y*)² / e^β,

∂_α l** = −n/2 + (1/2) ∑_{i=1}^n (a_i^T y*)² / (λ_i e^α),

which are equal to zero for, respectively,

β* = log( (1/n) ∑_{i=1}^n (a_i^T y*)² ),   (3.33)

α* = log( (1/n) ∑_{i=1}^n (a_i^T y*)² / λ_i ).   (3.34)

These two values define the asymptotes of the function log L(α, β), whose equations are given by

{(α, β*) : α ∈ R}   (3.35)

and

{(α*, β) : β ∈ R},   (3.36)

which finishes our proof.

Proof of proposition 2. First, we need to show that (3.29) has no stationary points. For n = 2, we can compute the derivatives directly, yielding

∂_α log L(α, β) = − e^α ( e^α + 4e^β − 2(y_1* − y_2*)² ) / ( 2(e^α + 4e^β)² ),

∂_β log L(α, β) = ( 8e^β (y_1* − y_2*)² − (e^α + 4e^β)(e^α + 8e^β) ) / ( 2(e^α + 4e^β)² ).

If we set both derivatives equal to zero, it is easy to see that there is no solution.

Let us now prove i) and ii). First, let us define

β_α := argmax_β log L(α, β)

     = argmax_β ( −(1/2) ( log( e^β (e^α + 4e^β) ) + 2 log π ) − (y_1* − y_2*)² / (e^α + 4e^β) ).

If we compute the derivative with respect to β and we set this equal to zero, we get

β_{α1} = log( (1/16) ( −3e^α + √(e^{2α} − 12e^α (y_1* − y_2*)² + 4(y_1* − y_2*)^4) + 2(y_1* − y_2*)² ) ),

β_{α2} = log( (1/16) ( −3e^α − √(e^{2α} − 12e^α (y_1* − y_2*)² + 4(y_1* − y_2*)^4) + 2(y_1* − y_2*)² ) ).

Both are well defined for α ≤ log( (6 − 4√2)(y_1* − y_2*)² ). If we let α → −∞, we see that β_{α1} has a horizontal asymptote given by β = log( (y_1* − y_2*)² / 4 ).

Secondly, let us define

α_β := argmax_α log L(α, β)

     = argmax_α ( −(1/2) ( log( e^β (e^α + 4e^β) ) + 2 log π ) − (y_1* − y_2*)² / (e^α + 4e^β) ).

If we compute the derivative with respect to α and we set this equal to zero, we get

α_β = log( 2(y_1* − y_2*)² − 4e^β ),

which is well defined for β < log( (1/2)(y_1* − y_2*)² ) and has a vertical asymptote and a horizontal asymptote, whose equations are given, respectively, by

β = log( (1/2)(y_1* − y_2*)² ),

α = log( 2(y_1* − y_2*)² ),

the latter obtained by letting β → −∞.

Finally, let us prove the last point, iii). A function is concave when its Hessian matrix is negative semi-definite, i.e. when x^T (∇² log L) x ≤ 0 for every vector x, or equivalently when all the eigenvalues of the Hessian are nonpositive. In this case, when n = 2, it is possible to directly compute the eigenvalues φ_i, i = 1, 2, of the Hessian matrix. Indeed we have

φ_1 = [ −(y_1* − y_2*)² (e^α − 4e^β)² − ( (y_1* − y_2*)^4 (224e^{2(α+β)} + e^{4α} + 256e^{4β}) − 128(y_1* − y_2*)² e^{2(α+β)} (e^α + 4e^β) + 16e^{2(α+β)} (e^α + 4e^β)² )^{1/2} − 4e^{α+β} (e^α + 4e^β) ] / ( 2(e^α + 4e^β)³ ),

φ_2 = [ −(y_1* − y_2*)² (e^α − 4e^β)² + ( (y_1* − y_2*)^4 (224e^{2(α+β)} + e^{4α} + 256e^{4β}) − 128(y_1* − y_2*)² e^{2(α+β)} (e^α + 4e^β) + 16e^{2(α+β)} (e^α + 4e^β)² )^{1/2} − 4e^{α+β} (e^α + 4e^β) ] / ( 2(e^α + 4e^β)³ ).

We see that φ_1 is always negative, while φ_2 can be negative only for specific values of α and β. In particular, φ_2 < 0 if and only if e^α + 4e^β > 2(y_1* − y_2*)². Indeed, the first and third terms in the numerator of φ_2 are negative and the second (the square root) is positive, so φ_2 is negative if and only if the sum of the two negative terms exceeds the positive one in absolute value, namely if and only if

(y_1* − y_2*)^4 (e^α − 4e^β)^4 + 8(y_1* − y_2*)² (e^α − 4e^β)² e^{α+β} (e^α + 4e^β) + 16e^{2α+2β} (e^α + 4e^β)²
  > (y_1* − y_2*)^4 (224e^{2α+2β} + e^{4α} + 256e^{4β}) − 128(y_1* − y_2*)² e^{2α+2β} (e^α + 4e^β) + 16e^{2α+2β} (e^α + 4e^β)²,

if and only if

(y_1* − y_2*)^4 (e^α − 4e^β)^4 + 8(y_1* − y_2*)² (e^α − 4e^β)² e^{α+β} (e^α + 4e^β)
  > (y_1* − y_2*)^4 (224e^{2α+2β} + e^{4α} + 256e^{4β}) − 128(y_1* − y_2*)² e^{2α+2β} (e^α + 4e^β),

if and only if

−16(y_1* − y_2*)^4 e^{α+β} (e^α + 4e^β)² + 8(y_1* − y_2*)² e^{α+β} (e^α + 4e^β)³ > 0,

if and only if

8(y_1* − y_2*)² e^{α+β} (e^α + 4e^β)² ( −2(y_1* − y_2*)² + e^α + 4e^β ) > 0,

if and only if

e^α + 4e^β > 2(y_1* − y_2*)².

Therefore, for e^α + 4e^β > 2(y_1* − y_2*)² the Hessian matrix has two negative eigenvalues, hence it is negative definite; in particular the function log L(α, β) is concave there, and this concludes our proof.


4. Experimental Results

In the previous chapter we obtained theoretical results, but it is still not possible to solve our main problem, namely whether the function log L(α, β) given by (3.13) has one or multiple local maxima. We could not even determine the existence of stationary points for n > 2. Since this optimization problem cannot be solved with the classical tools of mathematical analysis, which involve the gradient and the Hessian matrix (namely, analytical procedures), we turn to numerical analysis.

Our goal in this chapter is to verify the existence of multiple local maxima and of a global maximum of the function log L(α, β) and, most of all, to find a systematic method that can detect the local optima and the global optimum of the function, if possible. Moreover, we want to run random simulations and estimate the probability that log L(α, β) has one or more local maxima. To do this, we use the software Wolfram Mathematica 11.2, which implements several maximization methods.

4.1. Methods

The Wolfram Language has a collection of commands that perform unconstrained optimization, e.g. FindMaximum and NMaximize. What we used most is the command FindMaximum, which uses numerical methods to try to find local maxima of the objective function f. It does so by searching from some initial value, call it x_0, generating a sequence of iterates {x_k}_{k=0}^∞ that terminates when a solution point has been approximated with sufficient accuracy or when it seems that no more progress can be made. In deciding how to move from x_k to x_{k+1}, the algorithm uses information from the previous iterates and f_k := f(x_k), so that every step increases the height of the function. In contrast, NMaximize tries harder to find the global maximum, but it has much more work to do and thus generally takes much more time than FindMaximum, which is therefore often preferred even for global optimization problems.

When using FindMaximum, if we do not specify any method, Mathematica uses the quasi-Newton method, unless the problem is structurally a sum of squares, in which case the Levenberg-Marquardt variant of the Gauss-Newton method is used, or, if we give two starting values in each variable, the Principal Axis method is used. Mathematica offers four essentially different methods (we exclude Levenberg-Marquardt since our function is not a sum of squares), namely

• Newton method: it uses "the exact Hessian matrix or a finite difference approximation if the symbolic derivative cannot be computed" [6, 9].

• Quasi-Newton method: it uses "the quasi-Newton BFGS approximation to the Hessian built up by updates based on past steps" [5, 6, 9].

• Conjugate Gradient: it is "a non-linear version of the conjugate gradient method for solving linear systems" [6, 9].

• Principal Axis: it "works without using any derivatives, not even the gradient, but by keeping track of the values from past steps. It requires two starting conditions in each variable" [7, 9].

Thus, we chose the method that works for all n and for different initial points. It turned out that the only method working every time is the nonlinear Conjugate Gradient method, or the Gradient method, which is simply an older version of the former (see B for a complete description of the algorithm); a minimal call is sketched below. Note that, as a numerical algorithm converges, it is necessary to keep track of the convergence and to judge whether the maximum point has been approached closely enough. Since we are dealing with bivariate functions, this depends on the sequence of steps taken and on the values of the function, its gradient and its Hessian at these points.
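For concreteness, a minimal FindMaximum call with the method fixed to the nonlinear conjugate gradient looks as follows (a sketch only; logL and the starting values a0, b0 are assumptions, e.g. the helpers from the sketch in section 3.1):

(* maximize the profile log-likelihood from a given starting point,
   forcing the nonlinear conjugate gradient method *)
FindMaximum[logL[alpha, beta], {{alpha, a0}, {beta, b0}},
  Method -> "ConjugateGradient"]
(* returns {maximum value, {alpha -> ..., beta -> ...}} *)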

In general there are two fundamental strategies for moving from the current point x_k to the new one x_{k+1}: the line search method and the trust region method. In the following we consider algorithms that try to minimize the objective function; in our case we want to maximize, but the adaptation follows easily. The following parts are taken from [6].

4.1.1. The line search method

The line search method chooses a direction p_k and searches along this direction from the current iterate x_k for a new iterate x_{k+1} with a lower function value. The distance α to move along p_k can be found by approximately solving the following minimization problem:

min_{α>0} f(x_k + α p_k).   (4.1)

Solving (4.1) exactly is usually not needed; the line search algorithm generates a limited number of trial step lengths until it finds one that approximates the minimum of (4.1). At x_{k+1} a new search direction p_{k+1} and step length α_{k+1} are computed, and the process is repeated. When choosing the direction p_k in the line search approach, the steepest descent direction −∇f_k is the most obvious one, as it is the direction along which f decreases most rapidly when moving from x_k [6]; this yields the steepest descent method (which is, however, slow on difficult problems). Another important search direction is the Newton direction, which is derived from the second-order Taylor series approximation to f(x_k + p),

f(x_k + p) ≈ f_k + p^T ∇f_k + (1/2) p^T ∇²f_k p =: m_k(p).   (4.2)


Assuming that ∇²f_k is positive definite, we obtain the Newton direction by finding the vector p which minimizes m_k(p), namely

pk = −(∇2fk)−1∇fk. (4.3)

Usually methods that use the Newton direction have a quadratic rate of convergence. An alternative is the quasi-Newton search direction, which does not require the computation of the Hessian matrix [6].
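To make the line search idea concrete, here is a minimal backtracking (Armijo, i.e. sufficient-decrease) line search sketch in Mathematica; it is illustrative only and not the rule used internally by FindMaximum. The arguments f, grad, xk and pk (a descent direction) are assumptions.

(* shrink the trial step until the sufficient-decrease condition
   f(xk + a pk) <= f(xk) + c a grad(xk).pk holds *)
backtrack[f_, grad_, xk_, pk_, a0_:1., rho_:0.5, c_:10.^-4] :=
  Module[{a = a0},
   While[f[xk + a pk] > f[xk] + c a grad[xk] . pk && a > 10.^-12, a *= rho];
   a]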

4.1.2. The trust region method

In the second method, known as trust region, the information about the function f is used to build a model function m_k that behaves similarly to f near the current point x_k. The algorithm then restricts the search for a minimizer of m_k to some region around x_k. In other words, the candidate step p is found by approximately solving the following sub-problem:

min_p m_k(x_k + p),   where x_k + p lies inside the trust region.   (4.4)

If the candidate solution does not sufficiently decrease f, the trust region was too large, so it is shrunk and (4.4) is solved again. Usually the trust region is a ball defined by ||p||_2 ≤ ∆, where the scalar ∆ > 0 is called the trust-region radius. The model m_k in (4.4) is usually the quadratic function

m_k(x_k + p) = f_k + p^T ∇f_k + (1/2) p^T B_k p,   (4.5)

where f_k, ∇f_k, and B_k are a scalar, a vector, and a matrix, respectively. The matrix B_k is either the Hessian ∇²f_k or some approximation of it. If we set B_k = 0 in (4.5) and define the trust region using the Euclidean norm, the trust region problem (4.4) becomes

min_p f_k + p^T ∇f_k   subject to ||p||_2 ≤ ∆_k,

and the solution is given by

p_k = −∆_k ∇f_k / ||∇f_k||.

This is simply the steepest descent step in which the step length is determined by the trust-region radius. If we choose B_k to be the exact Hessian matrix, we obtain the trust-region Newton method, which has been shown to be highly effective in practice [6].
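The B_k = 0 case just described can be written down explicitly; the following one-liner (an illustration only, not the method Mathematica uses for our problem) returns the steepest-descent step clipped to the trust-region radius:

(* steepest-descent step of length delta (the trust-region radius), i.e. the Bk = 0 solution of (4.4) *)
trustRegionStep[gradfk_, delta_] := -delta gradfk/Norm[gradfk]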

Therefore, we see that the line search and trust region strategies differ in the order in which they choose the direction and the distance of the move to the next iterate.


4.2. Setting the problem

Let us consider again the model (3.1) of chapter 3 and the profile likelihood obtained in (3.12),

L(α, β) = (1/√((2π)^n |K_{h,α,β}|)) exp( −(1/2) (Y − Ȳ)^T K_{h,α,β}^{-1} (Y − Ȳ) ).

By propositions 1 and 2 we have

• for n = 2, with probability one the function reaches ∞, and it has two ridges, whose equations were found in proposition 2;

• for n > 2, with probability one the function goes to −∞ in all directions, and hence we may expect it to have one, and possibly more, local maxima.

Our goal is to find a systematic method that can detect all the local maxima and the global maximum of the function, taking random or real data as input. Thus, the first thing to decide, once the method is chosen, is which initial points to give to the optimisation algorithm. It makes sense to give points where the Hessian matrix is negative definite, hence where the marginal likelihood is concave, and which are relatively close to a point of maximum. Therefore, as we expect the local maxima to lie on the two ridges and/or around the cusp, we use the following three initial points:

• (α*, β*), the intersection of the two asymptotes of log L(α, β) found in the previous chapter (proposition 1),

  (α*, β*) = ( log( (1/n) ∑_{i=1}^n (a_i^T y*)² / λ_i ), log( (1/n) ∑_{i=1}^n (a_i^T y*)² ) );   (4.6)

• (α*, β*_{α*}), where β*_{α*} is defined by

  β*_{α*} := argmax_β log L(α*, β);   (4.7)

• (α*_{β*}, β*), where α*_{β*} is defined by

  α*_{β*} := argmax_α log L(α, β*).   (4.8)

We see that (α*, β*) can be found analytically, while this is not possible for the other two points, (α*, β*_{α*}) and (α*_{β*}, β*). Thus, again, in this situation we use the command FindMaximum; the whole procedure is sketched below.
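The following Mathematica sketch puts the pieces together for a single simulated dataset: it computes the eigendecomposition of H, forms (α*, β*) from (4.6), obtains the other two starting points by one-dimensional maximization, and then runs FindMaximum from all three starts with the conjugate gradient method. It is a reconstruction under stated assumptions, not the thesis code; it reuses the names Hmat, ystar and logL from the sketch in section 3.1, and it drops a numerically zero eigenvalue of H so that (4.6) remains well defined.

{lambda, a} = Eigensystem[Hmat];        (* eigenvalues and (orthonormal) eigenvectors of H, as rows *)
proj = (a . ystar)^2;                   (* the quantities (a_i^T y*)^2 *)
n0 = Length[ystar];
keep = Flatten@Position[lambda, _?(# > 10.^-10 &)];     (* guard: drop the numerically zero eigenvalue *)
betaStar  = Log[Total[proj]/n0];                        (* beta* from (3.25) *)
alphaStar = Log[Total[proj[[keep]]/lambda[[keep]]]/n0]; (* alpha* from (3.26) *)
(* the other two starting points, found by one-dimensional maximization *)
betaStarAlpha = beta  /. Last@FindMaximum[logL[alphaStar, beta], {beta, betaStar}];
alphaStarBeta = alpha /. Last@FindMaximum[logL[alpha, betaStar], {alpha, alphaStar}];
starts = {{alphaStar, betaStar}, {alphaStar, betaStarAlpha}, {alphaStarBeta, betaStar}};
(* local maxima found from the three starting points *)
maxima = FindMaximum[logL[alpha, beta], {{alpha, #[[1]]}, {beta, #[[2]]}},
    Method -> "ConjugateGradient"] & /@ starts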

For i = 1, . . . , n, we simulate y_i randomly as real values between 0 and 1, and set x_i = (Range[n] − 0.5)/n, where in Mathematica Range[n] generates the list of integers from 1 to n. Let us first have a look at the plots (with some possible variations) of the two profile functions listed below:


Figure 4.1.: Random plots of log L(α*, β) (panels (a) and (b))

Figure 4.2.: Random plots of log L(α, β*) (panels (a) and (b))

• logL(α∗, β) in figure 4.1

• logL(α, β∗) in figure 4.2

As can be seen in figure 4.1, the evaluation of the function log L(α*, β) is not stable beyond a certain value of β. Thus, FindMaximum is not always able to find the point of maximum exactly, and it often returns a value together with a warning (e.g. "machine precision is insufficient to achieve the requested accuracy or precision"). In figure 4.2 the situation is different. In the first panel, 4.2a, there is no maximum: the function just increases until it appears to remain constant beyond a certain value of α, and FindMaximum returns different values for different methods, depending on their convergence. In the second panel, 4.2b, the function has a maximum, whose value is returned exactly.

4.3. Results: number of local maxima and location of the global maximum

We ran 150 simulations for each number of observations n = 3, . . . , 10. As said before, we expected the function to have at most three local maxima: two on the ridges and one around the cusp. Let us call the α-ridge the ridge where β = β* and α varies, and define the β-ridge analogously (α = α* and β varies). Surprisingly, the function shows at most two local maxima, one around the cusp and one on or around the β-ridge.


Moreover, the global maximum is always on or around the β-ridge. Here we compare the results obtained for different numbers of observations n.

Estimated probability of having at least 1 maximum:
n = 3: 1;  n = 4: 1;  n = 5: 1;  n = 6: 1;  n = 7: 1;  n = 8: 1;  n = 9: 1;  n = 10: 1.

Estimated probability of having at least 2 local maxima:
n = 3: 0;  n = 4: 0;  n = 5: 0.06;  n = 6: 0.12;  n = 7: 0.22;  n = 8: 0.187;  n = 9: 0.193;  n = 10: 0.293.

Estimated probability of having the global maximum on the β-ridge:
n = 3: 0.76;  n = 4: 0.527;  n = 5: 0.427;  n = 6: 0.36;  n = 7: 0.307;  n = 8: 0.367;  n = 9: 0.24;  n = 10: 0.427.

Figure 4.3. Figure 4.4.

Note that the estimated probabilities in the above tables are computed as fractions of the 150 runs; a sketch of how such an estimate is obtained is given below.
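A minimal version of such a Monte Carlo estimate is sketched here (an illustration under assumptions, not the thesis code): it reuses the data-generation and optimisation steps from the sketches in sections 3.1 and 4.2, wrapped in a hypothetical helper findMaxima[y] that is assumed to return the list of FindMaximum results from the three starting points, and counts the fraction of runs in which at least two numerically distinct local maxima appear.

runs = 150;
atLeastTwo = 0;
Do[
  y = RandomReal[{0, 1}, n];                          (* fresh simulated responses *)
  locs = {alpha, beta} /. (Last /@ findMaxima[y]);    (* locations of the maxima from the 3 starts *)
  If[Length@DeleteDuplicates[locs, Norm[#1 - #2] < 10.^-3 &] >= 2, atLeastTwo++],
  {runs}];
N[atLeastTwo/runs]                                    (* estimated probability of >= 2 local maxima *)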

We first see from the tables, and then from the graphs below, that the probability of log L(α, β) having at least one maximum is always one. This result should not surprise us: in the previous chapter we showed that for n ≥ 3 the function goes to −∞ in almost all directions, so it has at least one maximum with probability one (since it is continuous). More interesting is that the probability of having at least two local maxima is not constant. For n = 3, 4 the estimated probability is zero, while for n > 4 it increases, reaching its largest value for n = 10. It seems, then, that as n gets bigger, so does the probability, even if n = 7 is an exception (its value is larger than the values for n = 8, 9); this is probably due to the limited number of 150 simulations, so the probability estimates may not be very precise.


Regarding the location of the two maxima, the global maximum is always on or around the β-ridge, usually at (α*, β*_{α*}), while the local optimum is around the cusp. However, it is not always possible to determine the locations with sufficient accuracy and precision.

Overall we can observe the following three different behaviours of log L(α, β). First, the function may have at least two local maxima: one is visible around the cusp, while the other, found by Mathematica, is on or around the β-ridge (figure 4.6, where 4.6a is the plot of log L(α, β) over a certain domain and 4.6b is the same plot zoomed in). Second, in figure 4.7 (where again the second panel is the same function zoomed in), there is no maximum around the cusp: the function proceeds smoothly and actually seems to decrease until it becomes flat on the α-ridge, so it appears to have only one point of maximum. This is, finally, different from the behaviour in figure 4.8, where, even though the function does not show any maximum around the cusp, it increases when passing from the cusp to the α-ridge, but still shows only one maximum.

It is interesting to compare these results with those of the previous chapter. In particular, we can look at log L(−∞, β*) and log L(α*, −∞), given by (3.21) and (3.22), respectively. We computed them in every simulation, and log L(α*, −∞) is always smaller than the maximum found by the program. This suggests that what we found is indeed a global maximum, as the function then decreases until it becomes constant for fixed α as β goes to −∞. By contrast, log L(−∞, β*) has the same value as the function takes on the α-ridge. It seems, then, that on the α-ridge there is very quick convergence to a constant value, and probably the maximum is log L(−∞, β*), even though we cannot see this numerically (the ridge looks completely flat).

Overall, we can conclude that there is no evidence of a third local maximum, in particular not on the α-ridge, as we thought at the beginning. Indeed, the function shows at most two local maxima, of which the global maximum is on (or around) the β-ridge.

4.4. Comments and problems

It is not always possible to give a uniform interpretation of a result, especially when it does not lead to a unique final conclusion. We asked ourselves: "How many (local and global) maxima are there in the marginal likelihood for Brownian motion regression with independent, identically distributed normal errors?" The results suggest the following. The function does not always have a unique point of maximum; it can have at most two local maxima, with a probability that increases as the number of observations n increases.

Note that our study is limited by several practical considerations.


Figure 4.5.: Example of log L(α, β) with n = 10; panels: (a) β ∈ (−80, 10), (b) β ∈ (−80, −60), (c) β ∈ (−100, −60)

First of all, we only ran 150 simulations; to obtain more accurate values of the estimated probabilities, it would be better to increase the number of runs to at least 1000 for each number of observations n = 3, . . . , 10. Furthermore, the number of observations n itself should be increased, in order to obtain broader results. Finally, while implementing the algorithm that led to these results, we encountered two different numerical problems in the evaluation of the function. First, it is not possible to detect and locate a maximum on the α-ridge: Mathematica returns a point, but it is not a point of maximum; the ridge there actually remains flat. Second, it is difficult to detect and locate the maximum on the β-ridge, namely the global maximum. In particular, for β < −70 the function becomes more and more bumpy and, as a consequence, FindMaximum is not always able to identify a unique point of maximum (figure 4.5). It therefore often happened that we obtained more than one candidate maximum around the β-ridge, with extremely large values of the order of 10^10 or bigger. We did not consider these as maxima, and this is also the reason why it is not possible to give a more exact location of the local maxima.

Nonetheless, we obtained interesting results and, most of all, given the initial points mentioned in the previous section, it is always possible to find the local optima of the function. Therefore, we found what seems to be an automatic method to find a number of (or possibly all) the local maxima of the marginal likelihood. Finally, we were not able to establish the existence of a third local maximum, as it was never possible to detect one and there was not enough evidence for it.


Figure 4.6.: Example of log L(α, β) with n = 10 having at least two local maxima (panels (a) and (b))

Figure 4.7.: Example of log L(α, β) with n = 6 having at least one maximum (panels (a) and (b))

Figure 4.8.: Example of log L(α, β) with n = 6 (panels (a) and (b))


5. Conclusion

This thesis concerns the theoretical and numerical study of the marginal likelihood as a function of the GPR hyperparameters. In particular, the main focus is the analysis of a potential major problem which has received little attention in the literature, namely the existence of multiple local maxima. The work naturally divides into two different but related parts, which we now review.

The first part, consisting of chapters 1 and 2, serves mainly to give a general theoretical framework for Gaussian process regression and for the estimation of the GPR hyperparameters. Starting from the background required to deal with Gaussian process regression, we described how it is structured, its components, its scope, and how it works, focusing especially on Gaussian processes, the covariance function and its importance, and the related hyperparameters. These last two elements are especially needed in the second chapter, where the problem of estimating the GPR hyperparameters is described. Two approaches are explained, the fully Bayes and the empirical Bayes methods. We subsequently adopted the second, in which the hyperparameters are estimated by maximum marginal likelihood. A major potential problem is then introduced, the existence of multiple local maxima, and therefore the risk of finding a local optimum instead of the global maximum. Finally, some asymptotic properties of EB methods are presented.

The second part of this thesis is given by chapters 3 and 4. Here the attention is restricted to a specific GPR model, which depends on only two hyperparameters: the error variance and the scale parameter of the Gaussian process. This part can again be divided into two related pieces. A theoretical study of the marginal likelihood is carried out in chapter 3. We looked at the possible stationary points of the function and analysed its asymptotic behaviour, obtaining interesting results. We first analysed the easier case of n = 2 observations: there the function has no stationary points and, moreover, goes to ∞ with probability one. The situation is different for n > 2. It was not possible to detect stationary points analytically, but, more importantly and differently from the previous case, the function goes to −∞ in all directions, and so, since it is continuous, it admits at least one point of maximum with probability one. Moreover, important properties of the ridges (whose equations were found for the case n = 2) and of the asymptotes of the function were derived. Issues which were theoretically intractable are analysed in chapter 4. Using the software Wolfram Mathematica 11.2, we implemented the nonlinear conjugate gradient optimisation algorithm to find the local maxima of the marginal likelihood, and we ran simulations in order to estimate the probability of having one or more local maxima. The results showed that for n > 2 the function has, with probability one, a point of maximum, as obtained previously and now confirmed numerically, and, more interestingly, that as n gets bigger the estimated probability of having at least two local maxima also gets bigger. Furthermore, the global maximum was always found on or around the β-ridge of the function. Finally and most importantly, we found what seems to be an automatic method to detect a number of (or possibly all) the local maxima of the marginal likelihood. Indeed, given as initial points for the optimisation algorithm (in this case the conjugate gradient algorithm) the points (α*, β*), (α*, β*_{α*}) and (α*_{β*}, β*) found in chapter 3, we are able to find (possibly all of) the maxima of the function located around the cusp and on or around the β-ridge.

Overall, even in this very simple GPR model, this work shows that the marginal likelihood is affected by the problem of multiple local maxima. However, whereas at the beginning we expected the function to have at most three local maxima, we actually found no evidence of a third maximum; in fact, the function showed at most two local maxima, and only for n > 4 observations. Together with the increase of the estimated probability of having two local maxima as n gets bigger, this suggests that, as the number of observations n increases, we will still have two local maxima. However, as we saw at the end of chapter 2, as n grows indefinitely the posterior will concentrate around the global maximum, which in this case lies on or around the β-ridge.

In conclusion, this study could help researchers and programmers find the global maximum of the marginal likelihood, so as to retrieve the most useful information from the data. Note, however, that the study in this thesis is far from comprehensive. Future work could consider a wider range of covariance functions and related hyperparameters, more complex and real data, and a larger number of observations n. Moreover, since we faced several numerical problems, other software and more efficient algorithms and methods could be used (see e.g. [21], or the EM algorithm in [20]).


A. Appendix A

A.1. Bayes’s rule

Let us suppose we have data x modelled as a realisation of a random variable X. A statistical model is the set of all possible probability distributions P_θ of X, indexed by a parameter θ taking values in the parameter space Θ. In a Bayesian framework θ is itself considered to be a realisation of a random variable, say ϑ. We can then formally define X and ϑ as measurable maps on a probability space, with values in measurable spaces (X, F) and (Θ, H). The distribution P_θ is then considered a conditional distribution of X given ϑ, and (X, ϑ) has joint probability distribution given by

P(X ∈ A, ϑ ∈ B) = ∫_B P_θ(A) dΠ(θ),   (A.1)

for every A ∈ F, B ∈ H, where Π is the marginal distribution of ϑ, called the prior distribution of ϑ in this setup. Finally, we can define the posterior distribution of ϑ, which is the conditional distribution of ϑ given X = x,

Π(B|x) = P (ϑ ∈ B|X = x), (A.2)

for B ∈ H, P^Π-almost surely. We can now present a version of Bayes's formula, as given in Ghosal and van der Vaart (2017) [27]. For a dominated collection of probability distributions {P_θ : θ ∈ Θ}, the distributions P_θ of the statistical model are defined such that they admit jointly measurable densities relative to a σ-finite measure µ on the sample space (X, F). We assume that there exists a jointly measurable map (x, θ) → p_θ(x) such that P_θ(A) = ∫_A p_θ(x) dµ(x) for every A ∈ F. In general, the following holds:

∫_A Π(B | X) dP^Π = ∫_B P_θ(A) dΠ(θ),   (A.3)

where θ ↦ P_θ(A) must be measurable for all A. Moreover, whenever {P_θ : θ ∈ Θ} is dominated, a version of the posterior distribution (A.2) is a solution of (A.3) and is given by

Π(B | x) = ∫_B p_θ(x) dΠ(θ) / ∫ p_θ(x) dΠ(θ),   (A.4)

which holds when the denominator ∫ p_θ(x) dΠ(θ) is not zero. If there exists an x for which the denominator is zero, then the posterior distribution (A.2) is taken equal to Q(B) for an arbitrary probability measure Q on (Θ, H).


Proposition 3 (Bayes's rule). For a dominated collection of probability distributions {P_θ : θ ∈ Θ}, if there exist a σ-finite measure µ on the sample space (X, F) and jointly measurable maps (x, θ) → p_θ(x) such that P_θ(A) = ∫_A p_θ(x) dµ(x) for every A ∈ F, then (A.4) gives an expression for the posterior distribution.

A detailed proof of this proposition can be found e.g. in [11]. The denominator ∫ p_θ(x) dΠ(θ) in (A.4) acts as the norming constant of the density, so that the quotient (A.4) equals 1 for B = Θ. Therefore, we can rewrite Bayes's formula as

dΠ(θ | x) ∝ p_θ(x) dΠ(θ),   (A.5)

or, in words, posterior ∝ likelihood × prior,

where ∝ means ’proportional to’.

A.2. Eigendecomposition of a matrix

By eigendecomposition of a matrix we mean the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Only diagonalizable matrices can be factorized in this way.

Let A be a square diagonalizable n × n matrix, with n linearly independent eigenvectors q_i and eigenvalues λ_i for i = 1, . . . , n. Then A can be factorized as

A = QΛQ−1 (A.6)

where Q is the square n × n matrix whose i-th column is the eigenvector q_i of A and Λ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues, namely Λ_{ii} = λ_i. In particular, a symmetric matrix A admitting the above eigendecomposition (with orthonormal eigenvectors) can be written as

A = ∑_{i=1}^n λ_i q_i q_i^T,   (A.7)

the determinant of A is the product of the eigenvalues

|A| = ∏_{i=1}^{N_λ} λ_i^{n_i},   (A.8)

and the trace of A is the sum of the eigenvalues

Tr(A) = ∑_{i=1}^{N_λ} n_i λ_i,   (A.9)

where, in both (A.8) and (A.9), N_λ is the number of distinct eigenvalues and n_i the algebraic multiplicity of λ_i.


Finally, if all the eigenvalues of A are different from zero, then A is nonsingular and its inverse A^{-1} has the eigendecomposition

A−1 = QΛ−1Q−1 (A.10)

where (Λ^{-1})_{ii} = 1/λ_i.
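As a quick numerical illustration of (A.6), (A.8) and (A.9) (not part of the original appendix), the following Mathematica lines check the factorization, the determinant and the trace on a random symmetric matrix:

A = (# + Transpose[#])/2 &[RandomReal[{-1, 1}, {4, 4}]];   (* random symmetric 4 x 4 matrix *)
{lam, vecs} = Eigensystem[A];                               (* eigenvalues and eigenvectors (as rows) *)
Q = Transpose[vecs];                                        (* eigenvectors as columns *)
Max@Abs[A - Q . DiagonalMatrix[lam] . Inverse[Q]]           (* ~ 0: checks A = Q Lambda Q^-1 *)
{Det[A] - Times @@ lam, Tr[A] - Total[lam]}                 (* ~ {0, 0}: checks (A.8) and (A.9) *)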

A.3. Matrix Derivatives

Derivatives of the elements of an inverse matrix:

∂A^{-1}/∂λ = −A^{-1} (∂A/∂λ) A^{-1},   (A.11)

where ∂A/∂λ is the matrix of elementwise derivatives. For the log determinant of a positive definite symmetric matrix we have

∂ log|A| / ∂λ = Tr( A^{-1} ∂A/∂λ ).   (A.12)

A.4. Gaussian Identities

Let us suppose we have a multivariate Gaussian vector X ∼ N(µ_x, Σ), with mean vector µ_x ∈ R^N and N × N covariance matrix Σ, whose joint probability density is given by

p(x) = (1/√((2π)^N |Σ|)) exp( −(1/2) (x − µ_x)^T Σ^{-1} (x − µ_x) ).   (A.13)

Now let Y and Z be jointly Gaussian random vectors,

[Y; Z] ∼ N( [µ_y; µ_z], [A, C; C^T, B] ),

then the marginal distribution of Y and the conditional distribution of Y given Z are given by

Y ∼ N (µy, A), (A.14)

Y |Z ∼ N (µy + CB−1(Z − µz), A− CB−1CT ), (A.15)

see e.g. [1].


B. Appendix B

The nonlinear conjugate gradient method is an adaptation of the linear conjugate gradient method to a nonlinear (convex) function f. Therefore, let us briefly review the latter approach before turning to the former. The material is taken from Nocedal and Wright (2006) [6].

B.1. The Linear Conjugate Gradient Method

The conjugate gradient method is an iterative method for solving linear systems of equations

Ax = b, (B.1)

where A is an n × n symmetric positive definite matrix. The problem can be stated equivalently as a minimization problem for φ(x), where

φ(x) := (1/2) x^T A x − b^T x,   (B.2)

and (B.1) and (B.2) have the same unique solution. We notice that

∇φ(x) = Ax− b =: r(x), (B.3)

hence, in particular, at x = xk we have

rk = Axk − b. (B.4)

Definition B.1. A set of nonzero vectors {p_0, . . . , p_d} is said to be conjugate with respect to the symmetric positive definite matrix A if

p_i^T A p_j = 0   for all i ≠ j.   (B.5)

If a set of vectors has the conjugacy property then they are also linearly independent [6]. Given a starting point x_0 ∈ R^n and a set of conjugate vectors {p_0, . . . , p_{n−1}}, let us generate the sequence {x_k} by setting

xk+1 = xk + αkpk, (B.6)

where αk is the one-dimensional minimizer of φ(·) along xk + αpk, given by

α_k = − r_k^T p_k / (p_k^T A p_k).   (B.7)

A sequence of iterates generated as in (B.6) converges to the solution x* in at most n steps [6].

The conjugate gradient method has the important property that, in generating its set of conjugate vectors, it can compute a new vector p_k using only the previous one, p_{k−1}, and p_k is automatically conjugate to p_0, . . . , p_{k−1}. This saves storage and computation. In the conjugate gradient method p_k is chosen as

pk = −rk + βkpk−1, (B.8)

where β_k is a scalar determined by requiring p_k and p_{k−1} to be conjugate with respect to A, namely

β_k = r_k^T A p_{k−1} / (p_{k−1}^T A p_{k−1}).   (B.9)

Then p_0 is chosen as the steepest descent direction at the initial point x_0, and we perform successive one-dimensional minimizations along each of the search directions. If we replace α_k and β_{k+1} by

α_k = r_k^T r_k / (p_k^T A p_k),

β_{k+1} = r_{k+1}^T r_{k+1} / (r_k^T r_k),

we obtain the following standard and more efficient conjugate gradient algorithm:

(CG) Algorithm
Given x_0;
Set r_0 ← A x_0 − b, p_0 ← −r_0, k ← 0;
while r_k ≠ 0
    α_k ← r_k^T r_k / (p_k^T A p_k);   (B.10)
    x_{k+1} ← x_k + α_k p_k;   (B.11)
    r_{k+1} ← r_k + α_k A p_k;   (B.12)
    β_{k+1} ← r_{k+1}^T r_{k+1} / (r_k^T r_k);   (B.13)
    p_{k+1} ← −r_{k+1} + β_{k+1} p_k;   (B.14)
    k ← k + 1   (B.15)
end (while)


B.2. The Nonlinear Conjugate Gradient Method

The first nonlinear conjugate gradient method was proposed by Fletcher and Reeves (1964) [23] as follows. Compared with the previous algorithm, in place of the formula for α_k we need to perform a line search that identifies an approximate minimum of f in the direction of p_k, and the residual r, which is simply the gradient of φ, must be replaced by the gradient of the nonlinear objective function f. These modifications yield the following algorithm for nonlinear optimization.

(FR) Algorithm
Given x_0;
Evaluate f_0 = f(x_0), ∇f_0 = ∇f(x_0);
Set p_0 ← −∇f_0, k ← 0;
while ∇f_k ≠ 0
    Compute α_k and set x_{k+1} = x_k + α_k p_k;
    Evaluate ∇f_{k+1};
    β^{FR}_{k+1} ← ∇f_{k+1}^T ∇f_{k+1} / (∇f_k^T ∇f_k);   (B.16)
    p_{k+1} ← −∇f_{k+1} + β^{FR}_{k+1} p_k;   (B.17)
    k ← k + 1   (B.18)
end (while)

We need to be more precise about the choice of the line search parameter α_k. Because of the second term in the direction update (B.17), the search direction p_k may fail to be a descent direction unless α_k satisfies certain conditions. Taking the inner product of (B.17) (with the index shifted by one) with the gradient vector ∇f_k, we obtain

∇fTk pk = −||∇fk||2 + βFRk ∇fTk pk−1. (B.19)

If the line search is exact, namely α_{k−1} is a local minimizer of f along p_{k−1}, we have ∇f_k^T p_{k−1} = 0, and in this case (B.19) gives ∇f_k^T p_k < 0, so that p_k is indeed a descent direction. If the line search is not exact, we may have ∇f_k^T p_k > 0, implying that p_k is actually an ascent direction. We can avoid this by imposing the (strong) Wolfe conditions [6],

f(xk + αkpk) ≤ f(xk) + c1αk∇fTk pk, (B.20)

|∇f(xk + αkpk)T pk| ≤ −c2∇fTk pk, (B.21)

where 0 < c_1 < c_2 < 1/2.

An alternative method, which often works better in practice (it is more efficient and robust), is that of Polak and Ribiere [6], which differs from the FR method in the choice of the parameter β_k, as follows:

β^{PR}_{k+1} = ∇f_{k+1}^T (∇f_{k+1} − ∇f_k) / ||∇f_k||².   (B.22)

In Mathematica, the default conjugate gradient method is Polak-Ribiere, but Fletcher-Reeves can be chosen as well. The advantage of the conjugate gradient algorithm is that it uses relatively little memory for large nonlinear optimization problems and requires no numerical linear algebra: only evaluations of the objective function and its gradient, and no matrix operations for the step computation. A drawback is that it typically converges much more slowly than Newton or quasi-Newton methods. However, returning to our study, the conjugate gradient algorithm was the only one that worked for every number of observations n and for every function we needed to maximize.
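For completeness, here is a compact, self-contained sketch of a Polak-Ribiere nonlinear conjugate gradient iteration with a simple backtracking line search, written in Mathematica. It is an illustration of the scheme described above under simplifying assumptions (Armijo backtracking instead of a strong-Wolfe line search, and a PR+ restart rule), and is not the implementation used inside FindMaximum.

(* minimal nonlinear CG (Polak-Ribiere) for minimizing f; f and gradf take a vector argument *)
nonlinearCG[f_, gradf_, x0_, maxIter_:200, tol_:10.^-6] :=
  Module[{x = N[x0], g, gOld, p, a, b, k = 0},
   g = gradf[x]; p = -g;
   While[Norm[g] > tol && k < maxIter,
    If[g . p >= 0, p = -g];                           (* safeguard: restart along steepest descent *)
    a = 1.;                                           (* backtracking (Armijo) line search *)
    While[f[x + a p] > f[x] + 10.^-4 a g . p && a > 10.^-12, a /= 2];
    x = x + a p;
    gOld = g; g = gradf[x];
    b = Max[0., g . (g - gOld)/(gOld . gOld)];        (* Polak-Ribiere beta, clipped at 0 (restart) *)
    p = -g + b p;
    k++];
   x]
(* e.g. to maximize the profile log-likelihood one would minimize its negative,
   nonlinearCG[-logL[#[[1]], #[[2]]] &, grad, start], for a suitable gradient function grad *)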

Finally, one issue that arises with nonlinear conjugate gradient methods is when to restart them. The usual procedure is to restart the iteration every n steps by setting β_k = 0 in (B.16), that is, by taking a steepest descent step. Restarting serves to periodically refresh the algorithm, erasing information that may no longer be beneficial. This leads to n-step quadratic convergence [6], that is,

||x_{k+n} − x*|| = O(||x_k − x*||²).   (B.23)

An alternative, since the above strategy may not be practical for problems with large n, is to restart whenever two consecutive gradients are far from orthogonal, as measured by the test

|∇f_k^T ∇f_{k−1}| / ||∇f_k||² ≥ ν,   (B.24)

where a typical value of the threshold ν is 0.1.


Popular summary

Gaussian process regression is a useful and extremely powerful methodology which has been used in many fields, such as machine learning, classification, geostatistics, genetics and so on.

Consider the usual regression model used for prediction and inference. In the Gaussian process regression framework, instead of choosing a specific predictive function f, such as a linear or polynomial function of the inputs, we assume that the predictive function is Gaussian distributed, so that we can specify a mean and a covariance function. In particular, the covariance function deeply characterizes and affects the performance, and hence the predictive value, of the GPR model. Moreover, the covariance function depends on free parameters, called hyperparameters, which need to be properly determined. The common approach is to estimate them from the training data by maximum marginal likelihood. A potential problem one can encounter is the existence of multiple local maxima, so that the estimated hyperparameters may not correspond to a global maximum.

This thesis focuses on a specific and simple GPR model which depends on only two hyperparameters. The work combines a theoretical and a numerical study of the marginal likelihood as a function of the hyperparameters. First, its asymptotic behaviour is studied and properties of the more general behaviour of the function are derived, such as its limits, the equations of its asymptotes and, in a restricted case, of its ridges as well. Then we study the function numerically, implementing optimisation algorithms in order to find the local maxima and locate the global maximum of the function. Moreover, we run simulations in order to estimate the probability of having one, two or more local optima. Finally, the main and most important goal of this thesis is to find an automatic and systematic method to detect a number of (and possibly all) the local maxima of the marginal likelihood.


Bibliography

[1] Rasmussen C. E. and Williams C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

[2] Carlin B. P. and Louis T. A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall.

[3] Pawitan Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Clarendon Press, Oxford.

[4] Ebden M. (August 2008). Gaussian Processes for Regression: A Quick Introduction.

[5] Dennis Jr. J. E. and Schnabel R. B. (1996). Numerical Methods for Unconstrained Optimization and Nonlinear Equations (Vol. 16). SIAM.

[6] Wright S. J. and Nocedal J. (2006). Numerical Optimization (Second Edition). Springer Science.

[7] Brent R. P. (2013). Algorithms for Minimization Without Derivatives. Courier Corporation.

[8] Spieksma F. (2017). An Introduction to Stochastic Processes in Continuous Time: the non-Jip-and-Janneke-language approach. Lecture notes, Leiden University.

[9] Wolfram Mathematica Tutorial Collection (2008). Unconstrained Optimization. Wolfram Research, Inc.

[10] Schilling R. L. and Partzsch L. (2012). Brownian Motion. An Introduction to Stochastic Processes. Walter de Gruyter GmbH and Co KG.

[11] Szabo B. and van der Vaart A. (2017). Bayesian Statistics. Lecture notes, Leiden University.

[12] Schervish M. J. (1995). Theory of Statistics. Springer.

[13] Chen Z. and Wang B. (2017). How priors of initial hyperparameters affect Gaussian process regression models. arXiv 1605.07906.

[14] Neal R. M. (1997). Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. arXiv physics/9701026.

[15] Wilson A. G. (2014). Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. PhD Thesis, University of Cambridge.


[16] Duvenaud D., Lloyd J. R., Grosse R., Tenenbaum J. B. and Ghahramani Z. (2013). Structure Discovery in Nonparametric Regression through Compositional Kernel Search. arXiv 1302.4922.

[17] Williams C. K. and Rasmussen C. E. (1996). Gaussian processes for regression. In: Advances in Neural Information Processing Systems.

[18] MacKay D. J. Gaussian processes - a replacement for supervised neural networks?

[19] Rasmussen C. E. (1996). Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. PhD Thesis, University of Toronto, pp. 49-67.

[20] Jin C., Zhang Y., Balakrishnan S., Wainwright M. J. and Jordan M. (2016). Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences. arXiv 1609.00978.

[21] Schirru A., Pampuri S., De Nicolao G. and McLoone S. (2011). Efficient Marginal Likelihood Computation for Gaussian Processes and Kernel Ridge Regression. arXiv 1110.6546.

[22] Szabo B. T., van der Vaart A. W. and van Zanten J. H. (2013). Empirical Bayes scaling of Gaussian priors in the white noise model. Electronic Journal of Statistics, Vol. 7, 991-1018.

[23] Fletcher R. and Reeves C. M. (1964). Function minimization by conjugate gradients. Computer Journal, 7, pp. 149-154.

[24] Shamshirband S., Mohammadi K., Yee L., Petkovic D. and Mostafaeipour A. (2015). A comparative evaluation for identifying the suitability of extreme learning machine to predict horizontal global solar radiation. Renewable and Sustainable Energy Reviews, 52, 1031-1042.

[25] Jahangirzadeh A., Shamshirband S., Aghabozorgi S., Akib S., Basser H., Anuar N. B. and Kiah M. L. M. (2014). A cooperative expert based support vector regression (Co-ESVR) system to determine collar dimensions around bridge pier. Neurocomputing, 140, 172-184.

[26] Billingsley P. (2008). Probability and Measure. John Wiley and Sons.

[27] Ghosal S. and van der Vaart A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.

[28] Brochu E., Cora M. and de Freitas N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with applications to active user modeling and hierarchical reinforcement learning. arXiv preprint 1012.2599.

[29] Rousseau J. and Szabo B. T. (2017). Asymptotic behaviour of the empirical Bayes posteriors associated to maximum marginal likelihood estimator. The Annals of Statistics, Vol. 45, No. 2, 833-865. Institute of Mathematical Statistics.


[30] Petrone S., Rousseau J. and Scricciolo C. (2014). Bayes and empirical Bayes: do they merge? Biometrika, 101, 285-302. MR3215348.

[31] Donnet S., Rivoirard V., Rousseau J. and Scricciolo C. (2014). Posterior contraction rates for empirical Bayes procedures, with applications to Dirichlet Process mixtures. arXiv 1406.4406.