
Regression With Gaussian Measures

Michael J. Meyer

Copyright © April 11, 2004


PREFACE

We treat the basics of Gaussian processes, Gaussian measures, kernel reproducing Hilbert spaces and related topics. All mathematical details are included and every effort is made to keep this as self-contained as possible. Only elementary Hilbert space theory and integration theory as well as basic results from probability theory are assumed.

This is a work in progress and has been written up in haste. Undoubtedly there are mistakes. Please email me at [email protected] if you find mistakes or have suggestions.

Michael J. Meyer
April 11, 2004


Contents

1 Introduction

2 Operators on Hilbert Space
 2.1 Hilbert space basics
 2.2 Adjoint operator
 2.3 Selfadjoint and positive operators
 2.4 Compact operators between Banach spaces
 2.5 Compact selfadjoint operators
 2.6 Compact operators between Hilbert spaces
 2.7 Hilbert-Schmidt and trace class operators
 2.8 Inverse problems and regularization
  2.8.1 Regularization
 2.9 Kernels and integral operators
 2.10 Symmetric kernels
 2.11 L2-Bounded Kernels

3 Reproducing Kernel Hilbert Spaces
 3.1 Positive semidefinite kernels
 3.2 Translation invariant kernels
 3.3 Reproducing kernel Hilbert spaces
 3.4 Bilinear kernel expansion
 3.5 Characterization of functions in HK
 3.6 Kernel domination
 3.7 Approximation in reproducing kernel Hilbert spaces
 3.8 Orthonormal bases
  3.8.1 Second description of H

4 Gaussian Measures
 4.1 Probability measures in Hilbert space
 4.2 Gaussian measures on Hilbert space
 4.3 Cameron-Martin space
 4.4 Regression with Gaussian measures
  4.4.1 Model choices

5 Square Integrable Processes
 5.1 Integrable processes
 5.2 Processes with sample paths in an RKHS

6 Gaussian random fields
 6.1 Definition and construction
  6.1.1 Construction of Gaussian random fields

A Vector Valued Integration

B Conditioning of multinormal Random Vectors

C Orthogonal polynomials
 C.0.2 Legendre polynomials


Chapter 1

Introduction

We will freely use the terminology which will be defined later. Let F be a nonempty set and f : F → R a real valued function on F . Consider the following problem: we have observed the value of f at some points x1, . . . , xn ∈ F as

yj = f(xj), j = 1, . . . , n, (1.1)

and from this we want to estimate f itself. We will follow a Bayesian approach. It is assumed that the function f belongs to a real vector space H of functions on F . A prior probability P is placed on H and the regressor f (the estimate of f in light of the data) is computed as the mean of P conditioned on the data (1.1).

The probability P is defined on the σ-field E generated by the continuous linear functionals on H. If I : (H, E , P ) → H denotes the H-valued random variable defined as I(f) = f (the identity on H) then the mean of the distribution P on H is the expectation EP [I] of I under P , that is, the H-valued integral

EP [I] = ∫H I dP = ∫H f P (df), (1.2)

Do not worry if this sounds needlessly abstract since it is not how things are handled in practice. It merely serves to motivate the procedures below. The vector valued integral (1.2) commutes with all continuous linear functionals Λ on H, that is,

Λ(EP [I]) = EP (Λ ◦ I) = ∫H Λ(f)P (df)

and the same holds true if the ordinary expectation is replaced with a conditional expectation. The regressor f is the conditional expectation

f = EP [I | data] (1.3)

and so we have

Λ(f) = EP [Λ | data] (1.4)

for each continuous linear functional Λ on H. (Note that Λ ◦ I = Λ.) Thus rather than computing the regressor f globally as in (1.3) we compute Λ(f) for enough continuous linear functionals Λ on H to obtain a good view of f . For each x ∈ F let

Ex : f ∈ H ↦ f(x) ∈ R

denote the valuation functional at the point x. If Λ = Ex then Λ(f) = f(x) is our prediction for the value of f at the point x in light of the data (1.1). Note that the data themselves can be written in terms of the valuation functionals as

Ej(f) = yj , 1 ≤ j ≤ n, (1.5)

where Ej = Exj is the evaluation functional at the point xj . With this the regressor f becomes the conditional expectation

f = EP [ I | Ej = yj , j ≤ n ]

and

Λ(f) = EP [Λ | Ej = yj , j ≤ n ], (1.6)

for each continuous linear functional Λ on H. To make this feasible we have to assume that

1. The evaluation functionals Ex, x ∈ F , are continuous on H.

The computation of (1.6) involves only the finite dimensional distribution of the random vector

W = (E1, . . . , En,Λ)

on Rn+1 under the probability P . Note that each continuous linear functional on H is a random variable on the probability space (H, E , P ).

The measure P is called a Gaussian measure on H if every continuous linear functional Λ on H is a normal random variable under P . In this case the distribution of the vector W is automatically Gaussian (multinormal) on Rn+1 and the computation of the conditional expectation (1.6) involves merely routine computations with the multinormal density.
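In finite dimensions this conditioning is elementary linear algebra. The following sketch is one way to carry it out numerically; the zero mean assumption, the squared exponential kernel and the data are illustrative choices, not prescribed by the text.

```python
import numpy as np

def sq_exp_kernel(x, y, length=1.0):
    """A symmetric positive semidefinite covariance kernel (illustrative choice)."""
    return np.exp(-0.5 * ((x[:, None] - y[None, :]) / length) ** 2)

def gp_regressor(x_obs, y_obs, x_new, kernel=sq_exp_kernel, jitter=1e-10):
    """Conditional mean of the zero mean Gaussian vector (f(x_new), f(x_obs))
    given f(x_obs) = y_obs, i.e. the multinormal conditioning behind (1.6)."""
    K_oo = kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_no = kernel(x_new, x_obs)
    # E[f(x_new) | data] = K_no K_oo^{-1} y_obs for a zero mean Gaussian vector
    return K_no @ np.linalg.solve(K_oo, y_obs)

x_obs = np.array([0.0, 0.5, 1.0, 2.0])
y_obs = np.sin(x_obs)
x_new = np.linspace(0.0, 2.0, 5)
print(gp_regressor(x_obs, y_obs, x_new))
```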


We have chosen the particular form (1.1) for the data because this is the standard in regression problems. Note however that our approach applies to all forms of data and predictions which can be articulated in terms of events involving finitely many continuous linear functionals on H.

Regression with Gaussian processes assumes that f is the trajectory of a Gaussian process Z = Z(x) on F . The mean of the process is assumed to be zero and thus the process Z is completely determined by its covariance function K(x, y) which is a symmetric positive semidefinite kernel on F .

The kernel K : F × F → R is a parameter of the regression procedure. The space H is the product space H = R^F of all functions f : F → R and the probability P is the distribution of Z on H. Kolmogoroff’s existence theorem for product measures guarantees the existence of the probability P on H for every symmetric, positive semidefinite kernel K on F .

The space H = R^F is a topological vector space with only one redeeming quality: the evaluation functionals are the coordinate functionals and hence continuous in the product topology on H.

Unfortunately there are essentially no other continuous linear functionals on H. Every continuous linear functional on H is a finite linear combination of coordinate functionals.

Consequently this setup limits us to data presented in the form (1.1) and consequent predictions of values f(x) at other points x ∈ F in a point by point fashion.

There are other disadvantages. For example it requires a substantial effort to extract properties of the admissible functions f , that is, the trajectories of the Gaussian process Z, from properties of the covariance kernel K and the resulting properties are often weaker than desired.

Consequently we take a slightly different approach. We assume instead that f is an element of a separable Hilbert space H of functions on F . P is a Gaussian measure on H defined in terms of an orthonormal basis {ψj} of H and a sequence (σj) of positive numbers (which diagonalize the covariance operator Q of P below).

We can then proceed as above provided that the evaluation functionals are continuous on H. But we also have other options. The data and predictions can be articulated in any fashion which uses only finitely many continuous linear functionals Λ on H. Point estimates are one possibility. Another possibility are the coefficients

Λ(f) = (f, ψk)

of f in the expansion f = ∑j (f, ψj)ψj of f in the basis {ψj} of H.


Here we had to assume that the evaluation functionals are continuous on H. A Hilbert space of functions on F with this property is called a reproducing kernel Hilbert space on F . Such a Hilbert space H defines a unique symmetric, positive semidefinite kernel K : F × F → R. Conversely every symmetric, positive semidefinite kernel K : F × F → R determines a unique reproducing kernel Hilbert space. There is an interesting interplay between orthonormal bases of H and the kernel K.

A basic question is how to find an orthonormal basis for H. If F ⊆ Rd is compact and K is continuous, then we have additional structure in the form of the Euclidean topology and Lebesgue measure on F . Associated with the kernel K we have the integral operator T : L2(F ) → L2(F ) defined by

(Tf)(x) = ∫F K(x, y)f(y) dy, f ∈ L2(F ), x ∈ F,

where dy denotes Lebesgue measure on F . It turns out that T is a Hilbert-Schmidt operator. Consequently the orthogonal complement of the null space of T has an orthonormal basis {φj} consisting of eigenvectors of T . Let λj denote the corresponding eigenvalues. Then the functions

ψj = √λj φj

are an orthonormal basis for the reproducing kernel Hilbert space H with kernel K. This establishes the connection to the spectral theory of compact, selfadjoint operators on a Hilbert space.
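Numerically, the eigenpairs (λj , φj) of the integral operator T can be approximated by discretizing the integral on a grid (a Nystrom-type scheme). A minimal sketch under illustrative assumptions; the interval, grid, kernel and quadrature rule are not taken from the text.

```python
import numpy as np

# Discretize (Tf)(x) = ∫_F K(x, y) f(y) dy on a uniform grid of F = [0, 1].
F_grid = np.linspace(0.0, 1.0, 200)
w = (F_grid[-1] - F_grid[0]) / len(F_grid)               # quadrature weight
K = np.exp(-np.abs(F_grid[:, None] - F_grid[None, :]))   # a continuous kernel

# The matrix K * w is a symmetric discretization of T; its eigenpairs
# approximate (lambda_j, phi_j).
lam, Phi = np.linalg.eigh(K * w)
lam, Phi = lam[::-1], Phi[:, ::-1]                        # decreasing order
Phi = Phi / np.sqrt(w)                                    # approximate L2(F)-normalization

# psi_j = sqrt(lambda_j) * phi_j then approximate an ON-basis of the RKHS H_K.
psi = np.sqrt(np.clip(lam, 0, None)) * Phi
print(lam[:5])
```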

There is another connection. For f ∈ H let Λf be the bounded linear functional Λf (h) = (h, f) on H. The Gaussian measure P on H defines a unique bounded linear operator Q : H → H such that the covariances of the random variables Λf , Λg are given as

CovP (Λf , Λg) = (Qf, g)H , f, g ∈ H. (1.7)

The operator Q is a positive trace class operator. Conversely for every positive trace class operator Q : H → H, there exists a unique Gaussian measure P on H such that (1.7) holds.
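A finite dimensional sketch of this correspondence: in an orthonormal basis that diagonalizes Q with eigenvalues σj^2, a sample from the Gaussian measure has independent N(0, σj^2) coordinates, and Q is trace class exactly when ∑j σj^2 < ∞. The truncation level and the decay of the σj below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
sigma = 1.0 / np.arange(1, n + 1) ** 1.5           # sum(sigma**2) < inf: trace class
# Rows are coordinate vectors of samples f = sum_j sigma_j Z_j psi_j.
samples = sigma * rng.standard_normal((20000, n))

# The empirical covariance of the coordinates recovers Q = diag(sigma^2).
Q_hat = samples.T @ samples / len(samples)
print(np.round(np.diag(Q_hat)[:4], 4), np.round(sigma[:4] ** 2, 4))
```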

Thus the material presents an interesting interaction of functional analysis and probability theory. If you are only interested in the regression problem you need only read Chapter 2, Chapter 3, sections 1-4, 7, 8 and Chapter 4, sections 1, 2, 4.


Chapter 2

Operators on Hilbert Space

In this chapter we develop the spectral theory of compact operators between Hilbert spaces. Our scalars are the reals, that is, we consider only real Hilbert spaces.

2.1 Hilbert space basics

We review the basics of Hilbert space theory. Let H be a (real) Hilbert space with inner product (·, ·). Let

H1 = { x ∈ H : ‖x‖ ≤ 1 }

denote the closed unit ball in H and

S1(H) = { x ∈ H : ‖x‖ = 1 }

the unit sphere in H. For vectors x, y ∈ H we write x ⊥ y (orthogonal) if (x, y) = 0. For subsets A, B of H we write A ⊥ B if a ⊥ b for all a ∈ A and b ∈ B. We let

A⊥ := { x ∈ H | x ⊥ a, for all a ∈ A }.

Then A⊥ is a closed subspace of H. If V is a closed subspace of H, then

H = V + V ⊥,

in particular every closed subspace of H is complemented in H. This is the first fundamental fact about Hilbert spaces. Each element x ∈ H has a unique decomposition x = v + v⊥ with v ∈ V and v⊥ ∈ V ⊥. We have

‖x‖2 = ‖v‖2 + ‖v⊥‖2


(the Law of Pythagoras). The map x ↦ v is called the perpendicular projection onto the subspace V and is denoted πV . If (φj) is an ON-basis of V , then

πV (x) = ∑j (x, φj)φj , x ∈ H. (2.1)
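A small numerical illustration of (2.1) and the Law of Pythagoras; the vectors are arbitrary illustrative choices.

```python
import numpy as np

def project(x, basis):
    """Orthogonal projection pi_V(x) = sum_j (x, phi_j) phi_j onto the span of
    an orthonormal family given as the rows of `basis`, cf. (2.1)."""
    coeffs = basis @ x          # inner products (x, phi_j)
    return basis.T @ coeffs

phi = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])   # ON-basis of a subspace V of R^3
x = np.array([3.0, -2.0, 5.0])
v = project(x, phi)
# Law of Pythagoras: ||x||^2 = ||v||^2 + ||x - v||^2
print(v, np.isclose(x @ x, v @ v + (x - v) @ (x - v)))
```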

The second fundamental property of a Hilbert space H is the fact that the continuous linear functionals on H can be identified with the elements of H: if a ∈ H, then

Λa : x ∈ H ↦ (x, a) ∈ R

defines a continuous linear functional on H. The converse is also true: every continuous linear functional on H has this form (Riesz Representation Theorem).

Bilinear forms. Let X and Y be Hilbert spaces. A function ψ = ψ(x, y) : X × Y → R is called a bilinear form if it is linear in both variables x and y. The bilinear form ψ is called continuous if

‖ψ‖ = sup { |ψ(x, y)| : x ∈ X1, y ∈ Y1 } < ∞. (2.2)

In this case |ψ(x, y)| ≤ ‖ψ‖ ‖x‖ ‖y‖, for all x ∈ X and y ∈ Y . Note that the closed unit balls X1, Y1 can be replaced with the unit spheres S1(X), S1(Y ) with no effect on the definition of the norm of ψ.

If A : X → Y is a bounded linear operator, then ψ(x, y) = (Ax, y) defines a continuous bilinear form on X × Y with ‖ψ‖ = ‖A‖. Conversely

Theorem 2.1.1 (Lax-Milgram). Let ψ = ψ(x, y) be a continuous bilinear form on X × Y . Then there exists a bounded linear operator A : X → Y such that ψ(x, y) = (Ax, y)Y , for all x ∈ X and y ∈ Y .

Proof. Fix x ∈ X. Then Λx(y) = ψ(x, y) is a continuous linear functional on Y . By the Riesz Representation Theorem there exists an element a ∈ Y with Λx(y) = (a, y)Y , for all y ∈ Y . Clearly a is uniquely determined by x. Write a = Ax. This defines a map A : X → Y which satisfies ψ(x, y) = (Ax, y).

The uniqueness of a and linearity of ψ in the first argument imply that the map A is linear. The continuity of ψ implies that A is continuous.

If X = Y = H, then a bilinear form ψ = ψ(x, y) on X × Y is called a bilinear form on H. Such a bilinear form is called symmetric if it satisfies ψ(x, y) = ψ(y, x), for all x, y ∈ H. In this case

Proposition 2.1.1. Let ψ = ψ(x, y) be a symmetric bilinear form on H. Then

‖ψ‖ = sup { |ψ(x, x)| : ‖x‖ ≤ 1 } (2.3)


Proof. Let C denote the right hand side of (2.3). Obviously C ≤ ‖ψ‖ and we have to show only the reverse inequality. Write φ(x) = ψ(x, x). Then |φ(x)| ≤ C if ‖x‖ ≤ 1 and, by homogeneity, |φ(x)| ≤ C ‖x‖2 for all x ∈ H. Using the symmetry of ψ we can write

ψ(x, y) = φ((x + y)/2) − φ((x − y)/2).

Recall that H1 denotes the closed unit ball in H. If x, y ∈ H1, then the parallelogram law gives ‖(x + y)/2‖2 + ‖(x − y)/2‖2 = (‖x‖2 + ‖y‖2)/2 ≤ 1 and it follows that

ψ(x, y) ≤ C ‖(x + y)/2‖2 + C ‖(x − y)/2‖2 ≤ C.

Taking the sup over all x, y ∈ H1 now yields ‖ψ‖ ≤ C.

2.2 Adjoint operator

The Lax-Milgram theorem can be used to show the existence of the adjoint operator. Let X, Y be Hilbert spaces and T : X → Y a bounded linear operator. Then ψ(y, x) = (y, Tx)Y is a continuous bilinear form on Y × X. Consequently there exists a bounded linear operator T ∗ : Y → X such that ψ(y, x) = (T ∗y, x)X , for all y ∈ Y and x ∈ X. It is easy to see that the operator T ∗ is uniquely determined by its defining property

(Tx, y) = (x, T ∗y), x ∈ X, y ∈ Y.

Obviously T ∗∗ = T . We note the following

Proposition 2.2.1. We have
(i) N(T ∗T ) = N(T ).
(ii) N(T ∗) = R(T )⊥.
(iii) N(T ) = R(T ∗)⊥.

Proof. (i) If Tx = 0, then T ∗Tx = 0. Conversely, if T ∗Tx = 0, then ‖Tx‖2 = (Tx, Tx) = (T ∗Tx, x) = 0, thus x ∈ N(T ).
(ii) Let w ∈ N(T ∗) and y = Tx for some x ∈ X. Then (y, w) = (x, T ∗w) = 0. Thus w ∈ R(T )⊥.

Conversely, if w ∈ R(T )⊥, then (T ∗w, x) = (w, Tx) = 0, for all x ∈ X. This implies T ∗w = 0 (let x = T ∗w) and so w ∈ N(T ∗). Now (iii) follows from this: replace T with T ∗ and note that T ∗∗ = T .

Remark. By taking orthogonal complements in (ii) and (iii) we obtain R(T ) ⊆ N(T ∗)⊥ and R(T ∗) ⊆ N(T )⊥ but we will not have equality in general since R(T ) and R(T ∗) need not be closed.


For any subset A ⊆ X we have A⊥ = (Ā)⊥, where Ā denotes the closure of A. Thus in (ii) we may replace R(T ) with its closure. Note that this implies that T ∗ is one to one on the closure of the range of T .

2.3 Selfadjoint and positive operators

A bounded linear operator T on H is called selfadjoint if it satisfies

(Tx, y) = (x, Ty), (2.4)

for all x, y ∈ H. In this case the nullspace N(T ) = { x ∈ H | Tx = 0 } satisfies

N(T ) = R(T )⊥.

The converse R(T ) = N(T )⊥ is not true in general simply because the range R(T ) will not in general be closed.

The number λ is called an eigenvalue of T if there is a nonzero vector x ∈ H with Tx = λx, that is x ∈ N(T − λI), where I is the identity operator on H. We let

Eλ(T ) := N(T − λI) = { x ∈ H | Tx = λx }

denote the eigenspace associated with the eigenvalue λ. Obviously this space is defined whether or not λ is an eigenvalue of T . It is an eigenvalue if and only if Eλ(T ) ≠ {0}. The nonzero elements of Eλ(T ) are called the eigenvectors associated with the eigenvalue λ.

Proposition 2.3.1. Let T be a selfadjoint operator on H. Then λ ≠ µ implies Eλ(T ) ⊥ Eµ(T ), in other words, eigenvectors with respect to different eigenvalues are perpendicular to each other.

Proof. Assume that Tx = λx and Ty = µy. Then λ(x, y) = (Tx, y) = (x, Ty) = µ(x, y). Since λ ≠ µ this implies that (x, y) = 0.

If λ = 0 then the eigenspace Eλ(T ) is simply the nullspace N(T ) and λ = 0 is an eigenvalue of T if and only if T has a nontrivial nullspace. If T is selfadjoint this eigenspace is perpendicular to the range R(T ) and so no eigenvector associated with the eigenvalue zero is in the range of T .

By contrast, if λ ≠ 0, then Eλ(T ) ⊆ R(T ) since every eigenvector associated with λ satisfies x = λ−1Tx.

A subspace V ⊆ H is called T -invariant if it satisfies T (V ) ⊆ V . In this case the restriction of T to V is a linear operator on V .


Proposition 2.3.2. Let T be a selfadjoint operator on H and V ⊆ H a T -invariant subspace. Then the orthogonal complement V ⊥ is also T -invariant.

Proof. Let x ∈ V ⊥. Then for all y ∈ V we have (Tx, y) = (x, Ty) = 0, since Ty ∈ V . Thus Tx ∈ V ⊥.

Assume that V is a closed T -invariant subspace, write H = V + V ⊥ and let T1, T2 denote the restrictions of T to V respectively V ⊥. Then

T = T1 ◦ πV + T2 ◦ πV ⊥ ,

where πV , πV ⊥ are the orthogonal projections onto the subspaces V , V ⊥. Thus the restrictions T1, T2 completely determine the operator T .

Every eigenspace Eλ(T ) of T and in particular the null space N(T ) is T -invariant. Write

H = N(T ) +W,

where W = N(T )⊥. Then the restriction of T to W is a linear operator on W and obviously this restriction completely determines the operator T (since the restriction of T to its null space is simply zero).

Thus we will often be able to disregard the eigenvectors associated with the eigenvalue zero, that is, the eigenvectors in the nullspace of T .

Proposition 2.3.3. If the operator T on H is selfadjoint, then

‖T‖ = sup { |(Tx, x)| : ‖x‖ = 1 } . (2.5)

Proof. Clearly it will suffice to show (2.5) with “‖x‖ = 1” replaced with “‖x‖ ≤ 1”.

Set ψ(x, y) = (x, Ty). Then ψ is a bilinear form with ‖ψ‖ = ‖T‖. Since T is selfadjoint, ψ is symmetric. Now apply (2.3).

Positive operators. A bounded linear operator A on H is called positive if it satisfies

(Ax, x) ≥ 0, for all x ∈ H.

If strict inequality holds for all nonzero x, then A is called strictly positive. For example, if X and Y are Hilbert spaces and T : X → Y a bounded linear operator, then the operator A = T ∗T on X is positive:

(Ax, x) = (T ∗Tx, x) = (Tx, Tx) = ‖Tx‖2 ≥ 0.

Proposition 2.3.4. If the operator A on H is positive, then every eigenvalue λ of A satisfies λ ≥ 0.


Proof. Let x be an eigenvector with eigenvalue λ. Then

λ ‖x‖2 = λ(x, x) = (Ax, x) ≥ 0.

Proposition 2.3.5. If the operator A on H is positive, then the operator αI + A has a bounded inverse on all of H, for each α > 0.

Proof. Let α > 0 and set T = αI + A. Then, for each x ∈ H we have

‖Tx‖2 = α2 ‖x‖2 + 2α(Ax, x) + ‖Ax‖2 ≥ α2 ‖x‖2 .

It follows that T is one to one and has closed range. Moreover T is selfadjoint. Thus R(T )⊥ = N(T ) = {0}. Thus T has dense range. It follows that R(T ) = H and T has an inverse T−1 : H → H as a linear map. The inverse is bounded since ‖Tx‖ ≥ α ‖x‖ implies that ‖T−1y‖ ≤ α−1 ‖y‖.

We will also need the following result

Proposition 2.3.6. If the operator A on H is positive, then there exists a unique positive operator S on H such that A = S2.

The operator S is called the (positive) square root of A and denoted S = √A. The existence of S is a special case of the so called continuous functional calculus which is a consequence of the representation theory of commutative C∗-algebras. This theory is quite easy and provides the most natural proof. The reader is referred to the literature.

2.4 Compact operators between Banach spaces

Let us recall without proof some facts about compact sets in a complete normed space X. A subset A ⊆ X is called relatively compact if the closure of A is compact. The set A is called totally bounded if for each ε > 0 there are finitely many balls B(xi, ε), xi ∈ X, of radius ε which cover A. With this

Theorem 2.4.1. For a subset A ⊆ X the following are equivalent:
(i) A is relatively compact.
(ii) A is totally bounded.
(iii) Each sequence (an) ⊆ A has a subsequence which converges in X.

The proof is given in every class on metric spaces. The limit of the subsequence in (iii) will be in the closure of A but need not be in A itself.


Let X, Y be complete normed spaces. A linear operator T : X → Y is called compact if the image T (B) ⊆ Y of the unit ball B ⊆ X is relatively compact in Y . T is called a finite rank operator if the range R(T ) := T (X) ⊆ Y is finite dimensional. In this case T has the form

T (x) = ∑j<n Λj(x)φj , x ∈ X, (2.6)

where n = dim(R(T )), φj ∈ Y and the Λj are continuous linear functionals on X. Simply let φ0, . . . , φn−1 be a basis for R(T ) and Λj = ψj ◦ T , where ψj is the coordinate functional associated with the basis vector φj , that is,

y = ∑j<n ψj(y)φj , y ∈ R(T ).

Now set y = Tx. Conversely every operator of this form has finite rank with R(T ) ⊆ span(φ0, . . . , φn−1). Since a bounded set in a finite dimensional space is relatively compact (Bolzano-Weierstrass Theorem) every finite rank operator is compact.

Theorem 2.4.2. Let X, Y be complete normed spaces and T : X → Y a linear operator.
(i) If T is a finite rank operator then T is compact.
(ii) If T is the limit in operator norm of compact operators, then T is compact.

Proof. Assume that Tn : X → Y is compact, for each n ≥ 1, and Tn → T in operator norm. Let B ⊆ X be the unit ball and ε > 0. Choose n such that ‖Tn − T‖ < ε/2. There exist finitely many balls B(yi, ε/2) ⊆ Y which cover Tn(B). Then the corresponding balls B(yi, ε) cover T (B). This shows that T (B) is totally bounded.

Let us introduce the following notation: with B(X,Y ) we denote the space of all bounded linear operators T : X → Y . Likewise F (X,Y ) and K(X,Y ) denote the set of finite rank respectively compact operators in B(X,Y ). If X = Y , we write B(X), F (X) and K(X) for B(X,X), F (X,X) and K(X,X).

It is easily verified that F (X,Y ) and K(X,Y ) are in fact subspaces of B(X,Y ). Then from (ii)

F (X,Y ) ⊆ K(X,Y ) ⊆ B(X,Y ).

The converse of (ii) is not true in general but it is true if X and Y are Hilbert spaces as we shall see below. In other words, the closure of F (X,Y ) in operator norm need not equal K(X,Y ) in general, but we do have equality in the case of Hilbert spaces X and Y .


For an operator T ∈ F (X,Y ) we set rank(T ) = dim(R(T )). If T has the form (2.6), then rank(T ) = n if φ0, . . . , φn−1 are linearly independent.

Let T ∈ K(X,Y ). Then the image T (D) ⊆ Y of each bounded subset D ⊆ X is relatively compact. Using (2.4.1) we see

Proposition 2.4.1. Let T ∈ B(X,Y ). Then T is compact if and only if the sequence (Txn) ⊆ Y has a convergent subsequence for each bounded sequence (xn) ⊆ X.

Let A be any set and τ , σ topologies on A with τ ⊆ σ. If τ is a Hausdorff topology and A compact in the topology σ then τ = σ.

It will suffice to show that each σ-closed set F ⊆ A is τ -closed. Indeed, F is σ-compact and hence τ -compact (every cover with τ -open sets is a cover with σ-open sets). Since τ is Hausdorff it follows that F is τ -closed.

Let X be a normed space and X∗ the space of all continuous linear functionals on X. Recall that the weak topology on X is the weakest topology in which all functionals F ∈ X∗ are continuous. Clearly this topology is weaker than the norm topology on X. It is a Hausdorff topology (the continuous linear functionals on a normed space X separate points on X).

The observation above shows that the weak topology agrees with the norm topology on every norm compact subset of X. Recall that a sequence (xn) ⊆ X satisfies xn → x weakly (in the weak topology) if and only if F (xn) → F (x), for each continuous linear functional F ∈ X∗.

Proposition 2.4.2. Let T ∈ B(X,Y ) be compact and (xn) ⊆ X bounded. If xn → x ∈ X weakly, then Txn → Tx in norm.

Proof. Since T is bounded the weak convergence xn → x ∈ X implies the weak convergence Txn → Tx. Choose a bounded subset B ⊆ X with (xn) ⊆ B and x ∈ B. Then the closure K of T (B) in Y is compact. Consequently the weak topology agrees with the norm topology on K. Since Txn, Tx ∈ K and Txn → Tx weakly it follows that Txn → Tx in norm.

Remark. A weakly convergent sequence (xn) is automatically bounded, that is, the assumption of boundedness above is superfluous but we don’t need this result. If (xn) is weakly convergent then it is weakly bounded, i.e.

supn |F (xn)| < ∞,

for each continuous linear functional F ∈ X∗. The Uniform Boundedness Principle now implies that the sequence (xn) is bounded in norm.


Exercise. Let X, Y , Z be complete normed spaces and T : X → Y , S : Y → Z bounded linear operators. If one of S, T is compact then so is the product ST .

Hint: regardless of compactness T maps bounded sets to bounded sets and S maps relatively compact sets to relatively compact sets.

We conclude this section with a characterization of compact operators on Hilbert space

Theorem 2.4.3. Let X and Y be Hilbert spaces and T ∈ B(X,Y ) a bounded linear operator. Then T is compact if and only if ‖Ten‖ → 0, for each orthonormal sequence (en) ⊆ X.

Proof. (⇒) Assume that T is compact and let (en) ⊆ X be an orthonormal sequence. Then

∑n |(x, en)|2 ≤ ‖x‖2 < ∞

and so (x, en) → 0, as n ↑ ∞, for each x ∈ X. By the Riesz representation theorem this means F (en) → 0, for each continuous linear functional F ∈ X∗, that is, en → 0 weakly in X. According to 2.4.2 the compactness of T now implies Ten → 0 in norm.

(⇐) Recall that N1 denotes the closed unit ball of a normed space N . Assume that T is not compact and hence T (X1) ⊆ Y not totally bounded. Let ε > 0 be such that the closure T (X1) cannot be covered with finitely many balls of radius 2ε. We construct an orthonormal sequence (en) ⊆ X such that ‖Ten‖ ≥ ε, for all n ≥ 1.

(A) We claim that for every finite dimensional subspace N ⊆ X there exists e ∈ N⊥ with ‖e‖ = 1 and ‖Te‖ ≥ ε.

If this were not true let N ⊆ X be a finite dimensional subspace such that ‖Te‖ ≤ ε, for all e ∈ V := N⊥ with ‖e‖ ≤ 1, that is T (V1) ⊆ εY1.

Note that T (N1) ⊆ Y is compact and hence can be covered by finitely many balls B(yj , ε) of radius ε. Since X1 ⊆ N1 + V1 we have T (X1) ⊆ T (N1) + T (V1). It follows that T (X1) is covered by the balls B(yj , 2ε) in contradiction to the choice of ε. This shows (A).

(B) Now we can construct the sequence (en) by induction. Using (A) with N = {0} find e0 with ‖Te0‖ ≥ ε. Given that orthonormal e0, . . . , en with ‖Tej‖ ≥ ε have already been constructed set N = span(e0, . . . , en) and choose en+1 ∈ N⊥ with ‖en+1‖ = 1 such that ‖Ten+1‖ ≥ ε. Then the sequence e0, . . . , en+1 is orthonormal and the construction continues.


2.5 Compact selfadjoint operators

Let T be a compact, selfadjoint operator on a Hilbert space H. Then T can be diagonalized in the sense that there is an orthonormal basis for H consisting of eigenvectors of T . This result makes it very easy to work with such operators. For the proof we need the following

Lemma 2.5.1. Let T be a compact, selfadjoint operator on H. Then at least one of λ = ‖T‖ or λ = −‖T‖ is an eigenvalue of T .

Proof. We may assume that T ≠ 0. From (2.5) we get a sequence of vectors xn ∈ H with ‖xn‖ = 1 and λ such that |λ| = ‖T‖ and (Txn, xn) → λ, as n ↑ ∞. Then, for each n ≥ 0 we have

0 ≤ ‖Txn − λxn‖2 = ‖Txn‖2 − 2λ(Txn, xn) + λ2 ‖xn‖2 (2.7)
≤ ‖T‖2 − 2λ(Txn, xn) + λ2. (2.8)

As n ↑ ∞, the rightmost quantity converges to 2λ2 − 2λ2 = 0. Thus we also have Txn − λxn → 0. Set yn = Txn. By compactness of T the sequence yn has a convergent subsequence. Passing to this subsequence we may assume that the sequence yn is itself convergent. But then the sequence xn = λ−1(yn − (yn − λxn)) converges also. Since Txn − λxn → 0 the limit x = limn xn must satisfy Tx = λx. Since ‖xn‖ = 1, for all n, we have ‖x‖ = 1.

With this we can now prove the main result about compact selfadjoint operators:

Theorem 2.5.1. Let T be a compact, selfadjoint operator on H. Then there exists an orthonormal basis for H consisting of eigenvectors of T . More precisely N(T )⊥ has a countable orthonormal basis (φj) consisting of eigenvectors of T and if λj are the associated eigenvalues, then

Tx = ∑j λj(x, φj)φj , x ∈ H,

where the series converges in the norm of H. If the sequence (φj) is infinite, then λj → 0, as j ↑ ∞.

Proof. By induction we construct a (possibly finite) sequence of numbers λj ≠ 0 and orthonormal vectors φj such that
(i) Tφj = λjφj ,
(ii) the restriction Tj of T to {φ0, . . . , φj−1}⊥ satisfies ‖Tj‖ = |λj |, and
(iii) T = 0 on {φ0, φ1, . . . }⊥.


Since the λj are nonzero, each φj is in N(T )⊥ and from (iii) it follows that the φj span all of N(T )⊥ (recall that (A⊥)⊥ is the closed linear span of A).

The quantities λ0 and φ0 exist by lemma (2.5.1). Assume that λ0, . . . , λj and φ0, . . . , φj have already been constructed. Set

Xj = {φ0, . . . , φj}⊥.

If T = 0 on Xj , then we are finished. Otherwise note that Xj is a closed T -invariant subspace (since span(φ0, . . . , φj) is T -invariant). The restriction Tj of T to Xj is a compact selfadjoint operator on Xj . Applying lemma (2.5.1) to Tj we see that there is a unit vector φj+1 ∈ Xj and a number λj+1 such that

(a) |λj+1| = ‖Tj‖ and
(b) Tφj+1 = Tjφj+1 = λj+1φj+1.

Obviously φj+1 ⊥ φ0, . . . , φj and so the resulting sequence (φj) is orthonormal. If Tj = 0 at any time, then (iii) is already satisfied and we are finished.

Assume now that Tj ≠ 0, for all j ≥ 0, set X = {φ0, φ1, . . . }⊥ and let S be the restriction of T to X. We must show that S = 0.

From (ii) it follows that |λ0| ≥ |λ1| ≥ · · · ≥ |λj | ≥ ‖S‖, for all j ≥ 0, and so it will suffice to show that λj → 0 as j ↑ ∞.

If λj ↛ 0, we have |λj | ≥ ρ for some number ρ > 0. Then the sequence (φj/λj) ⊆ H is bounded and by compactness of T the sequence

yj = T (φj/λj) = φj

has a convergent subsequence. However this contradicts the fact that the sequence φj is orthonormal and hence ‖φj − φk‖ = √2, for all j ≠ k. Consequently we must have λj → 0.
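In finite dimensions Theorem 2.5.1 is the familiar diagonalization of a symmetric matrix. The following sketch (with an arbitrary illustrative matrix) checks the expansion Tx = ∑j λj(x, φj)φj numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
T = (A + A.T) / 2                      # selfadjoint
lam, Phi = np.linalg.eigh(T)           # columns of Phi: ON eigenvectors phi_j

x = rng.standard_normal(5)
Tx_series = sum(lam[j] * (x @ Phi[:, j]) * Phi[:, j] for j in range(5))
print(np.allclose(T @ x, Tx_series))   # True
```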

Remark 2.5.1 (Spectrum). We claim that the sequence (λj) contains all the nonzero eigenvalues of T . If λ ≠ λj , 0 were another eigenvalue, the associated eigenspace would be contained in N(T )⊥ and perpendicular to all the φj which contradicts the fact that the φj span N(T )⊥. It follows that the λj contain all the nonzero eigenvalues of T . Note also that the convergence λj → 0 implies that the eigenspaces corresponding to nonzero eigenvalues are all finite dimensional.

The sequence (λj) contains all nonzero eigenvalues of T but what about the spectrum of T , that is the set

σ(T ) = { λ ∈ R | T − λI is not invertible on H }?


Let us assume that H is not finite dimensional. Then the unit ball H1 is not compact. It follows that T is not invertible, that is, 0 ∈ σ(T ) (regardless of whether 0 is an eigenvalue or not). However, if λ ≠ λj , 0, for all j ≥ 0, then it can be shown that the operator T − λI is invertible on H. To compute (T − λI)−1 we must solve

(T − λI)x = y (2.9)

for x in terms of y. Write V = N(T ) and x = πV (x) + πV ⊥(x) as well as y = πV (y) + πV ⊥(y). With this (2.9) becomes

−λπV (x) + (T − λI)πV ⊥(x) = πV (y) + πV ⊥(y)

and since V ⊥ is T -invariant and hence (T − λI)-invariant, this is equivalent with

−λπV (x) = πV (y) and (T − λI)πV ⊥(x) = πV ⊥(y) (2.10)

Since the φj are an ON-basis for V ⊥ we have πV ⊥(y) = ∑j (y, φj)φj and πV ⊥(x) = ∑j αjφj with αj to be determined. Note that (T − λI)φj = (λj − λ)φj . With this (2.10) becomes

∑j αj(λj − λ)φj = ∑j (y, φj)φj

which solves for αj = (y, φj)/(λj − λ) resulting in

x = πV (x) + πV ⊥(x) = −(1/λ)πV (y) + ∑j [(y, φj)/(λj − λ)] φj .

The solution x exists for each y and is a continuous linear function of y, in other words

(T − λI)−1y = −(1/λ)πV (y) + ∑j [(y, φj)/(λj − λ)] φj

exists as a continuous linear operator on H. Consequently the point λ is not in the spectrum of T and we have shown that

σ(T ) = {λj} ∪ {0}.

Remark 2.5.2 (Range). The series expansion (2.5.1) also allows us to determine the range R(T ) quite easily. Let y ∈ H and consider the equation

Tx = y. (2.11)

If this equation has a solution x, then y ∈ N(T )⊥. Assume now that y ∈ N(T )⊥. Then we have an expansion y = ∑j (y, φj)φj . Clearly to find x ∈ H with Tx = y we can restrict ourselves to x ∈ N(T )⊥. Such x will then have an expansion

x = ∑j αjφj (2.12)

with αj to be determined. In terms of these series expansions (2.11) becomes

∑j αjλjφj = Tx = y = ∑j (y, φj)φj

which implies that we must have αj = (y, φj)/λj . However for these αj the series (2.12) converges exactly if ∑j λj^−2 |(y, φj)|2 < ∞. It follows that

R(T ) = { y ∈ N(T )⊥ : ∑j λj^−2 |(y, φj)|2 < ∞ }

2.6 Compact operators between Hilbert spaces

The case of a general compact operator T : X → Y between Hilbert spaces X and Y can be reduced to the selfadjoint case by observing that the product T ∗T is a compact, selfadjoint operator on X. The results of the last section then carry over with minimal changes.

Let X and Y be Hilbert spaces, T ∈ B(X,Y ). A singular system for T is a sequence (µj , φj , ξj)j where

(i) µ0 ≥ µ1 ≥ · · · ≥ µn ≥ · · · > 0,
(ii) {φj} is an ON-basis for N(T )⊥,
(iii) {ξj} is an ON-basis for N(T ∗)⊥, and
(iv) Tφj = µjξj and T ∗ξj = µjφj , for all j ≥ 0.

Assume that (µj , φj , ξj)j is such a system, set V = N(T )⊥ and let x ∈ X. Then the orthogonal projection πV (x) of x on V has an expansion

πV (x) = ∑j (x, φj)φj

and applying T to this expansion it follows that

Tx = TπV (x) = ∑j µj(x, φj)ξj (2.13)

with convergence pointwise on X. For φ ∈ X and ξ ∈ Y define the rank one operator S = φ ⊗ ξ as

Sx = (x, φ)ξ ∈ Y, x ∈ X.


Then the above expansion for T can be rewritten as

T = ∑j µj(φj ⊗ ξj) (2.14)

where the series converges pointwise on X. Set

Tn = ∑j<n µj(φj ⊗ ξj) (2.15)

and let x ∈ X. Using (i) and the orthonormality of the ξn we have

‖(T − Tn)x‖2 = ‖∑j≥n µj(x, φj)ξj‖2 = ∑j≥n µj^2 |(x, φj)|2 ≤ µn^2 ∑j≥n |(x, φj)|2 ≤ µn^2 ‖x‖2 .

This shows that

‖T − Tn‖ ≤ µn (2.16)

in operator norm. Letting x = φn above we see that we actually have equality. Consequently, if µn → 0, then the series (2.14) converges in operator norm and hence T is compact.

Not every operator T ∈ B(X,Y ) has a singular system. However, if X = Y and T ∈ K(X) is selfadjoint, let φj be the eigenvectors associated with the nonzero eigenvalues λj of T arranged in order of decreasing absolute value. Then (µj , φj , ξj)j with µj = |λj | and ξj = sign(λj)φj is a singular system for T . This is exactly the content of Theorem 2.5.1. Now we generalize this fact to all compact operators T ∈ K(X,Y ):

Theorem 2.6.1. Let T : X → Y be a compact operator, set A = T ∗T , note that A is compact and selfadjoint on X and let φj be the eigenvectors associated with the nonzero eigenvalues λj of A arranged in decreasing order. Then

µj = √λj and ξj = µj^−1 Tφj

defines a singular system (µj , φj , ξj)j for T . We have µn → 0 and hence the series (2.14) converges in operator norm. In particular T is the limit of finite rank operators.

Proof. Note first that N(A) = N(T ) according to (2.2.1). Thus the φj are an ON-basis for N(T )⊥. By definition of (µj , φj , ξj) we have Tφj = µjξj and T ∗Tφj = µj^2 φj and this implies that T ∗ξj = µjφj . We claim that {ξj} is an ON-basis for N(T ∗)⊥. Indeed, for j, k ≥ 0 we have

(ξj , ξk) = (µj^−1 Tφj , µk^−1 Tφk) = (µjµk)^−1 (T ∗Tφj , φk) = δjk. (2.17)


Thus {ξj} ⊆ R(T ) ⊆ N(T ∗)⊥ := W is an orthonormal system. We claim that this system spans all of W . Let w ∈ W and assume that w ⊥ ξj , for all j ≥ 0. Then T ∗w ∈ R(T ∗) ⊆ N(T )⊥ and

(T ∗w, φj) = (w, Tφj) = µj(w, ξj) = 0,

for all j ≥ 0. Since the φj are an ON-basis for N(T )⊥ it follows that T ∗w = 0, that is w ∈ N(T ∗) = W⊥. Thus w ∈ W ∩ W⊥ = {0}. This shows that the orthonormal system {ξj} in N(T ∗)⊥ is complete.
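For matrices, Theorem 2.6.1 is the singular value decomposition: the singular system is obtained from the eigen-decomposition of A = T ∗T . A sketch with an arbitrary illustrative matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.standard_normal((6, 4))

A = T.T @ T                           # compact, selfadjoint, positive
lam, Phi = np.linalg.eigh(A)
lam, Phi = lam[::-1], Phi[:, ::-1]    # decreasing order
mu = np.sqrt(np.clip(lam, 0, None))   # mu_j = sqrt(lambda_j)
Xi = T @ Phi / mu                     # xi_j = T phi_j / mu_j

# The mu_j coincide with the singular values returned by the SVD.
print(np.allclose(mu, np.linalg.svd(T, compute_uv=False)))
# The expansion T x = sum_j mu_j (x, phi_j) xi_j reproduces T x.
x = rng.standard_normal(4)
print(np.allclose(T @ x, Xi @ (mu * (Phi.T @ x))))
```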

Remark. If T : X → Y is any bounded linear operator and (φj) an ON-basis for V = N(T )⊥, then the expansion (2.1) is valid and applying T to this expansion yields

Tx = ∑j (x, φj)Tφj .

What makes the expansion (2.13) interesting is the additional information contained in the singular system for T .

Remark 2.6.1 (Adjoint). Recall that T ∗∗ = T . If (µj , φj , ξj)j is a singular system for T then (µj , ξj , φj)j is a singular system for T ∗ and so we have the expansion

T ∗ = ∑j µj(ξj ⊗ φj).

Thus if T is compact then so is the adjoint T ∗.

Remark 2.6.2 (Range). The expansion (2.13) allows us to work with ON-bases just as in the case of a compact selfadjoint operator. As an example we determine the range R(T ), that is, we study the equation

Tx = y. (2.18)

Fix y ∈ Y . If a solution exists, then y ∈ R(T ) ⊆ N(T ∗)⊥. Now assume that y ∈ N(T ∗)⊥. Then we have an expansion y = ∑j (y, ξj)ξj . If there exists any solution x of (2.18) in X, then there exists a solution in V = N(T )⊥ (in fact πV (x) is one). Thus we may assume that x ∈ V and have an expansion

x = ∑j αjφj . (2.19)

Applying T to this yields

∑j αjµjξj = Tx = y = ∑j (y, ξj)ξj .


It follows that we must have αj = (y, ξj)/µj . With this the series for x converges exactly if ∑j µj^−2 |(y, ξj)|2 < ∞. Consequently

R(T ) = { y ∈ N(T ∗)⊥ : ∑j µj^−2 |(y, ξj)|2 < ∞ } (2.20)

exactly as in the selfadjoint case.

2.7 Hilbert-Schmidt and trace class operators

Let X, Y be Hilbert spaces, T ∈ K(X,Y ) compact and (µj , φj , ξj)j a singular system for T . We know from Theorem 2.6.1 that T is the limit in operator norm of finite rank operators. Now we quantify the speed of convergence.

Approximation numbers. Set

Tn = ∑j<n µj(φj ⊗ ξj).

We have seen that then

‖T − Tn‖ ≤ µn. (2.21)

On the other hand we show now that

‖T − S‖ ≥ µn, (2.22)

for each finite rank operator S ∈ F (X,Y ) with rank(S) ≤ n. Set Xn = span(φ0, . . . , φn) and note that

‖Tx‖ ≥ µn ‖x‖ , for all x ∈ Xn. (2.23)

Let x ∈ Xn. Then x = ∑j≤n (x, φj)φj and so Tx = ∑j≤n µj(x, φj)ξj . It follows that

‖Tx‖2 = ∑j≤n µj^2 |(x, φj)|2 ≥ µn^2 ∑j≤n |(x, φj)|2 = µn^2 ‖x‖2 .

Now let S ∈ F (X,Y ) with dim(R(S)) ≤ n. Then S is not one to one on Xn and so there exists a unit vector u ∈ Xn with Su = 0. Using (2.23) we have

‖(T − S)u‖ = ‖Tu‖ ≥ µn.

Thus ‖T − S‖ ≥ µn. The quantities

an(T ) := inf { ‖T − S‖ : S ∈ F (X,Y ), rank(S) ≤ n }, n ≥ 0, (2.24)


are called the approximation numbers of T . Here a0(T ) = ‖T‖. The estimates (2.21) and (2.22) show that

µn = an(T ) (2.25)

and that the operator S = Tn provides the best approximation of T in the operator norm among all operators of rank at most n. In particular this shows that the numbers µn in a singular system for T are uniquely determined by T and do not depend on the singular system.
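In matrix terms (2.25) is the Eckart-Young theorem: the truncated expansion Tn is a best approximation of rank at most n and the error in operator norm is µn. A quick numerical check under illustrative choices of the matrix and of n:

```python
import numpy as np

rng = np.random.default_rng(3)
T = rng.standard_normal((8, 5))
U, mu, Vt = np.linalg.svd(T, full_matrices=False)

n = 2
T_n = U[:, :n] @ np.diag(mu[:n]) @ Vt[:n, :]     # sum_{j<n} mu_j (phi_j ⊗ xi_j)
# The operator norm of the error equals mu_n.
print(np.allclose(np.linalg.norm(T - T_n, 2), mu[n]))
```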

The µn are called the singular values of T . Obviously the vectors φn and ξn in a singular system for T are not uniquely determined. Consider the selfadjoint case and note that there are many ways to extract an orthonormal basis from each eigenspace of T .

The approximation numbers an(T ) are defined for each bounded linear operator T ∈ B(X,Y ). T is compact if and only if an(T ) → 0, as n ↑ ∞ and this is the only case of interest. In this case we have an(T ) = µn, where the µj are the singular values of T (square roots of the eigenvalues of T ∗T ). For each bounded linear operator T ∈ B(X,Y ) let

‖T‖p = (∑n an(T )^p)^1/p

and let

Sp(X,Y ) = { T ∈ B(X,Y ) : ‖T‖p < ∞ }.

Clearly each T ∈ Sp(X,Y ) is compact. One can show that Sp(X,Y ) is a subspace of B(X,Y ) which is complete under the norm ‖·‖p but we won’t need this result. We are only interested in the cases p = 1, 2.

We now assume that T ∈ K(X,Y ) is compact and (µj , φj , ξj)j a singular system for T .

Hilbert-Schmidt operators. The operator T is called a Hilbert-Schmidt operator, if T ∈ S2(X,Y ), that is,

‖T‖2^2 := ∑n an(T )^2 = ∑n µn^2 < ∞.

Proposition 2.7.1. If T ∈ K(X,Y ) is compact and {eα} is any ON-basis for X, then

‖T‖2^2 = ∑α ‖Teα‖2 .

Remark. It follows that T is a Hilbert-Schmidt operator if and only if ∑α ‖Teα‖2 < ∞, for some ON-basis {eα} of X and in this case the sum is independent of the choice of the basis {eα}.


We do not assume that X is separable, that is that the basis {eα} is countable. However since T and hence T ∗ are compact the entire action is essentially separable: both N(T )⊥ and N(T ∗)⊥ (the closure of R(T )) have countable ON-bases.

Proof. Let {eα} be any ON-basis for X. Since {ξk} is an ON-basis for N(T ∗)⊥, the closure of R(T ), we have

‖Teα‖2 = ∑k |(Teα, ξk)|2 = ∑k |(eα, T ∗ξk)|2 = ∑k µk^2 |(eα, φk)|2,

for each α. It follows that

∑α ‖Teα‖2 = ∑α (∑k µk^2 |(eα, φk)|2) = ∑k µk^2 (∑α |(eα, φk)|2) = ∑k µk^2 ‖φk‖2 = ∑k µk^2 = ‖T‖2^2 .
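A finite dimensional check of Proposition 2.7.1: the sum ∑α ‖Teα‖2 is the same for any ON-basis and equals the sum of the squared singular values. The matrix and the second basis are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
T = rng.standard_normal((7, 5))
mu = np.linalg.svd(T, compute_uv=False)

E1 = np.eye(5)                                       # standard ON-basis of R^5
E2, _ = np.linalg.qr(rng.standard_normal((5, 5)))    # another ON-basis
hs1 = sum(np.linalg.norm(T @ E1[:, k]) ** 2 for k in range(5))
hs2 = sum(np.linalg.norm(T @ E2[:, k]) ** 2 for k in range(5))
print(np.allclose(hs1, hs2), np.allclose(hs1, np.sum(mu ** 2)))
```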

Hilbert-Schmidt operators on the space X = L2(ν) of square integrable functions with respect to a finite measure ν will be characterized in terms of integration kernels below.

Trace class operators. We now assume that X and Y have the same orthogonal dimension, that is, ON-bases {eα} of X and {fα} of Y can be indexed with the same indices α. Because of the compactness of T we can even assume both spaces to be separable. The operator T is called a trace class operator, if T ∈ S1(X,Y ), that is,

‖T‖1 = ∑n an(T ) < ∞.

Recall that (µj , φj , ξj)j denotes a singular system for T . It follows that ‖T‖1 = ∑n µn.

Proposition 2.7.2. Let T ∈ K(X,Y ). Then

‖T‖1 = max ∑α |(Teα, fα)|, (2.26)

where the maximum is taken over all ON-bases {eα} of X and {fα} of Y .

Proof. Let {eα} and {fα} be ON-bases of X and Y and write Teα = ∑j µj(eα, φj)ξj . It follows that

∑α |(Teα, fα)| ≤ ∑α ∑j µj |(eα, φj)| |(ξj , fα)|
= ∑j µj ∑α |(eα, φj)| |(ξj , fα)|
≤ ∑j µj (∑α |(eα, φj)|2)^1/2 (∑α |(ξj , fα)|2)^1/2
≤ ∑j µj ‖φj‖ ‖ξj‖ = ∑k µk = ‖T‖1 .


On the other hand if we enlarge the bases {φj} of N(T )⊥ ⊆ X and {ξj} of R(T ) ⊆ Y to ON-bases {eα} of X and {fα} of Y , then T vanishes on all eα ∉ {φj} and the above sum becomes

∑α |(Teα, fα)| = ∑j |(Tφj , ξj)| = ∑j µj = ‖T‖1 .

Thus T is a trace class operator if and only if the sum (2.26) is finite for all ON-bases {eα} of X and {fα} of Y .

Proposition 2.7.3. Let T ∈ K(X,Y ). Then

‖T‖1 = min ∑n ‖xn‖ ‖yn‖ , (2.27)

where the minimum is taken over all sequences (xn) ⊆ X and (yn) ⊆ Y such that T = ∑n xn ⊗ yn.

Proof. Assume that T = ∑n xn ⊗ yn. Let {eα} and {fα} be ON-bases of X and Y and write Teα = ∑n (eα, xn)yn. With this

∑α |(Teα, fα)| ≤ ∑α ∑n |(eα, xn)| |(yn, fα)|
≤ ∑n (∑α |(eα, xn)|2)^1/2 (∑α |(yn, fα)|2)^1/2
= ∑n ‖xn‖ ‖yn‖ .

Taking the sup over all such bases {eα} and {fα} yields

‖T‖1 ≤ ∑n ‖xn‖ ‖yn‖ .

Conversely, if we set xn = µnφn and yn = ξn, then T = ∑n xn ⊗ yn and

∑n ‖xn‖ ‖yn‖ = ∑n µn = ‖T‖1 .

Thus T is a trace class operator if and only if T has the form T = ∑n xn ⊗ yn with ∑n ‖xn‖ ‖yn‖ < ∞.

Trace. Assume now that X = Y and T ∈ K(X) is a trace class operator. Then we define the trace of T as

tr(T ) = ∑α (Teα, eα), (2.28)


where {eα} is an ON-basis for X. The series converges absolutely but we have to show that the sum does not depend on the choice of the basis {eα}. Fix a representation of T as

T = ∑n xn ⊗ yn with ∑n ‖xn‖ ‖yn‖ < ∞ (2.29)

and let {eα} be an ON-basis for X. For n ≥ 0 write xn = ∑α (xn, eα)eα and yn = ∑β (yn, eβ)eβ . Entering this into the inner product (xn, yn) we obtain

(xn, yn) = ∑α,β (xn, eα)(yn, eβ)(eα, eβ) = ∑α (xn, eα)(yn, eα).

Now, for each α, write Teα = ∑n (eα, xn)yn. With this we have

∑α (Teα, eα) = ∑α ∑n (eα, xn)(yn, eα).

Using Cauchy-Schwarz on sums along α we see that the double series on the right is absolutely convergent. We can thus rearrange it to obtain

∑α (Teα, eα) = ∑n ∑α (eα, xn)(yn, eα) = ∑n (xn, yn).

This shows that the value of the sum ∑α (Teα, eα) does not depend on the choice of basis {eα}. Thus the trace tr(T ) is well defined. But note that the representation of T as (2.29) was also arbitrary. Thus

Proposition 2.7.4. Let T = ∑n xn ⊗ yn with ∑n ‖xn‖ ‖yn‖ < ∞. Then

tr(T ) = ∑n (xn, yn).
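A finite dimensional sketch of Proposition 2.7.4, with the rank one operator x ⊗ y acting as z ↦ (z, x)y and arbitrary illustrative vectors:

```python
import numpy as np

rng = np.random.default_rng(5)
xs = rng.standard_normal((3, 4))          # x_0, x_1, x_2 in R^4
ys = rng.standard_normal((3, 4))          # y_0, y_1, y_2 in R^4

# Matrix of T: T z = sum_n (z, x_n) y_n, i.e. T = sum_n outer(y_n, x_n).
T = sum(np.outer(y, x) for x, y in zip(xs, ys))
print(np.allclose(np.trace(T), sum(x @ y for x, y in zip(xs, ys))))
```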

We will see concrete examples of trace class operators and compute their trace in the treatment of kernel reproducing Hilbert spaces.


2.8 Inverse problems and regularization

Let X, Y be Hilbert spaces, T ∈ B(X,Y ) be a bounded linear operator, and w ∈ R(T ). We will study the equation

Tv = w (2.30)

to be solved for v ∈ X (inverse problem). Imagine that (2.30) arises from the theoretical study of some physical system and all ingredients are known with perfect precision. We call this problem the clean inverse problem. By assumption this problem has a solution.

Unfortunately we do not know the “true data” w. We have only a “polluted” version y of w and must instead solve the practical problem

Tx = y (2.31)

where ‖y − w‖ is small. We do not know whether y ∈ R(T ), that is, the equation (2.31) may not have a solution, nonetheless it is all that we have to work with. We are interested in the solution v of (2.30) and not any solution x of (2.31).

We reason as follows: each solution v of (2.30) is an approximate solution of (2.31), that is, ‖Tv − y‖ = ‖w − y‖ is small. Therefore let us seek x ∈ X such that ‖Tx − y‖ is small. Then

‖Tx − Tv‖ = ‖Tx − w‖ ≤ ‖Tx − y‖ + ‖y − w‖

will be small and from this we hope to be able to conclude that ‖x − v‖ is small, that is, x is a good approximation of the solution v of the clean problem (2.30). We are thus led to replace (2.31) with the minimization problem

minx∈X ‖Tx − y‖ (2.32)

Obviously each solution of (2.31) will be a minimizer of (2.32) but such solutions need not exist. Recall that our approach is based on the following reasoning

‖Tx − Tv‖ is small ⇒ ‖x − v‖ is small.

Since “small” is somewhat vague we might want to require more strongly that ‖x − v‖ → 0 whenever ‖Tx − Tv‖ → 0. This implies that N(T ) = {0} (thus T is invertible on R(T )) and the inverse S = T−1 : R(T ) → X is continuous. In this case the problem (2.30) is called well posed, otherwise it is called ill posed. In short the inverse problem (2.30) is ill posed if


(i) the solution v is not unique, or
(ii) it is unique but is not a continuous function of w ∈ R(T ).

The usual definition of well posedness requires that the solution v of (2.30) exist for all w ∈ Y , be uniquely determined and be a continuous function of w. The continuity requirement is then superfluous. It follows from the first two (N(T ) = {0} and R(T ) = Y ) by the Open Mapping Theorem.

This fact is usually ignored and creates an awkward situation as the continuous dependence of the solution on the right hand side is the very essence of well posedness.

We are not following this lead here since it is unreasonable to require the existence of a solution of (2.30) for each right hand side w ∈ Y . In practice we are only dealing with one particular right hand side w ∈ Y and the natural assumption is that (2.30) have a solution for the given right hand side w, that is, w ∈ R(T ). It does not make much sense to seek the solution of a problem which does not have a solution by virtue of the theory underlying the problem.

Condition (ii) is the crucial one. The uniqueness of the solution can always be enforced if we replace X with N(T )⊥. Unfortunately in many cases of practical interest the operator T is compact and the space Y infinite dimensional.

In this case (ii) is guaranteed to fail. We now devise a strategy to cope with the ill posedness of (2.30) rephrased as the minimization problem

minx∈X ‖Tx − y‖ . (2.33)

Note that we do not assume that T is compact; it is a general bounded linear operator T ∈ B(X,Y ).

The minimization problem (2.33) does not always have a solution. This merely means that no minimum is attained. We can still get arbitrarily close to the infimum and thus hope to find x such that ‖Tx − y‖ is small.

However, in this case, it is not clear how to find such x. On the other hand if minimizers do exist we can hope that they have special properties which make them easy to find. Indeed, we will see below that they can be obtained as a solution of the so called normal equation.

Let W = N(T ∗)⊥, the closure of the range R(T ). Then the orthogonal projection b = πW (y) is the unique element u ∈ W which minimizes the distance ‖y − u‖. Note that T ∗b = T ∗y.

Since b can be approximated arbitrarily closely with elements in the range of T , an element x ∈ X minimizes (2.33) if and only if Tx = b and since the operator T ∗ is one to one on W (2.2.1) this is equivalent with

T ∗Tx = T ∗b = T ∗y. (2.34)

This equation is called the normal equation associated with the minimization problem (2.33). It has a solution exactly if b ∈ R(T ).

This condition is automatically satisfied if T has closed range. In this case the minimizers of (2.33) are exactly the solutions of (2.34) plus arbitrary vectors in N(T ∗T ) = N(T ). The unique solution in N(T )⊥ is the solution with minimal norm.
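Numerically, the minimal norm minimizer of ‖Tx − y‖ can be computed with the pseudoinverse and then checked against the normal equation (2.34). The matrix below is an illustrative choice, made rank deficient on purpose so that the nullspace is nontrivial.

```python
import numpy as np

rng = np.random.default_rng(6)
T = rng.standard_normal((6, 4))
T[:, 3] = T[:, 0] + T[:, 1]                # force a nontrivial nullspace
y = rng.standard_normal(6)

x_min = np.linalg.pinv(T) @ y              # minimal norm least squares solution
# x_min satisfies the normal equation T*T x = T*y.
print(np.allclose(T.T @ T @ x_min, T.T @ y))
# Adding any vector of N(T) to x_min gives another minimizer of larger norm.
```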

Example 2.8.1 (Polynomial least squares interpolation). If X and Y are finite dimensional then R(T ) is closed and thus b ∈ R(T ) is automatically satisfied, that is, the normal equations do have a solution.

Consider the following example: let n ≥ 1 and assume we are given n pairs of points (x1, y1), . . . , (xn, yn) ∈ R2. We want to find a polynomial Q of fixed degree k which minimizes the squared error

∑j=1..n |Q(xj) − yj |2 = ‖Qx − y‖2

where Qx = (Q(x1), . . . , Q(xn)) and y = (y1, . . . , yn) are vectors in Rn. The error is computed in the Euclidean norm of Rn and with this norm Rn is a Hilbert space. Write

Q(x) = a0 + a1x + · · · + ak x^k

and identify the polynomial Q with the vector a = (a0, . . . , ak) ∈ Rk+1 of its coefficients. With this identification

Qx = Ta,

where T : Rk+1 → Rn is the linear operator given by the matrix

T =
( 1  x1  x1^2  . . .  x1^k )
( 1  x2  x2^2  . . .  x2^k )
( . . .                    )
( 1  xn  xn^2  . . .  xn^k )

and the normal equations T ∗Ta = T ∗y can be solved for the coefficients a of Q. In the special case of linear least squares interpolation (k = 1) we have

T ∗T =
( 1   1   . . .  1  )   ( 1  x1 )
( x1  x2  . . .  xn )   ( 1  x2 )
                        ( . . . )
                        ( 1  xn )
=
(  n      ∑xk   )
(  ∑xk    ∑xk^2 )


and the normal equations assume the form

a0 n + a1 ∑ xk = ∑ yk
a0 ∑ xk + a1 ∑ xk² = ∑ xk yk

Divide by n and write Ex = n⁻¹ ∑ xk, Ey = n⁻¹ ∑ yk, Exx = n⁻¹ ∑ xk² and Exy = n⁻¹ ∑ xk yk ("E" for expected value) to obtain

a0 + a1 Ex = Ey
a0 Ex + a1 Exx = Exy

with solution

a0 = (Ex·Exy − Ey·Exx) / ((Ex)² − Exx)   and   a1 = (Ex·Ey − Exy) / ((Ex)² − Exx).
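For concreteness, here is a minimal numerical sketch of this example, assuming NumPy is available; the data points, the degree k and the noise level are illustrative choices and not part of the text.

```python
import numpy as np

# Sketch of Example 2.8.1: polynomial least squares via the normal equations.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)                              # sample points x_1, ..., x_n
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.size)      # noisy observations y_j
k = 1                                                      # degree of the fitted polynomial

# Vandermonde matrix T with rows (1, x_j, x_j^2, ..., x_j^k).
T = np.vander(x, N=k + 1, increasing=True)

# Normal equations T^T T a = T^T y solved for the coefficient vector a.
a = np.linalg.solve(T.T @ T, T.T @ y)

# For k = 1 this agrees with the closed form a_0, a_1 derived above.
Ex, Ey = x.mean(), y.mean()
Exx, Exy = (x**2).mean(), (x * y).mean()
a0 = (Ex * Exy - Ey * Exx) / (Ex**2 - Exx)
a1 = (Ex * Ey - Exy) / (Ex**2 - Exx)
print(a, (a0, a1))
```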

2.8.1 Regularization

In general the normal equation (2.34) suffers from the same drawbacks as the inverse problem (2.30). If the operator T is compact then so is T∗T and hence the normal equation (2.34) is an ill posed inverse problem.

Now let α > 0 be a small positive number. Then the operator αI + T∗T is invertible on X with bounded inverse (2.3.5). If we replace the normal equation (2.34) with

(αI + T ∗T )x = T ∗y (2.35)

we have a well posed problem with solution x = (αI + T∗T)⁻¹T∗y. The following proposition shows that x is the solution to a modified minimization problem:

Proposition 2.8.1. The solution x = (αI + T∗T)⁻¹T∗y to the regularized normal equations (2.35) is the unique minimizer of

min_{x∈X} ( ‖Tx − y‖² + α ‖x‖² ).   (2.36)

Proof. Let F := X×Y be the product space endowed with the inner product

(a⊕ b, u⊕ v)F := α(a, u)X + (b, v)Y , a, u ∈ X, b, v ∈ Y

with associated norm

‖x⊕ y‖2 := ‖y‖2 + α ‖x‖2 , x ∈ X, y ∈ Y.


Now consider the bounded linear operator S : X → F defined as

Sx = x⊕ Tx, x ∈ X.

Then ‖Sx − 0⊕y‖²F = ‖Tx − y‖²Y + α ‖x‖²X. With this the minimization problem (2.36) has been rewritten as the ordinary minimization problem

min_{x∈X} ‖Sx − 0⊕y‖²F   (2.37)

Note that ‖Sx‖² ≥ α ‖x‖² and hence S has closed range and is one to one. It follows that (2.37) has a unique solution and it is the unique solution x of the normal equation

S∗Sx = S∗(0⊕ y) (2.38)

for the operator S. Let us compute the adjoint operator S∗ : F → X. For a, u ∈ X and v ∈ Y we have

(Sa, u⊕v)F = (a⊕Ta, u⊕v)F = α(a, u) + (Ta, v) = α(a, u) + (a, T∗v) = (a, αu + T∗v)

from which it follows that

S∗(u⊕ v) = αu+ T ∗v, u ∈ X, v ∈ Y. (2.39)

With this S∗Sx = S∗(x⊕Tx) = αx + T∗Tx = (αI + T∗T)x and the normal equation for S becomes

(αI + T ∗T )x = S∗(0⊕ y) = T ∗y

with unique solution x = (αI + T∗T)⁻¹T∗y which is the unique minimizer of (2.37) and hence of (2.36).
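A minimal numerical sketch of the regularized normal equations (2.35), assuming NumPy; the ill-conditioned matrix T below is an illustrative stand-in for a compact operator and is not taken from the text.

```python
import numpy as np

# Sketch of Tikhonov regularization: solve (alpha I + T*T) x = T* y, cf. (2.35).
rng = np.random.default_rng(1)
n, m = 50, 30
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
s = 2.0 ** -np.arange(m)                         # rapidly decaying singular values
T = U[:, :m] @ np.diag(s) @ V.T                  # ill-conditioned "operator"

x_true = rng.standard_normal(m)
y = T @ x_true + 1e-6 * rng.standard_normal(n)   # slightly noisy right hand side

alpha = 1e-8
# Unique minimizer of ||Tx - y||^2 + alpha ||x||^2, cf. (2.36).
x_reg = np.linalg.solve(alpha * np.eye(m) + T.T @ T, T.T @ y)

# The unregularized least squares solution amplifies the noise through tiny singular values.
x_naive = np.linalg.lstsq(T, y, rcond=None)[0]
print(np.linalg.norm(x_reg - x_true), np.linalg.norm(x_naive - x_true))
```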

2.9 Kernels and integral operators

Let now F ⊆ Rn be a closed subset and dx denote Lebesgue measure on F. We do not assume that F is compact. In particular we could have F = Rn. Recall that L2(F) denotes the Hilbert space of all square integrable functions f : F → R with inner product

(f, g) = ∫F f(x)g(x) dx,   f, g ∈ L2(F).


The Hilbert-Schmidt operators on L2(F) are exactly the integral operators with respect to square integrable kernels on F. These notions will be introduced shortly. The restriction to Lebesgue measure on subsets F of Rn is not essential. The results remain true with identical proofs for every finite measure space (Ω, F, P). However we need only the case of Lebesgue measure.

A kernel on F is a function K : F × F → R. We write K = K(x, y), x, y ∈ F. For each x ∈ F define the kernel function Kx : F → R as

Kx(y) := K(x, y), y ∈ F.

The kernel K is called square integrable if K ∈ L2(F × F), that is, K is measurable and

‖K‖₂² := ∫_{F×F} |K(x, y)|² dx dy < ∞.

By Fubini's theorem

∫F ‖Kx‖₂² dx = ‖K‖₂² < ∞

and hence ‖Kx‖₂ < ∞, that is, Kx ∈ L2(F), for almost every x ∈ F. From now on we assume that the kernel K is square integrable. Then K defines a linear operator TK : L2(F) → L2(F) via

(TKf)(x) := ∫F K(x, y)f(y) dy = (f, Kx),   f ∈ L2(F), x ∈ F.   (2.40)

(TKf)(x) is defined only for those x such that Kx ∈ L2(F) and hence for almost every x ∈ F. If the kernel K is bounded and F has finite Lebesgue measure, then Kx is bounded and hence square integrable for every x ∈ F. In this case (TKf)(x) is defined for every x ∈ F (even though f(x) may be defined only almost surely).

Next we represent TKf as a vector valued integral in L2(F). See Appendix A for integrals with values in a Hilbert space.

Proposition 2.9.1. TKf is the L2(F)-valued Bochner integral

TKf = ∫F f(y)Ky dy ∈ L2(F),   f ∈ L2(F).   (2.41)

Proof. Let us show that the integral (2.41) converges in the norm of L2(F). Set

I := ∫F ‖f(y)Ky‖₂ dy = ∫F |f(y)| ‖Ky‖₂ dy


and use the Cauchy-Schwartz inequality to obtain

I² ≤ ∫F |f(y)|² dy ∫F ‖Ky‖₂² dy = ‖f‖₂² ‖K‖₂² < ∞.   (2.42)

To verify the equality (2.41) it remains to be shown that

(TKf, g) = ∫F f(y)(Ky, g) dy,   (2.43)

for all g ∈ L2(F). Writing out both sides as integrals this becomes

∫F ( ∫F K(x, y)f(y) dy ) g(x) dx = ∫F f(y) ( ∫F K(x, y)g(x) dx ) dy

and is satisfied as a consequence of Fubini's theorem.

Remark. Since ‖TKf‖ ≤ I, (2.42) shows that ‖TKf‖ ≤ ‖K‖₂ ‖f‖. Thus the linear operator TK is bounded with

‖TK‖ ≤ ‖K‖2 .

Our kernels are required to be defined everywhere on F × F. However, if K1 and K2 are square integrable kernels satisfying K1(x, y) = K2(x, y) for almost every (x, y) ∈ F × F, then the kernel functions satisfy (K1)x = (K2)x, almost surely on F, for almost every x ∈ F, and it follows that the associated integral operators satisfy TK1 = TK2.

Consequently, in the study of the integral operators TK we can identify the kernel K with its equivalence class in L2(F × F). With this the space of square integrable kernels on F becomes the Hilbert space L2(F × F).

We have defined kernels to be functions defined everywhere on F × F rather than equivalence classes in L2(F × F) because this is needed for the definition of positive definite kernels below.

The operator TK is called the integral operator with kernel K. It is easily seen that the map

T : K ∈ L2(F × F ) 7→ TK ∈ B(L2(F ))

is linear. Recall that ‖TK‖ ≤ ‖K‖₂. Thus T is a contraction from the space of square integrable kernels on F to the space of bounded linear operators on L2(F).

The adjoint T∗K of the integral operator TK is another integral operator TK∗ associated with the square integrable kernel K∗ defined as

K∗(x, y) = K(y, x), x, y ∈ F,


that is, (TKf, g) = (f, TK∗g), for all f, g ∈ L2(F). This is an easy consequence of Fubini's theorem.

The kernel K is called symmetric if K(x, y) = K(y, x), for all x, y ∈ F. In this case the operator TK is selfadjoint. The converse obviously is not true since the integral operator TK determines the kernel K only almost surely on F × F.

Example 2.9.1 (Truncated kernel). Let K ∈ L2(F × F) and φ be a bounded measurable function on F. Define the new kernel Kφ ∈ L2(F × F) as

Kφ(x, y) = φ(x)K(x, y)φ(y)

and let

Mφ : f ∈ L2(F) ↦ φf ∈ L2(F)

be the multiplication operator. Then the associated integral operator TKφ

satisfies

(TKφf)(x) = φ(x) ∫F K(x, y)φ(y)f(y) dy = φ(x)[TK(φf)](x),

for f ∈ L2(F). In other words we have TKφ = MφTKMφ. This can be used to truncate the kernel K so as to have compact support.

Example 2.9.2 (One dimensional kernel). Let φ, ψ ∈ L2(F) be defined everywhere on F. A kernel K of the form

K(x, y) = φ(x)ψ(y) (2.44)

is called a one dimensional kernel. The kernel functions are given by Kx = φ(x)ψ and consequently (TKf)(x) = (f, Kx) = φ(x)(f, ψ), for all f ∈ L2(F) and x ∈ F. Thus

TKf = (f, ψ)φ, f ∈ L2(F ),

that is, TK is the rank one operator

TK = ψ ⊗ φ.

Example 2.9.3 (Degenerate kernel). The linear combinations

K(x, y) = ∑_{i=1}^N λi φi(x)ψi(y)   (2.45)

of one dimensional kernels are called degenerate kernels. For such K the integral operator TK is the finite rank operator

TK = ∑_{i=1}^N λi (ψi ⊗ φi),


that is, TKf = ∑_{i=1}^N λi (f, ψi)φi, for all f ∈ L2(F). This follows from the one dimensional case above by linearity of the map K ↦ TK.

Theorem 2.9.1. If K ∈ L2(F × F) is any kernel, then the linear operator TK : L2(F) → L2(F) is compact.

Proof. To see that TK is compact it will suffice to show that K is the limit in L2-norm of degenerate kernels (2.9.3). This then implies that TK is the limit in operator norm of finite rank operators and hence compact.

We have to show that the closed linear span of functions φ(x)ψ(y) ∈ L2(F × F) with φ, ψ ∈ L2(F) is all of L2(F × F). For this it suffices to show that K ⊥ φ(x)ψ(y), for all φ, ψ ∈ L2(F), implies that K = 0. Assume that

K ⊥ φ(x)ψ(y) in L2(F × F ), for all φ,ψ ∈ L2(F ). (2.46)

In particular then

∫_{A×B} K(x, y) dx dy = (K, 1A(x)1B(y)) = 0,

for all Borel sets A, B ⊆ F. The finite disjoint unions of measurable rectangles A × B form an algebra of sets which generates the Borel σ-field on F × F. Standard extension theorems of measure theory now imply that ∫B K = 0, for all Borel sets B ⊆ F × F, and hence K = 0.

We are now ready to show that the integral operators TK with square integrable kernels K are exactly the Hilbert-Schmidt operators on H = L2(F):

Theorem 2.9.2. Let H = L2(F) and K ∈ L2(F × F) be a square integrable kernel. Then TK is a Hilbert-Schmidt operator on H with ‖T‖₂ = ‖K‖₂.

Conversely for every Hilbert-Schmidt operator T on H there exists a kernel K ∈ L2(F × F) with T = TK.

Proof. Let {eα} be an orthonormal basis for H. We may assume that each eα is a function defined everywhere on F.

Assume first that T = TK with K ∈ L2(F × F). Then, for each α and x ∈ F, we have (TKeα)(x) = (eα, Kx) and so

‖TKeα‖₂² = ∫F |(eα, Kx)|² dx

Commuting a sum with an integral we obtain

‖T‖₂² = ∑α ‖TKeα‖₂² = ∫F ( ∑α |(eα, Kx)|² ) dx = ∫F ‖Kx‖₂² dx = ‖K‖₂² < ∞.


Thus T is a Hilbert-Schmidt operator on H. Conversely, assume that T is a Hilbert-Schmidt operator on H and set

K(α)(x, y) := (Teα)(x)eα(y),   x, y ∈ F.

Then K(α) ∈ L2(F × F) is a square integrable kernel with TK(α) = eα ⊗ Teα and consequently

TK(α)eβ = Teβ if β = α,   and   TK(α)eβ = 0 if β ≠ α.

Suppose now the series

K = ∑α K(α)   (2.47)

were known to converge in L2(F × F). Then the sum K is a square integrable kernel on F and since the map K ∈ L2(F × F) ↦ TK ∈ B(L2(F)) is a linear contraction, we have

TK = ∑α TK(α)

where the series converges in operator norm and so in particular pointwise on L2(F). Thus, for each index β, we have

TKeβ = ∑α TK(α)eβ = Teβ

from which it follows that TK = T. To see that the series (2.47) converges in L2(F × F) note that the K(α) ∈ L2(F × F) satisfy

(K(α),K(β)) = (Teα, T eβ)(eα, eβ),

that is, they are pairwise orthogonal with norms satisfying

∑α ‖K(α)‖₂² = ∑α ‖Teα‖₂² = ‖T‖₂² < ∞.

This implies the convergence (2.47) in the Hilbert space L2(F × F ).

2.10 Symmetric kernels

We are mainly interested in symmetric kernels K ∈ L2(F × F). For such K the integral operator T = TK on L2(F) is compact and selfadjoint. Let us review the results of the spectral theory of compact selfadjoint operators on a Hilbert space (2.5.1):

The complement N(T)⊥ of the nullspace of T is the orthogonal sum of the eigenspaces Ej(T) corresponding to the nonzero eigenvalues λj of T.


In other words N(T)⊥ has an orthonormal basis consisting of eigenvectors {φj : j ≥ 0} corresponding to the nonzero eigenvalues λj of T. We have

T = ∑j λj (φj ⊗ φj),

where λj is the eigenvalue corresponding to the eigenvector φj and the series converges in operator norm. For f ∈ L2(F) we have

Tf = ∑j λj (f, φj)φj,   (2.48)

with convergence in L2-norm. The nullspace N(T) satisfies N(T) = R(T)⊥ and the closure of R(T) equals N(T)⊥. The range of T is not closed unless T has finite rank. The eigenvectors with respect to the eigenvalue zero (i.e. the nonzero elements of N(T)) can be disregarded since the orthogonal projection πV f of an element f ∈ L2(F) onto the complement V = N(T)⊥ satisfies Tf = TπV f and

πV f = ∑j (f, φj)φj.

The kernel functions Kx are square integrable for almost every x ∈ F and for each such x we have Kx ∈ N(T)⊥. Indeed, if f ∈ N(T) and Kx ∈ L2(F), then (f, Kx) = (Tf)(x) = 0. Moreover for such x

(Kx, φj) = (Tφj)(x) = λjφj(x).

Consequently the expansion of Kx in terms of the ON-basis {φj} of N(T)⊥ assumes the form

Kx = ∑j (Kx, φj)φj = ∑j λj φj(x)φj   (2.49)

with convergence in L2-norm from which it follows that

‖Kx‖₂² = ∑j λj² φj²(x).   (2.50)

Integrating this over F it follows that the eigenvalues λj satisfy

∑j λj² = ∫F ‖Kx‖₂² dx = ‖K‖₂² < ∞.   (2.51)

Purely formally we can evaluate (2.49) at a point y ∈ F to obtain the expansion of the kernel K as

K(x, y) = Kx(y) = ∑j λj φj(x)φj(y).

Of course this reasoning is not valid since L2-convergence in (2.49) does not imply convergence pointwise almost surely. Pointwise convergence of this expansion will be obtained below for continuous, positive definite kernels (Mercer's theorem).
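The following sketch, assuming NumPy, illustrates this spectral picture numerically: on a grid the operator TK is approximated by the matrix A = (K(xi, xj) dx), its eigenpairs play the role of the λj and φj, and the discrete analogues of (2.49) and (2.51) hold exactly for the matrix. The kernel min(x, y) is an illustrative choice.

```python
import numpy as np

# Spectral expansion of a discretized symmetric kernel on F = [0, 1].
grid = np.linspace(0.0, 1.0, 300)
dx = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid, indexing='ij')
K = np.minimum(X, Y)                         # symmetric, square integrable kernel

A = K * dx                                   # discrete stand-in for T_K
evals, evecs = np.linalg.eigh(A)             # eigenvalues and (Euclidean-orthonormal) eigenvectors

# sum_j lambda_j^2 equals the squared Frobenius-type norm of the kernel, cf. (2.51).
print(np.sum(evals**2), np.sum(K**2) * dx * dx)

# Rescaling the eigenvectors to unit L2 norm, the expansion
# sum_j lambda_j phi_j(x) phi_j(y) reconstructs K(x, y), cf. (2.49).
phis = evecs / np.sqrt(dx)
K_reconstructed = (phis * evals) @ phis.T
print(np.max(np.abs(K - K_reconstructed)))   # zero up to rounding
```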


2.11 L2-Bounded Kernels

The kernel K is called L2-bounded if the family of kernel functions Kx ∈ L2(F) is bounded, that is, there exists C < ∞ such that

‖Kx‖2 ≤ C, for all x ∈ F.

Assume now that K is L2-bounded. We will see that then the eigenvector expansion (2.48) converges (pointwise) uniformly on F. Note that it does not make sense in general to evaluate an element f ∈ L2(F) at all points x ∈ F since f is an equivalence class of functions defined almost surely with no distinguished representative defined everywhere.

By contrast the element TKf is defined everywhere on F via the formula (2.40). This applies to all eigenvectors φj corresponding to nonzero eigenvalues λj since these are in the range of TK. It thus makes sense to investigate pointwise everywhere convergence of the eigenvector expansion (2.48).

Let f ∈ L2(F) and note that we have |(TKf)(x)| = |(f, Kx)| ≤ C ‖f‖₂, for all x ∈ F. Taking the sup over all x ∈ F we obtain

‖TKf‖∞ ≤ C ‖f‖2 , f ∈ L2(F ). (2.52)

It follows that

‖TKf − ∑_{j≤n} λj(f, φj)φj‖∞ = ‖TK( f − ∑_{j≤n} (f, φj)φj )‖∞ ≤ C ‖f − ∑_{j≤n} (f, φj)φj‖₂ → 0,

as n ↑ ∞. Here we let the series extend also over the eigenvectors in the kernel of TK (so that the φj form a complete orthonormal basis of L2(F)); these terms have corresponding eigenvalue zero. This shows that for an L2-bounded kernel K the eigenvector expansion

TKf = ∑j λj (f, φj)φj,   f ∈ L2(F),   (2.53)

converges (pointwise) uniformly on F .


Chapter 3

Reproducing Kernel Hilbert Spaces

Recall that all Hilbert spaces are real Hilbert spaces, that is, the scalar field is the field of real numbers. Note the following consequence of using the reals as scalars: a real matrix which is positive definite need not be symmetric. Consequently we shall have to require symmetry explicitly.

We now turn to the class of Hilbert spaces H useful in regression. Let F be any nonempty set. We are interested in Hilbert spaces H consisting of functions f : F → R and having the property that the evaluation functionals

Ex : f ∈ H ↦ f(x) ∈ R,   x ∈ F,

are continuous. Such a Hilbert space is called a reproducing kernel Hilbert space. The need for the continuity of the evaluation functionals in regression problems is motivated in the introduction.

A reproducing kernel Hilbert space on F defines a symmetric, positive semidefinite kernel K : F × F → R. Let us turn our attention to these kernels.

3.1 Positive semidefinite kernels

A kernel K on F is simply a function K : F × F → R. K is called symmetric if it satisfies K(x, y) = K(y, x), for all x, y ∈ F. K is called positive semidefinite if we have

∑_{i,j=1}^N wi wj K(xi, xj) ≥ 0,   (3.1)


for all wj ∈ R and xj ∈ F, that is, the matrix A = (K(xi, xj))_{i,j=1}^N is positive semidefinite. The kernel K is called positive definite if strict inequality holds in (3.1) whenever the points xj ∈ F are pairwise distinct and the wj are not all zero. In this case the matrix A is invertible.

Although no structure is assumed to be present on F unless explicitly stated otherwise, interesting examples arise only in the presence of additional structure such as measure or topology.

Example 3.1.1 (Simple kernel). If φ : F → R is any function, then K(x, y) := φ(x)φ(y) defines a symmetric positive semidefinite kernel on F. Indeed, if wi ∈ R and xi ∈ F we have

∑i,j wi wj K(xi, xj) = |∑i wi φ(xi)|² ≥ 0.

A special case of this is the constant kernel K(x, y) = c ≥ 0. Let Kn, n ≥ 0, be symmetric positive semidefinite kernels on F and λn ≥ 0. Then K = λ1K1 + λ2K2 is a symmetric positive semidefinite kernel on F. If Kn → K pointwise on F × F, then K is a symmetric positive semidefinite kernel on F. It follows that the sum

K = ∑_{n≥0} λn Kn

is a symmetric positive semidefinite kernel on F whenever the series converges pointwise on F × F. In particular

K(x, y) = ∑_{n≥0} λn φn(x)φn(y)   (3.2)

is a symmetric positive semidefinite kernel on F for all functions φn : F → R such that the above series converges pointwise on F × F. We shall see later that many important kernels can be shown to have this form. If the sum is finite the kernel (3.2) is called degenerate.

Finally, if K is a symmetric positive semidefinite kernel on F and G : F → F any map, then K(x, y) := K(Gx, Gy) is a symmetric positive semidefinite kernel on F.

Example 3.1.2 (Most general positive semidefinite kernel). Let H be a real Hilbert space and Φ : F → H any map. Define the kernel K on F as

K(x, y) = (Φ(x), Φ(y))H,   x, y ∈ F.

Then K is a symmetric positive semidefinite kernel on F. To see this consider finitely many xi ∈ F and wi ∈ R and note that

∑i,j wi wj K(xi, xj) = ‖∑i wi Φ(xi)‖²H ≥ 0.
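A small numerical check of this construction, assuming NumPy; the feature map Φ and the sample points are arbitrary illustrative choices.

```python
import numpy as np

# Sketch of Example 3.1.2: a kernel built from a feature map Phi : F -> H = R^3.
rng = np.random.default_rng(2)
points = rng.uniform(-1.0, 1.0, size=15)                  # finitely many x_i in F

def Phi(x):
    # Illustrative feature map (any map F -> H works).
    return np.array([1.0, x, np.sin(3.0 * x)])

features = np.array([Phi(x) for x in points])             # rows Phi(x_i)
K = features @ features.T                                 # K(x_i, x_j) = (Phi(x_i), Phi(x_j))

print(np.allclose(K, K.T))                                # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)              # positive semidefinite up to rounding

# The quadratic form equals the squared norm of sum_i w_i Phi(x_i), as in the text.
w = rng.standard_normal(points.size)
quad = w @ K @ w
norm_sq = np.sum((w[:, None] * features).sum(axis=0)**2)
print(np.isclose(quad, norm_sq))
```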


We shall see below that this is the general case. Every positive semidefinite kernel K on F can be represented like this with a Hilbert space H of functions on F which has the property that the evaluation functionals are continuous in the norm of H. Such a Hilbert space will be called a reproducing kernel Hilbert space.

Example 3.1.3 (Kernel defined on sets). Let the index set F be the σ-field of events in a probability space (Ω, F, P) and define the kernel K on F as

K(A, B) := P(A ∩ B),   A, B ∈ F.

Then K is symmetric and positive semidefinite: write P(A ∩ B) = E(1A 1B) and consider finitely many Ai ∈ F and wi ∈ R. Then

∑i,j wi wj K(Ai, Aj) = ∑i,j wi wj E[1Ai 1Aj] = E|∑i wi 1Ai|² ≥ 0.

This kernel will be used below to define Gaussian white noise on Rn.

The following inequality is basic in the theory of positive semidefinite kernels

Proposition 3.1.1. Let K be a positive semidefinite kernel on F . Then

|K(x, y)|2 ≤ K(x, x)K(y, y), for all x, y ∈ F.

Proof. The result follows immediately from the positive semidefiniteness of the 2 by 2 matrix C = (K(xi, xj)), where x1 = x and x2 = y.

Exercise 3.1.1 (General kernel inequality). Show the following generalization of Proposition 3.1.1: if x1, . . . , xn, y ∈ F, then

|∑_{i=1}^n ui K(xi, y)|² ≤ K(y, y) ∑_{i,j=1}^n ui uj K(xi, xj).   (3.3)

Hint: set xn+1 = y, let C be the (n+1) by (n+1) matrix Cij = K(xi, xj), λ ∈ R arbitrary and u the vector (u1, . . . , un, λ)′ ∈ Rn+1. Now work from the inequality (Cu, u) = u′Cu ≥ 0.

3.2 Translation invariant kernels

Assume now that the index set is F = Rn. We make use of the (additive) group structure on F. Let T := {z ∈ C : |z| = 1} be the unit circle viewed as a multiplicative group. For x ∈ F define the character (multiplicative functional) χx : F → T as

χx(t) := exp[i(x · t)], t ∈ F = Rn,

where i is the imaginary unit as usual. Note that

χx−y = χx χ̄y,   for all x, y ∈ F.   (3.4)

If Q is a finite Borel measure on F the Fourier transform of Q is the function Q̂ : F → C defined as

Q̂(x) := EQ[χx],   x ∈ F.

From χ−x = χ̄x it follows that Q̂(−x) is the complex conjugate of Q̂(x). A kernel K on F is called translation invariant if it satisfies

K(x+ u, y + u) = K(x, y) (3.5)

for all x, y, u ∈ F . In this case K has the form

K(x, y) = f(x− y), where f(x) = K(x, 0) (3.6)

and conversely every kernel of the form K(x, y) = f(x − y) is translation invariant. Obviously K is symmetric if and only if f(−x) = f(x). We have

Theorem 3.2.1 (Bochner). A continuous translation invariant kernel K on F = Rn is symmetric and positive semidefinite if and only if it has the form

K(x, y) = Q̂(x − y)

for some finite positive Borel measure Q on F .

Proof. We show only that every kernel defined as above is symmetric and positive semidefinite (the trivial direction). Assume that K is as above. Then K(y, x) = Q̂(y − x) is the complex conjugate of Q̂(x − y) = K(x, y) and hence K(y, x) = K(x, y), since K is real valued. Moreover

Q̂(x − y) = EQ[χx χ̄y]

and so for xi ∈ F and wi ∈ R we have

∑i,j wi wj K(xi, xj) = ∑i,j wi wj EQ[χxi χ̄xj] = EQ|∑i wi χxi|² ≥ 0.


Example 3.2.1 (Gaussian kernel). The standard normal distribution Q = N(0, I) on Rn (independent standard normal components) has Fourier transform

Q̂(x) ∝ exp(−‖x‖²/2),

where ∝ denotes equality up to a constant factor as usual. Consequently

K(x, y) := exp(−‖x− y‖2 /2)

defines a symmetric positive definite kernel on F = Rn. This is the standard Gaussian kernel which can be transformed in various ways, see 3.1.1. For example if G : F → F is any map and c, b > 0 then

K(x, y) := cK(Gx, Gy) + b

defines another positive semidefinite kernel on F .
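A small numerical check of the Gaussian kernel and of the transformed kernel cK(Gx, Gy) + b, assuming NumPy; the points and the choices of G, c and b are illustrative.

```python
import numpy as np

# Sketch of Example 3.2.1: the Gaussian kernel on a finite set of points in R^2.
rng = np.random.default_rng(3)
X = rng.standard_normal((25, 2))                             # 25 points in R^2

sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists / 2.0)                                  # K(x, y) = exp(-||x - y||^2 / 2)

print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min())      # symmetric, eigenvalues > 0

# Transformed kernel c * K(Gx, Gy) + b as in the example (illustrative G, c, b).
c, b = 2.0, 0.5
GX = np.tanh(X)
sq_dists_G = np.sum((GX[:, None, :] - GX[None, :, :])**2, axis=-1)
K2 = c * np.exp(-sq_dists_G / 2.0) + b
print(np.linalg.eigvalsh(K2).min())                          # still positive semidefinite
```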

3.3 Reproducing kernel Hilbert spaces

Let F be a nonempty set. A (real) Hilbert space H is called a reproducing kernel Hilbert space on F if the elements of H are real valued functions defined everywhere on F and the evaluation functionals Ex, x ∈ F, defined as

Ex : f ∈ H ↦ f(x) ∈ R

are continuous in the norm of H. Note that L2(µ) is not a reproducing kernel Hilbert space since the elements of this space cannot be identified with functions defined on the underlying set. Rather they are equivalence classes of functions equal µ-almost surely.

The connection to positive definite kernels is as follows: for x ∈ F the evaluation functional Ex is given by the inner product

f(x) = Ex(f) = (f,Φ(x))H, f ∈ H,

for a unique element Φ(x) ∈ H (Riesz representation theorem for continuous linear functionals). This element Φ(x) is a function defined everywhere on F and we can now define the associated kernel K = KH as

KH(x, y) := Φ(x)(y) = (Φ(x),Φ(y))H, x, y ∈ F. (3.7)

Then K is symmetric and positive semidefinite. It is positive definite if and only if the Φ(x) or, equivalently, the evaluation functionals Ex, x ∈ F, are linearly independent in H (see Example 3.1.2). If H does not contain enough functions this may not be the case and examples are easy to construct.

By definition of the kernel K the Φ(x) ∈ H are the kernel functions Kx defined by Kx(y) = K(x, y). This indicates how we can conversely construct a reproducing kernel Hilbert space H satisfying (3.7) from a positive semidefinite kernel K.

Let K be a symmetric and positive semidefinite kernel on F and recall that RF denotes the space of all functions f : F → R. Define the kernel functions Kx : F → R as

Kx(y) := K(x, y), y ∈ F,

let Φ : F → RF be the map Φ : x ∈ F ↦ Kx ∈ RF and set

H0 = span(Φ(F)) ⊆ RF.

In other words, H0 is the space of all finite linear combinations of kernel functions Kx, x ∈ F. On H0 define an inner product (·, ·) as follows: for finite linear combinations f = ∑i ui Kxi and g = ∑j wj Kyj set

(f, g) := ∑i,j ui wj K(xi, yj).   (3.8)

We must show that this does not depend on the particular representation of f and g as linear combinations of the kernel functions. Consider two such representations of f respectively g. We may assume that these have the form

f = ∑i ui Kxi = ∑i u′i Kxi   and   g = ∑j wj Kyj = ∑j w′j Kyj,

that is, they use the same set of points respectively. This can always be achieved by enlarging the set of points and setting coefficients equal to zero as needed, which has no effect on the sum on the right of (3.8). We have to show that

∑i,j ui wj K(xi, yj) = ∑i,j u′i w′j K(xi, yj).

Indeed we can write

∑i,j ui wj K(xi, yj) = ∑j wj ( ∑i ui Kxi(yj) ) = ∑j wj f(yj)
  = ∑j wj ( ∑i u′i Kxi(yj) ) = ∑i,j u′i wj K(xi, yj)   (switch the representation of f)
  = ∑i u′i ( ∑j wj Kyj(xi) ) = ∑i u′i g(xi)   (symmetry of K)
  = ∑i u′i ( ∑j w′j Kyj(xi) ) = ∑i,j u′i w′j K(xi, yj)   (switch the representation of g)

as desired. Directly from the definition of this bilinear form we have the reproducing property

f(x) = (f,Kx), x ∈ F, f ∈ H0. (3.9)

Since the kernel K is positive semidefinite, (3.8) defines a symmetric, positive semidefinite bilinear form on H0. In consequence we have the Cauchy-Schwartz inequality

|(f, g)| ≤ ‖f‖K ‖g‖K,   f, g ∈ H0,

where ‖f‖K = √(f, f).

This follows from the inequality (f + λg, f + λg) ≥ 0, for all λ ∈ R. Consequently ‖ · ‖K is a seminorm on H0 which satisfies

|f(x)| ≤ ‖f‖K ‖Kx‖K

by the reproducing property. Thus ‖f‖K = 0 implies that f(x) = 0, for all x ∈ F, that is, f = 0. It follows that ‖ · ‖K is a norm on H0.
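The inner product (3.8) and the reproducing property (3.9) are easy to compute with for finite linear combinations of kernel functions. The following sketch assumes NumPy and uses the Gaussian kernel as an illustrative choice of K; the points and coefficients are arbitrary.

```python
import numpy as np

# Sketch of the inner product (3.8) on H_0 and of the bound |f(x)| <= ||f||_K ||K_x||_K.
def K(x, y):
    return np.exp(-(x - y)**2 / 2.0)          # illustrative symmetric PSD kernel

xs = np.array([0.0, 0.5, 1.3]); u = np.array([1.0, -2.0, 0.7])   # f = sum_i u_i K_{x_i}
ys = np.array([0.2, 0.9]);      w = np.array([0.4, 1.5])         # g = sum_j w_j K_{y_j}

# (f, g) = sum_{i,j} u_i w_j K(x_i, y_j), cf. (3.8).
inner_fg = u @ K(xs[:, None], ys[None, :]) @ w

# ||f||_K and the reproducing bound |f(x)| <= ||f||_K * sqrt(K(x, x)).
norm_f = np.sqrt(u @ K(xs[:, None], xs[None, :]) @ u)
x = 0.35
f_at_x = np.sum(u * K(xs, x))                 # f(x) = (f, K_x) by (3.9)
print(inner_fg)
print(abs(f_at_x) <= norm_f * np.sqrt(K(x, x)) + 1e-12)
```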

Let H be the completion of H0 in its norm ‖ · ‖K. The inner product on H0 extends to the completion H and makes H a Hilbert space. We still have to identify the elements of H with functions on F and show that the reproducing property (3.9) extends to all elements in the completion H.

For f ∈ H define the function f̃ : F → R as f̃(x) = (f, Kx). Then the map J : f ∈ H ↦ f̃ ∈ RF is linear and f̃ = f whenever f ∈ H0 (by 3.9).

It will now suffice to show that J is one to one. For then we can identify the element f ∈ H with the function f̃ on F and the reproducing property (3.9) is true by definition. Assume that f̃ = 0 in RF, that is, (f, Kx) = 0, for all x ∈ F. Then f ⊥ H0 and it follows that f = 0 since H0 is dense in H. Thus ker(J) = {0}.

With this the reproducing property (3.9) holds for all f ∈ H and this shows in particular that the evaluation functionals Ex : f ∈ H ↦ f(x) ∈ R are continuous. In other words, H is a reproducing kernel Hilbert space and the kernel functions Φ(x) = Kx ∈ H0 represent the evaluation functionals on H. It follows that

K(x, y) = (Φ(x),Φ(y)) = KH(x, y), (3.10)

for all x, y ∈ F. The Hilbert space H is called the reproducing kernel Hilbert space associated with the kernel K and denoted by HK. It will be seen below that HK is the smallest Hilbert space satisfying (3.10) and is uniquely determined up to isometry by the reproducing property.

Theorem 3.3.1 (RKHS). A kernel K = K(x, y) on F is symmetric and positive semidefinite if and only if it has the form

K(x, y) = (Ψ(x),Ψ(y))H , x, y ∈ F, (3.11)

for some Hilbert space H and map Ψ : F → H. In this case the closed linear span of Ψ(F) is isomorphic to the Hilbert space HK above.

Proof. We only have to show that the closed span of Ψ(F) is isomorphic to HK. Let Φ : x ∈ F ↦ Kx ∈ HK be as above and recall that span(Φ(F)) is dense in HK. Then

‖∑j wj Φ(xj)‖² = ∑i,j wi wj K(xi, xj) = ‖∑j wj Ψ(xj)‖²

and this shows that the map

U : ∑j wj Φ(xj) ↦ ∑j wj Ψ(xj)

from span(Φ(F)) ⊆ HK to span(Ψ(F)) ⊆ H is well defined and is a linear isometry. Consequently U extends to an isometry from HK onto the closed span of Ψ(F) such that UΦ = Ψ.

Proposition 3.3.1. The reproducing kernel Hilbert space with kernel K is uniquely determined.


Proof. Let H be a Hilbert space of functions on F and Ψ : F → H a map with the reproducing property

f(x) = (f,Ψ(x))H , x ∈ F, f ∈ H.

Obviously then f ⊥ Ψ(x), ∀x ∈ F ⇒ f = 0,

that is, span(Ψ(F )) is dense in H and the uniqueness of H follows.

Remark. In full abstract generality the notion of a reproducing kernel doesnot add anything new to the theory of Hilbert spaces. Every Hilbert spaceH is a reproducing kernel Hilbert space on the set F = H with kernelK(x, y) = (x, y)H , x, y ∈ F .

This is merely a reformulation of the well known identification of thedual space of H with itself. It is the additional structure which the kernelsmay introduce which makes the notion of a reproducing kernel Hilbert spaceinteresting.

Remark. In the construction of the reproducing kernel Hilbert space H withgiven kernel K we did not have to assume that K was positive definite toget a norm rather than a seminorm on the span of the kernel functions.Likewise the continuity of the evaluation functionals emerged automaticallyfrom the reproducing property.

Note that both these facts follow from the inequality in Exercise 3.1.1.We did not have to invoke this inequality explicitly since it is simply a specialcase of the Cauchy-Schwartz inequality satisfied by every symmetric, positivesemidefinite bilinear form.

We conclude this section with some examples.

Example 3.3.1 (Finite dimensional RKHS). Let F be a nonempty set and consider a kernel K on F which has the form

K(x, y) = ∑_{j=1}^N λj φj(x)φj(y),   x, y ∈ F,   (3.12)

where the λj are positive numbers and φ1, . . . , φN are linearly independent functions on F. Such a kernel is called degenerate. We will see that the corresponding reproducing kernel Hilbert space HK with kernel K is finite dimensional. To this end set

ψj = √λj φj

and let H = span{ψ1, . . . , ψN}. On H introduce an inner product as follows:

(f, g) = ∑_{j=1}^N αj βj,   where f = ∑_{j=1}^N αj ψj,   g = ∑_{j=1}^N βj ψj.   (3.13)

The linear independence of the φj guarantees that the representation of f and g as linear combinations of the ψj is unique and hence the inner product well defined. Note that (3.13) is the unique inner product on H in which the ψj are orthonormal and hence an orthonormal basis for H. Note now that the kernel functions Kx, x ∈ F, satisfy

Kx = ∑_{j=1}^N λj φj(x)φj = ∑j ψj(x)ψj ∈ H

with inner product

(Kx, Ky) = ∑j ψj(x)ψj(y) = K(x, y).   (3.14)

Each ψi has the expansion ψi = ∑j βj ψj with βi = 1 and βj = 0 for j ≠ i. Combining this with the expansion for Kx we see that

(Kx, ψi) = ψi(x)

and from this it follows that

(Kx, f) = f(x), f ∈ H, x ∈ F, (3.15)

since both sides of the equation are linear in f . But this shows that H is areproducing kernel Hilbert space with kernel L(x, y) given by

L(x, y) = (Kx,Ky) = K(x, y).

Thus we have H = HK by the uniqueness of the reproducing kernel Hilbertspace HK with kernel K.

We can also verify this equality without appeal to the uniqueness of HK .Let H0 be the linear span of the kernel functions Kx, x ∈ F , and (·, ·)Kdenote the inner product in the reproducing kernel Hilbert space HK . ThenH0 ⊆ H and

(Kx,Ky) = (Kx,Ky)K , x, y ∈ F,

according to (3.14). Using the bilinearity of inner products this implies

(f, g) = (f, g)K , f, g ∈ H0.


In other words: on the subspace H0 the inner products of H and of the reproducing kernel Hilbert space HK are the same.

Since the finite dimensional Hilbert space H is automatically completewe have HK ⊆ H and the inner product on HK is the same as the innerproduct on H. We claim that HK = H.

It will suffice to show that HK⊥ = {0}. Indeed, if f ∈ HK⊥, then f(x) = (Kx, f) = 0, for all x ∈ F, and hence f = 0. The entire information in this example can be summarized as follows:

If the kernel K(x, y) has the form (3.12) then the functions ψj = √λj φj are an orthonormal basis in the reproducing kernel Hilbert space HK.

We will see in the next section 3.4 that this example completely describes all finite dimensional reproducing kernel Hilbert spaces H. More precisely: if H is a finite dimensional reproducing kernel Hilbert space and {ψj} any orthonormal basis for H, then the kernel K of H has the expansion

K(x, y) = ∑j ψj(x)ψj(y),   x, y ∈ F,

that is, we are in the situation of this example with λj = 1. In section 3.8 we will introduce the class of Mercer kernels for which this analysis carries over to the infinite dimensional case.

Example 3.3.2 (Convolution kernel). Let F = Rd and dx denote Lebesguemeasure on Rd normalized with the factor (2π)−d/2. This normalizationeliminates an otherwise necessary constant factor from the Fourier inversionformula below.

Assume that the kernelK on F is translation invariant, that is, K(x, y) =k(x− y). Then the integral operator TK is the convolution operator

TKf = k ∗ f.

The kernel K is not square integrable unless k = 0 and so TK may not be defined as an operator on L2(F). Let us assume therefore that k is in the space S of rapidly decreasing functions on Rd. In this case the convolution operator TK is a bounded linear operator on L2(F).

Recall that the Fourier transform F : L2(F ) → L2(F ) is a unitary iso-morphism which maps S onto S. To maintain consistency with the Fouriertransform of a measure in Bochner’s theorem 3.2.1 we define the Fouriertransform as

(Fh)(t) = ∫F χt(x)h(x) dx = ∫F h(x) exp[i(t · x)] dx,


where χt(u) = exp[i(t · u)] for all t, u ∈ F . According to Bochner’s theoremthe kernel K is positive semidefinite if and only if g := Fk ∈ S satisfiesg ≥ 0. To simplify the example assume that g(u) > 0, for all u ∈ F and seth(u) = g(−u). Then

k = Fh

by the Fourier inversion formula. Here the normalization of Lebesgue measure is used (otherwise we have a constant factor in the inversion formula). Recall that F(k ∗ f) = Fk · Ff. Apply F⁻¹ to see that

TK = F−1MgF ,

where Mg is the multiplication operator

f ∈ L2(F ) 7→ gf ∈ L2(F ).

This shows that TK is not compact. Indeed,

‖TKf‖2 ≥ c ‖f‖2

whenever Ff has support contained in the set {g > c}, and that happens on a closed infinite dimensional subspace of L2(F).

Now consider the reproducing kernel Hilbert space H with kernel K. Weclaim that

F(Kx) = χxh, x ∈ F. (3.16)

Indeed, for t ∈ F = Rd we have

(FKx)(t) = ∫F Kx(u) exp[i(t · u)] du = ∫F k(x − u) exp[i(t · u)] du = ∫F k(u) exp[i(t · (x − u))] du = χx(t)(Fk)(−t) = χx(t)h(t).

Let now f = ∑_{j=1}^n wj Kxj. In the proof of Bochner's theorem we have computed the norm ‖f‖K as

‖f‖²K = ∑i,j wi wj K(xi, xj) = ∫F |∑j wj χxj|² h = ∫F |F ∑j wj Kxj|² / h = ∫F |Ff|² / h.

From the above H is the closure of the span of the Kx in the Hilbert space

H̃ = { f ∈ L1(F) : ∫F |Ff|²/h < ∞ }

endowed with the inner product

(f, g) = ∫F (Ff)(Fg)/h.

H̃ is easily seen to be a Hilbert space (i.e. complete). For f ∈ H̃ we have

∫F |Ff(x)|²(1 + ‖x‖²)^N dx < ∞,

for all N ≥ 1. From this we get

∫F |Ff(x)|(1 + ‖x‖²)^N dx < ∞,

for all N ≥ 1, by the Cauchy-Schwartz inequality. This shows that H̃ ⊆ S. From the Fourier Inversion Theorem we conclude that the evaluation functionals are still continuous on H̃. The uniqueness of the reproducing kernel Hilbert space now implies that H = H̃. See [MM02] for a more thorough investigation.

3.4 Bilinear kernel expansion

Let K be a symmetric, positive semidefinite kernel on the set F, H = HK the reproducing kernel Hilbert space with kernel K and {ψj : j ≥ 1} an orthonormal basis for H.

Since the evaluation functionals are continuous on H convergence in thenorm of H implies pointwise convergence at each point x ∈ F . Now let x ∈ Fand expand the kernel functionKx in terms of the basis ψj. Because of thereproducing property the coefficients (Kx, ψj) are exactly the values ψj(x)and the expansion reads

Kx = ∑j ψj(x)ψj.

Since this converges in the norm of H, it converges at each point y ∈ F andevaluation at y yields the bilinear kernel expansion

K(x, y) = ∑j ψj(x)ψj(y).   (3.17)

convergent pointwise at each point (x, y) ∈ F × F. We will now place increasingly restrictive conditions on the kernel K and the set F and obtain increasingly stronger forms of convergence. The kernel K is called bounded if

sup_{x∈F} K(x, x) < ∞.

Using the basic inequality |K(x, y)|2 ≤ K(x, x)K(y, y) this implies that

‖K‖∞ = sup_{x,y∈F} |K(x, y)| < ∞,

that is, K is bounded on F × F in the usual sense. Recall that the kernel functions Kx ∈ H satisfy ‖Kx‖² = K(x, x) and consequently K is bounded if and only if the set {Kx : x ∈ F} of kernel functions is bounded in H.

Proposition 3.4.1. If K is bounded then convergence in the norm of H implies pointwise convergence uniformly on F and the bilinear kernel expansion (3.17) converges uniformly in y for each fixed x ∈ F and conversely.

Proof. Let f, g ∈ H and x ∈ F . Then

|f(x) − g(x)| = |(f − g, Kx)| ≤ ‖f − g‖ ‖Kx‖ ≤ √‖K‖∞ ‖f − g‖ .

Thus ‖f − g‖ → 0 implies that f(x) → g(x) uniformly on F. The claim about the expansion (3.17) follows from this.

Assume now that the set F has a topology and that K is bounded and continuous at each point (x, x) of the diagonal in F × F. Then

Proposition 3.4.2. Each function f ∈ H is continuous on F and the ex-pansion (3.17) converges uniformly on compact subsets of F × F .

Proof. Let x, y ∈ F. Then ‖Kx − Ky‖² = K(x, x) − 2K(x, y) + K(y, y) and so, for f ∈ H,

|f(x)− f(y)|2 = |(f,Kx −Ky)|2 ≤ ‖f‖2 [K(x, x)− 2K(x, y) +K(y, y)] .

Since K is continuous along the diagonal the right hand side converges tozero as x→ y. This shows that f is continuous. We already know that theexpansion (3.17) converges pointwise on F × F . In particular we have

K(x, x) = ∑j ψj²(x).

The terms of the series are nonnegative and continuous and so the partial sums

Sn(x) = ∑_{j=1}^n ψj²(x)


converge to the continuous limit K(x, x) in an increasing manner. Dini's theorem below shows that this convergence is uniform on compact subsets of F. In other words, the remainders

Rn(x) = ∑_{j>n} ψj²(x)

converge to zero uniformly on compact subsets of F . Now set

Kn(x, y) = K(x, y) − ∑_{j≤n} ψj(x)ψj(y) = ∑_{j>n} ψj(x)ψj(y).

Then Kn is a symmetric and positive semidefinite kernel on F. Note that Kn(x, x) = Rn(x). The basic kernel inequality 3.1.1 now implies that

|K(x, y) − ∑_{j≤n} ψj(x)ψj(y)|² = |Kn(x, y)|² ≤ Rn(x)Rn(y).

Thus, as n ↑ ∞, the left hand side converges to zero uniformly on compactsubsets of F × F .

Theorem 3.4.1 (Dini). Let X be any compact space and Sn, S continuous functions on X. If Sn ↑ S pointwise on X then the convergence is uniform on X.

Remark. The notation Sn ↑ S means that the sequence (Sn) is nondecreasingand S(x) = limn Sn(x), for each point x ∈ X.

Proof. Set Rn = S − Sn. Then Rn ≥ 0 is continuous and Rn ↓ 0, pointwiseon X (nonincreasing convergence). We must show that the convergence isuniform on X.

Let ε > 0. For each x ∈ X we can choose an integer n(x) such that R_{n(x)}(x) < ε. By continuity of R_{n(x)} we will then have R_{n(x)} < ε in a neighborhood Ux of x. By compactness we can cover X with finitely many of these neighborhoods, say Uxj, j = 1, 2, . . . , k. Let

N = max{n(x1), . . . , n(xk)}.

Assume that n ≥ N and let y ∈ X. Then we have y ∈ Uxj, for some j ∈ {1, . . . , k}, and consequently

Rn(y) ≤ R_{n(xj)}(y) < ε.

Taking the sup over all y ∈ X implies ‖Rn‖∞ ≤ ε and this shows thatRn ↓ 0, uniformly on X.


3.5 Characterization of functions in HK

Let F be a nonempty set and K : F ×F → R a positive semidefinite kernelon F . We now investigate when a function f : F → R is in the reproducingkernel Hilbert space H = HK with kernel K. Set

V = span{Ks : s ∈ F} ⊆ HK

and note that V is dense in H. Let ‖ · ‖K denote the norm in H and assumethat f ∈ H. Then

‖f‖K = sup{ |(f, g)| : g ∈ V, ‖g‖K ≤ 1 }.   (3.18)

Every g ∈ V can be written as

g = ∑_{s∈S} a(s)Ks,

with S ⊆ F a finite subset and coefficients a : S → R. For such g we have

(f, g) = ∑_{s∈S} a(s)f(s)

by the reproducing property of the kernel functions Ks and

‖g‖²K = ∑_{s,t∈S} a(s)a(t)K(s, t).

If now f : F → R is any function (not necessarily in H) we define

|f|K = sup_{S,a} |∑_{s∈S} a(s)f(s)| ,   (3.19)

where the sup is taken over all finite subsets S ⊆ F and coefficients a : S → R satisfying

∑_{s,t∈S} a(s)a(t)K(s, t) ≤ 1.   (3.20)

We have just seen that f ∈ H implies that |f |K = ‖f‖K <∞. Now we showthat conversely |f |K < ∞ implies that f ∈ HK . Assume that |f |K < ∞.Then the map

Λf : ∑_{s∈S} a(s)Ks ∈ V ↦ ∑_{s∈S} a(s)f(s) ∈ R

is well defined and satisfies |Λf(g)| ≤ |f|K, for all g = ∑_{s∈S} a(s)Ks ∈ V with ‖g‖K ≤ 1. Consequently Λf extends to a bounded linear functional Λ on H and there exists an element h ∈ H such that Λ(g) = (g, h), for all g ∈ H. Let s ∈ F and set g = Ks to obtain

f(s) = Λf (Ks) = (Ks, h) = h(s). (3.21)

Thus f = h ∈ H. We can summarize these findings as follows


Proposition 3.5.1. Let f : F → R be any real valued function on F . Thenf ∈ H if and only if |f |K <∞. If this is the case we have |f |K = ‖f‖K .

For each set S ⊆ F let HS be the reproducing kernel Hilbert space withkernel K|S×S on the set S. Note that the restriction g = f |S satisfies|g|K ≤ |f |K and so f ∈ H implies g ∈ HS and the restriction map

IS : f ∈ H 7→ f |S ∈ HS

is a contraction. Now let ℱ be the family of all finite subsets S ⊆ F and S ∈ ℱ. For any function f : F → R set

|f|S := sup_a |∑_{s∈S} a(s)f(s)| ,   (3.22)

where the sup is taken over all coefficients a : S → R satisfying

∑_{s,t∈S} a(s)a(t)K(s, t) ≤ 1.   (3.23)

Then |f|S < ∞ if and only if the restriction f|S is in HS in which case |f|S = ‖f|S‖K. Note that |f|S < ∞ if and only if

∑_{s∈S} a(s)Ks = 0   ⟹   ∑_{s∈S} a(s)f(s) = 0.

Obviously |f|K = sup_{S∈ℱ} |f|S and so

Proposition 3.5.2. Let f : F → R be any real valued function on F . Thenf ∈ H if and only if f |S ∈ HS, for each finite subset S ⊆ F and

sup_{S∈ℱ} ‖f|S‖K < ∞.

For a finite subset S ⊆ F let KS be the matrix KS := (K(s, t))s,t∈S .The columns of KS are the kernel functions Ks, s ∈ S restricted to the setS.

The matrix KS is positive semidefinite. Consequently KS is invertible ifand only if it is positive definite. Note that these properties do not dependon the particular numbering (order) of the elements in S.

Proposition 3.5.3. Let S be a finite subset of F . Then the matrix KS isinvertible if and only if the kernel functions Ks, s ∈ S, are linearly indepen-dent in H.


Proof. Since KS is positive semidefinite it is invertible if and only if it is positive definite. The result now follows from the equality

∑_{s,t∈S} a(s)a(t)K(s, t) = ‖∑_{s∈S} a(s)Ks‖²K ,

for all coefficients a : S → R.

Proposition 3.5.4. Assume that F is finite and the kernel functions Ks, s ∈ F, are linearly independent in RF. Then H = RF and for any function f : F → R we have

‖f‖²K = ∑_{s,t∈F} f(s)f(t)K⁻¹(s, t),

where K⁻¹ is the inverse of the kernel matrix KF = (K(s, t))_{s,t∈F}.

Proof. RF and H = span{Ks : s ∈ F} ⊆ RF both have dimension card(F) and it follows that H = RF. Let f ∈ RF and write f = ∑_{s∈F} a(s)Ks, that is,

f(t) = ∑_{s∈F} K(t, s)a(s),

for all t ∈ F and thus

a(s) = ∑_{t∈F} K⁻¹(s, t)f(t),

for all s ∈ F. From this it follows that

‖f‖²K = ∑_{s,t∈F} a(s)a(t)K(s, t) = ∑_{s∈F} a(s) ∑_{t∈F} a(t)Kt(s) = ∑_{s∈F} a(s)f(s) = ∑_{s,t∈F} f(s)f(t)K⁻¹(s, t).

Proposition 3.5.5. Let T ⊆ F be a subset such that the linear span of thekernel functions Ks, s ∈ T , is dense in H and f : F → R be any function.If f |T ∈ HT , then there exists a unique element h ∈ H with h = f on theset T .

Proof. Let IT : h ∈ H 7→ h|T ∈ HT be the restriction map. Since IT is acontraction the linear functional

Λ(g) = (f |T , IT (g)) , g ∈ H,


is bounded and there exists an element h ∈ H such that Λ(g) = (g, h), forall g ∈ H. Let R be the restriction of the kernel K to the set T ×T . If s ∈ Twe have IT (Ks) = Rs and so

f(s) = (f |T )(s) = (f |T , Rs) = Λ(Ks) = (Ks, h) = h(s).

Thus f = h on T . It remains to be shown that h is unique. Indeed, if h ∈ Hand h = f on T , then

(Ks, h) = h(s) = f(s)

for all s ∈ T and this condition determines h uniquely since the span of theKs, s ∈ T , is dense in H.

Proposition 3.5.6. Assume that H is separable. Then there exists a countable subset T = {sn : n ≥ 1} ⊆ F such that the kernel functions Ks, s ∈ T, are linearly independent and have dense span in H. For n ≥ 1 let Tn = {s1, . . . , sn}. Then

‖f‖K = limn ‖f |Tn‖K , f ∈ H.

Remark. Here ‖f|Tn‖K is the norm of the restriction f|Tn in the reproducing kernel Hilbert space HTn on Tn with kernel K.

Proof. Choose a dense sequence (hn) ⊆ H. For each n choose a finite subset Sn ⊆ F and coefficients an : Sn → R such that

‖hn − ∑_{s∈Sn} an(s)Ks‖K < 1/n.

Let S = ⋃n Sn. Then the kernel functions Ks, s ∈ S, have dense span in H. Now enumerate S as S = {sj : j ≥ 1}. Inductively delete each sj such that Ksj is in the linear span of the Ksi, i < j. The remaining set T has all required properties.

Let f ∈ HK . Then ‖f |Tn‖K ≤ ‖f‖K , for each n ≥ 1. It will thus sufficeto show that ‖f‖K = supn ‖f |Tn‖K .

If f = ∑_{s∈Tn} a(s)Ks then ‖f‖K = ‖f|Tn‖K by direct computation of the norms. For the remaining f ∈ H the result follows by density and the fact that the restriction maps In : H → HTn are contractions and hence equicontinuous.

3.6 Kernel domination

Let F be a nonempty set and R, K : F × F → R positive semidefinite kernels on F and HK, HR the reproducing kernel Hilbert spaces with kernels K and R respectively. We say that the kernel R dominates the kernel K, denoted R ≫ K, iff

HK ⊆ HR. (3.24)

This property will now be investigated. Note that the inclusion makes sense since both HK and HR are spaces of real valued functions on F.

Proposition 3.6.1. If HK ⊆ HR then the inclusion map is automatically continuous, that is, there is a constant C such that

‖f‖R ≤ C ‖f‖K , f ∈ HK . (3.25)

Proof. This follows immediately from the Closed Graph Theorem. Note that convergence in either ‖ · ‖K or ‖ · ‖R implies convergence pointwise at all points of F.

Here is the characterization of kernel domination:

Proposition 3.6.2. The following are equivalent:
(i) The kernel R dominates K (i.e. HK ⊆ HR).
(ii) There is a constant C such that

∑_{s,t∈S} a(s)a(t)K(s, t) ≤ C² ∑_{s,t∈S} a(s)a(t)R(s, t),   (3.26)

for all finite subsets S ⊆ F and all coefficients a : S → R.
(iii) There is a bounded linear operator L : HR → HK which satisfies L(Rs) = Ks, for all s ∈ F.
(iv) There is a bounded linear operator L : HR → HK which satisfies (L(Rs), Rt)R = K(s, t), for all s, t ∈ F.

Definition 3.6.1 (Domination operator). The operators L : HR → HK in (iii) and (iv) are easily seen to be the same. Moreover the properties in (iii) or (iv) uniquely determine L. The operator L is called the domination operator.

Proof. (i)⇒(ii) Assume that HK ⊆ HR and let C be the constant from (3.6.1). Let S ⊆ F be a finite subset, a : S → R any coefficients and set f = ∑_{s∈S} a(s)Ks. Using the reproducing property of the kernel functions Rt in HR we can write

‖f‖²K = ∑_{s,t∈S} a(s)a(t)K(s, t) = ∑_{t∈S} a(t)f(t) = ( f, ∑_{t∈S} a(t)Rt )R ≤ ‖f‖R ‖∑_{t∈S} a(t)Rt‖R ≤ C ‖f‖K ‖∑_{t∈S} a(t)Rt‖R.


Divide by ‖f‖K to obtain

‖f‖K ≤ C ‖∑_{t∈S} a(t)Rt‖R

and square both sides to obtain (ii).

(ii)⇒(iii) Let V ⊆ HR be the linear span of the kernel functions Rs, s ∈ F. Rewrite the inequality in (ii) as

‖∑_{s∈S} a(s)Ks‖²K ≤ C² ‖∑_{s∈S} a(s)Rs‖²R.

This shows that the linear map

L : ∑_{s∈S} a(s)Rs ∈ V ↦ ∑_{s∈S} a(s)Ks ∈ HK

is well defined and continuous. Consequently L extends to a bounded linear operator L : HR → HK which satisfies L(Rs) = Ks. It remains to be shown that L is uniquely determined. Indeed, the requirement L(Rs) = Ks combined with linearity determines L uniquely on the dense subspace V ⊆ HR and continuity now determines L on all of HR.

(iii)⇒(iv) The operator L in (iii) has the property required in (iv).

(iv)⇒(ii) The operator L in (iv) satisfies (LRs, Rt)R = (Ks, Rt)R = K(s, t) by the reproducing property of the kernel functions Rt in HR. Set D = ‖L‖ and let S ⊆ F be any finite subset and a : S → R any coefficient function. Set f =

∑_{s∈S} a(s)Rs ∈ HR. Then we have

∑_{s,t∈S} a(s)a(t)K(s, t) = ∑_{s,t∈S} a(s)a(t)(LRs, Rt)R = (Lf, f)R ≤ D ‖f‖²R = D ∑_{s,t∈S} a(s)a(t)R(s, t).

(ii)⇒(i) According to (ii) we have

∑_{s,t∈S} a(s)a(t)R(s, t) ≤ 1  ⇒  ∑_{s,t∈S} a(s)a(t)K(s, t) ≤ C²  ⇒  ∑_{s,t∈S} (a(s)/C)(a(t)/C)K(s, t) ≤ 1,

for all finite subsets S ⊆ F and coefficients a : S → R. If f : F → R is any function the quantities |f|K, |f|R defined in (3.19) of the previous section satisfy

|f|R ≤ C|f|K

by the way the sets over which the sups are taken are constrained. If now f ∈ HK, then |f|K < ∞ which implies |f|R < ∞ which implies f ∈ HR according to (3.5.1).

Let us note some properties of the domination operator:


Proposition 3.6.3 (Domination operator). Assume that HK ⊆ HR. Then the domination operator is the unique bounded linear operator L : HR → HK satisfying

(Lf, g)K = (f, g)R, f ∈ HR, g ∈ HK .

L is also an operator from HR into itself and as such it is bounded, selfadjointand positive.

Proof. The operator L is characterized by L(Rs) = Ks, s ∈ F . By continuityand linearity of L as well as bilinearity of the inner products it will sufficeto show that

(LRs,Kt)K = (Rs,Kt)R, s, t ∈ F, (3.27)

equivalently (Ks,Kt)K = (Rs,Kt)R. Indeed, both sides equal K(s, t) bythe reproducing properties of the kernel functions in HK and HR. We canrewrite (3.27) as (LRs)(t) = K(s, t) and this obviously uniquely determinesL on kernel functions Rs and hence on all of HR.

Now consider L as an operator of HR into itself. Since L : HR → HK iscontinuous and the inclusion of HK into HR is continuous, L : HR → HR iscontinuous. To verify that L is selfadjoint it will suffice to show that

(LRs, Rt)R = (Rs, LRt)R,

equivalently (Ks, Rt)R = (Rs,Kt)R and indeed both sides equal K(s, t).Finally, if f ∈ HR is of the form

f = ∑_{s∈S} a(s)Rs

for some finite subset S ⊆ F and coefficients a : S → R, then

(Lf, f)R = ( ∑_{s∈S} a(s)Ks, ∑_{s∈S} a(s)Rs )R = ∑_{s,t∈S} a(s)a(t)K(s, t) ≥ 0.

The same now follows for arbitrary f ∈ HR by continuity. Thus L : HR → HR is positive.
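On a finite set S condition (ii) of Proposition 3.6.2 can be checked numerically. The following sketch assumes NumPy, uses illustrative kernels R and K, and assumes RS is positive definite so that the best constant C² is the largest eigenvalue of RS^(-1/2) KS RS^(-1/2).

```python
import numpy as np

# Sketch of the domination inequality a^T K_S a <= C^2 a^T R_S a on a finite set S.
rng = np.random.default_rng(5)
S = np.sort(rng.uniform(0.0, 1.0, 8))

def gauss(pts, scale):
    return np.exp(-(pts[:, None] - pts[None, :])**2 / scale)

RS = gauss(S, 0.5)
KS = 0.3 * gauss(S, 0.5) ** 2            # entrywise square of a PSD kernel is again PSD

# Whitening by R_S^{-1/2} turns the inequality into an eigenvalue bound.
evalsR, evecsR = np.linalg.eigh(RS)
R_inv_sqrt = evecsR @ np.diag(evalsR ** -0.5) @ evecsR.T
C_sq = np.linalg.eigvalsh(R_inv_sqrt @ KS @ R_inv_sqrt).max()

a = rng.standard_normal(S.size)          # any coefficient vector satisfies the inequality
print(a @ KS @ a <= C_sq * (a @ RS @ a) + 1e-10)
```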

The domination R ≫ K is called nuclear if R dominates K and the domination operator L : HR → HR is a trace class operator as an operator of HR into itself. In this case we write R ≫≫ K.


Example 3.6.1. Assume that the kernels K, R on F have the form

K(s, t) = ∑_{n=1}^∞ λn φn(s)φn(t)   and   R(s, t) = ∑_{n=1}^∞ µn φn(s)φn(t),

for some functions φn : F → R and numbers λn, µn > 0 such that the above series converge pointwise on F × F. Then K and R are symmetric, positive semidefinite kernels on F and we have

∑_{s,t∈S} a(s)a(t)K(s, t) = ∑_{n=1}^∞ λn |∑_{s∈S} a(s)φn(s)|²   (3.28)

and a similar formula holds for the kernel R. From this we see that R dominates K if

sup_n λn/µn < ∞.

Under suitable additional assumptions on the set F and functions φn the en := √µn φn, n ≥ 1, are an orthonormal basis in HR. See section 3.8 below. Assume that this is the case. Then there is a unique linear operator L : HR → HR such that Lφn = (λn/µn)φn, n ≥ 1. Since the expansion

Rs = ∑_{n=1}^∞ µn φn(s)φn = ∑_{n=1}^∞ (Rs, en)en,

converges in HR we have LRs = Ks and it follows that L is the domination operator. L satisfies

(L(√µn φn), √µn φn)R = λn/µn

and it follows that R dominates K in a nuclear fashion if and only if

∑_{n=1}^∞ λn/µn < ∞.

Proposition 3.6.4. Assume that R ≫≫ K. Then there exists a symmetric, positive semidefinite kernel Q on F such that R ≫ Q ≫≫ K and the reproducing kernel Hilbert space HQ is separable.

Proof. Let L : HR → HR be the domination operator (LRs = Ks, s ∈ F ).By assumption L is a trace class operator and so compact. It follows thatthe subspace H := N(L)⊥ ⊆ HR is separable. Note that L maps H into H,let

B : HR → N(L)⊥

be the orthogonal projection and define the kernel Q : F × F → R as

Q(s, t) = (BRs, Rt)R = (BRs)(t).


Since B is a selfadjoint projection, Q is a symmetric, positive semidefinite kernel on F which is dominated by R.

We claim that the reproducing kernel Hilbert space HQ is equal to H. Let V be the span of the kernel functions Rs, s ∈ F, in HR. By definition of the kernel Q we have BRs = Qs and hence B maps V into HQ. If f = ∑_{s∈S} a(s)Rs ∈ V, then Bf = ∑_{s∈S} a(s)Qs and so

‖Bf‖²Q = ∑_{s,t∈S} a(s) a(t) Q(s, t) = ∑_{s,t∈S} a(s) a(t) (BRs, Rt)R = (Bf, f)R ≤ ‖f‖²R.

From this it follows that B : V → HQ has a unique extension to a bounded linear map B : HR → HQ and this extension satisfies

‖Bf‖²Q = (Bf, f)R,    f ∈ HR.

The restriction of B to H = N(B)⊥ is then an isometric isomorphism of H with HQ. But note that this restriction is the identity map. Thus HQ = H. In particular the Hilbert space HQ is separable.

Let M := L|H be the restriction of L to H = HQ. Then M : HQ → HQ is a bounded linear operator. For s ∈ F we have

Ks = L(Rs) = L(BRs) = L(Qs) = M(Qs).

Here we have used that L = LB since B is the orthogonal projection onto N(L)⊥. The preceding equality shows that the kernel Q dominates K and that M is the domination operator (see 3.6.2). Since M is again a trace class operator we have Q ≫≫ K.

Proposition 3.6.5. Let R and K be symmetric, positive semidefinite kernels on F such that R ≫≫ K, let L : HR → HR be the domination operator and S ⊆ F a finite subset such that the kernel functions Rs, s ∈ S, are linearly independent in RF. Then Tr(KS RS⁻¹) ≤ Tr(L).

Remark. Here Tr(A) denotes the trace of the operator A and RS is the matrix (R(s, t))_{s,t∈S}. This matrix is invertible according to 3.5.3.

Proof. Let R and K be the restrictions of the kernels R and K to the set S and HS be the reproducing kernel Hilbert space on the set S with kernel R. Then the restriction operator

IS : f ∈ HR ↦ f|S ∈ HS


is a contraction. Since the matrix RS is positive definite the kernel functions Rs ∈ HS, s ∈ S, are linearly independent and hence a basis for HS. Consequently we have a well defined linear map

JS : ∑_{s∈S} a(s) Rs ∈ HS ↦ ∑_{s∈S} a(s) Rs ∈ HR

and this map is an isometry of HS into HR. Let

U = IS L JS : HS → HS

and note that HS is finite dimensional and

Tr(U) ≤ ‖IS‖ Tr(L) ‖JS‖ ≤ Tr(L).

It will now suffice to show that Tr(U) = Tr(KS RS⁻¹). Since LRs = Ks we have URs = Ks. Thus the matrix A which represents the linear map U in the basis {Rs : s ∈ S} of HS satisfies

Ks = ∑_{t∈S} A(s, t) Rt.

Evaluating this at any j ∈ S we see that KS(s, j) = (A RS)(s, j). It follows that KS = A RS and so Tr(U) = Tr(A) = Tr(KS RS⁻¹).

3.7 Approximation in reproducing kernel Hilbert spaces

Let K be a symmetric, positive semidefinite kernel on F and H the reproducing kernel Hilbert space with kernel K. Assume that we are given some points xj ∈ F, 1 ≤ j ≤ n, and values yj ∈ R (the "data") and we want to find a function f ∈ H such that the deviation of f(xj) from yj is small in some sense. Assume also that this deviation is measured by a penalty function (cost function) C of the form

C(f) = c((x1, y1, f(x1)), . . . , (xn, yn, f(xn))) + q(‖f‖K),    f ∈ H,    (3.29)

where c : (F × R²)^n → [0,+∞) is arbitrary and q : [0,+∞) → [0,+∞) is nondecreasing. A typical example would be the penalty function

C(f) = ∑_{j=1}^n (yj − f(xj))² + λ ‖f‖K,

where λ > 0. The objective is to find a function f ∈ H which makes this penalty small. Ideally we want f ∈ H which minimizes the penalty. The following theorem shows that we have to look only in the linear span of the kernel functions Kxj corresponding to the data points xj:


Theorem 3.7.1 (Representer Theorem). Let xj, yj, j ≤ n, be as above, C = C(f) a penalty function as in (3.29) and

V = span( Kxj : 1 ≤ j ≤ n ).

Then for each f ∈ H we have C(f) ≥ C(πV(f)), where πV(f) denotes the orthogonal projection of f on V.

Remark. This theorem first appeared in [GK71]. The present form is due to [Sch]. Despite the simple proof the theorem provides an extremely powerful simplification of the problem of minimizing the cost function C(f) for f ∈ H. Only linear combinations

f = ∑_{j≤n} αj Kxj

have to be considered.

Proof. Let g = πV(f) and h = f − g. Then ‖g‖K ≤ ‖f‖K and h ⊥ V and so h(xj) = (h, Kxj) = 0, that is, g(xj) = f(xj), for all 1 ≤ j ≤ n. It follows that q(‖f‖K) ≥ q(‖g‖K) and so

C(f) = c((x1, y1, f(x1)), . . . , (xn, yn, f(xn))) + q(‖f‖K)
     = c((x1, y1, g(x1)), . . . , (xn, yn, g(xn))) + q(‖f‖K)
     ≥ c((x1, y1, g(x1)), . . . , (xn, yn, g(xn))) + q(‖g‖K) = C(g).
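The theorem reduces the search to the coefficients αj of such a linear combination. As a hedged illustration, the sketch below minimizes the common squared-norm variant ∑_j (yj − f(xj))² + λ‖f‖²K (kernel ridge regression), for which the representer coefficients have the closed form α = (K + λI)⁻¹ y; the Gaussian kernel, the data and the parameter values are illustrative assumptions, not taken from the text.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=10.0):
    # K(x, y) = exp(-gamma |x - y|^2), a symmetric positive semidefinite kernel
    return np.exp(-gamma * (x[:, None] - y[None, :]) ** 2)

# Illustrative data (hypothetical, not from the text)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)

lam = 1e-2                                             # regularization weight
K = gaussian_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(x.size), y)   # representer coefficients

def f_hat(s):
    # f(s) = sum_j alpha_j K(s, x_j): a linear combination of the kernel functions K_{x_j}
    return gaussian_kernel(np.atleast_1d(s), x) @ alpha

print(f_hat(0.3))
```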

3.8 Orthonormal bases

As yet we have no way to find an orthonormal basis for the reproducing kernel Hilbert space H with kernel K on a set F.

If the underlying set F is countable, F = (xj), then we can simply apply the Gram-Schmidt orthonormalization procedure to the sequence of kernel functions Kxj and obtain an orthonormal basis in that way.

Otherwise we need more structure on the set F. Thus we shall now assume that F is a closed subset of Euclidean space Rd and that the kernel K on F is continuous and bounded and satisfies

∫_F K(x, x) dx < ∞.    (3.30)

In this case K is called a Mercer kernel on F. Note that symmetry and positive semidefiniteness are standing assumptions. The continuity of K implies that every function f ∈ H is continuous on F (3.4.2).


Let dx denote Lebesgue measure on F. From the basic kernel inequality |K(x, y)|² ≤ K(x, x) K(y, y) it follows that the kernel K is square integrable, that is, K ∈ L2(F × F). It follows that the integral operator T = TK : L2(F) → L2(F) is well defined via

(Tf)(x) = ∫_F K(x, y) f(y) dy = (f, Kx)2,    f ∈ L2(F), x ∈ F,

and is a Hilbert-Schmidt operator. Here (·, ·)2 denotes the inner product in L2(F) whereas (·, ·)K will denote the inner product in the reproducing kernel Hilbert space H. Likewise ‖ · ‖K will denote the norm in H while ‖ · ‖2 denotes the L2-norm.

Note that (Tf)(x) is defined at every point x ∈ F, since all kernel functions Kx are in L2(F).

We will now relate the integral operator T to the reproducing kernel Hilbert space H with kernel K. In particular it will turn out that the eigenvalues and eigenvectors of T can be used to define an orthonormal basis of H.

First we show that H is a subset of L2(F) and that R(T) ⊆ H, that is, T maps into H. Let g ∈ H. From the reproducing property

g(x) = (g, Kx)K, for all g ∈ H and x ∈ F,    (3.31)

it follows that

|g(x)| ≤ ‖g‖K ‖Kx‖K = ‖g‖K √K(x, x).

This implies that g ∈ L2(F) by assumption (3.30). In short we have H ⊆ L2(F). Next we claim that the map

Φ : x ∈ F ↦ Kx ∈ H

is continuous. We need the continuity of this map for vector valued integration with values in H below (weak measurability, see Appendix A). Indeed

‖Kx − Ky‖²K = K(x, x) + K(y, y) − 2K(x, y) → 0,

as x → y, by continuity of K (along its diagonal). We claim that the range R(T) is contained in H. Let f ∈ L2(F). We have already seen that Tf can be viewed as an L2(F)-valued Bochner integral

Tf = ∫_F f(y) Ky dy.    (3.32)


However the integrand f(y)Ky takes values in the reproducing kernel Hilbert space H. Using ‖Ky‖²K = K(y, y), the Cauchy-Schwarz inequality and assumption (3.30) it is easily seen that the integral is norm convergent in H. Let

h := ∫_F f(y) Ky dy ∈ H

be the value of this integral in H. We claim that h = Tf and in particular Tf ∈ H. Let x ∈ F. Commute the integral (3.32) with the evaluation functional at x (which is continuous on H) to obtain

h(x) = ∫_F f(y) Ky(x) dy = ∫_F K(x, y) f(y) dy = (Tf)(x),

as desired. Next we claim that

(Tf, g)K = (f, g)2 , for all f ∈ L2(F ), g ∈ H. (3.33)

Indeed, using the representation of Tf as an H-valued Bochner integral and the reproducing property (Ky, g)K = g(y) we obtain

(Tf, g)K = ∫_F f(y) (Ky, g)K dy = ∫_F f(y) g(y) dy = (f, g)2.

We have already seen that H ⊆ L2(F ). From (3.33) it follows that

H ⊆ N(T )⊥. (3.34)

Since K is symmetric the integral operator T is selfadjoint. Next we show that T is in fact positive. To do this the positive semidefinite property of K will be extended from sums to integrals:

Proposition 3.8.1. The integral operator T is positive on L2(F), that is, (Tf, f) ≥ 0, for all f ∈ L2(F).

Proof. (A) Assume first that K = 0 outside of A × A, where A ⊆ F is compact. Let µ be any signed Borel measure on F. For each partition P = {Rj} of A into finitely many measurable sets and points xj ∈ Rj let

KP = ∑_{i,j} K(xi, xj) 1_{Ri×Rj},

and set wj = µ(Rj). Then the wj are finite, KP = 0 outside of A × A and

∫_{F×F} KP(x, y) dµ(x) dµ(y) = ∑_{i,j} wi wj K(xi, xj) ≥ 0.


By continuity of K we can choose a sequence of such partitions Pn such that Kn := KPn → K uniformly on A × A and hence on F × F. Since the product measure µ ⊗ µ on F × F has finite variation it follows that

∫_{F×F} K(x, y) dµ(x) dµ(y) = lim_n ∫_{F×F} Kn(x, y) dµ(x) dµ(y) ≥ 0.

Now, if f ∈ L2(F), then 1A f ∈ L1(F) (Cauchy-Schwarz) and hence the signed measure µ = 1A(x) f(x) dx with density 1A f is defined. Applying the above to µ we obtain

(Tf, f) = ∫_{F×F} K(x, y) f(x) f(y) dx dy ≥ 0.

This shows that the operator T is positive.

(B) In the general case let φn : F → [0, 1] be a sequence of continuous functions with compact support such that φn(x) = 1 if ‖x‖ ≤ n, and set Kn(x, y) = φn(x) K(x, y) φn(y). Then Kn is continuous, symmetric, positive semidefinite and satisfies (3.30). Thus the integral operator Tn = TKn on L2(F) is positive. Note that

Kn → K in L2(F × F).

Thus Tn → T in operator norm and it follows that T is positive also.

Remark 3.8.1 (Positive definiteness). If K is positive definite it does not follow that (Tf, f) > 0 for f ≠ 0 (this is equivalent with N(T) = 0). Let F = [0, 1] and consider an ON-basis {φj : j ≥ 0} for L2(F) which consists of continuous functions and let

K(x, y) = ∑_{n≥1} n⁻² φn(x) φn(y).

Then φ0 ∈ N(T) but K can be shown to be positive definite for suitable choice of the φj.

We have now seen that T is a positive compact operator. Consequently the eigenvalues λj of T are nonnegative. According to the spectral theory of compact, selfadjoint operators (2.10) the complement N(T)⊥ has an orthonormal basis {φj} consisting of eigenvectors of T. Fix such a basis ordered such that the associated eigenvalues λj satisfy

λ0 ≥ λ1 ≥ · · · ≥ λn ≥ · · · > 0.

Recall that the complement N(T )⊥ contains H. With this we can prove that


Proposition 3.8.2. Set ψj = √λj φj. Then B = {ψj : j ≥ 0} is an orthonormal basis for the reproducing kernel Hilbert space H.

Proof. We have ψj = λj^{−1/2} Tφj ∈ R(T) ⊆ H. From (3.33) it follows that the ψj are pairwise orthogonal in H. Moreover

‖ψj‖²K = (λj φj, φj)K = (Tφj, φj)K = ‖φj‖²2 = 1.

This shows that B is an orthonormal system in H. It remains to be shown that it is complete. Let g ∈ H and assume that g ⊥ ψj in H, for all j ≥ 0. Then g ∈ N(T)⊥ ⊆ L2(F) and

0 = (λj φj, g)K = (Tφj, g)K = (φj, g)2,

that is, g ⊥ φj in L2(F). Since the φj are a basis of N(T)⊥ it follows that g = 0 in L2(F), that is, g = 0 almost everywhere. But we have already seen that every function g ∈ H is continuous. Thus we have g(x) = 0 at every point x ∈ F, that is, g = 0 in H.

In terms of the orthonormal basis {ψj} above the kernel expansion (3.17) of K becomes

K(x, y) = ∑_j λj φj(x) φj(y)    (3.35)

with pointwise convergence, uniform on compact subsets of F × F (see 3.4.2). This fact is called Mercer's theorem. It is simply a special case of 3.4.2. Set x = y to obtain

K(x, x) = ∑_j λj φj(x)².    (3.36)

Integrate over F and note that ‖φj‖²2 = 1. Since the series on the right has nonnegative terms we can integrate term by term to obtain

∑_j λj = ∫_F K(x, x) dx < ∞.

Thus T is a trace class operator with tr(T) = ∫_F K(x, x) dx.
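In practice the eigenpairs (λj, φj) of T are rarely available analytically; a standard workaround is to discretize the integral operator on a grid (a Nyström-type approximation). The following sketch does this under illustrative assumptions (the Brownian-motion kernel min(x, y) on F = [0, 1] is a hypothetical choice) and checks the trace formula numerically.

```python
import numpy as np

# Discretize (Tf)(x) = integral of K(x,y) f(y) dy on F = [0,1] over an equally
# spaced grid; the kernel below (Brownian-motion covariance) is a hypothetical choice.
def kernel(x, y):
    return np.minimum(x[:, None], y[None, :])

m = 500
x = (np.arange(m) + 0.5) / m          # grid midpoints, each with weight h = 1/m
h = 1.0 / m

K = kernel(x, x)
evals, evecs = np.linalg.eigh(h * K)  # eigenpairs of the discretized operator
evals = evals[::-1]                   # descending: lambda_0 >= lambda_1 >= ...
phi = evecs[:, ::-1] / np.sqrt(h)     # grid values of approximately L2-normalized eigenfunctions

print("largest eigenvalues  :", evals[:5])
print("sum of eigenvalues   :", evals.sum())
print("approx. tr(T)        :", h * np.diag(K).sum())   # tr(T) = integral of K(x,x) dx
```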

3.8.1 Second description of H

We continue with the setup of the previous section. The positive square root Q = T^{1/2} is the unique bounded linear operator Q on L2(F) which satisfies

Qφj = √λj φj,


for all j ≥ 0. Note that Q maps the ON-basis {φj} of N(T)⊥ onto the ON-basis {ψj} of H. Consequently

Q := T^{1/2} : L2(F) → H    (3.37)

is an isometric isomorphism of N(T)⊥ with H. This allows us to give a second description of the reproducing kernel Hilbert space H:

Let f ∈ N(T)⊥ ⊆ L2(F) and assume that f = Qg with g ∈ L2(F). Then (f, φj) = (g, Qφj) = √λj (g, φj) (all inner products in L2(F)) and it follows that

∑_j λj⁻¹ |(f, φj)|² = ∑_j |(g, φj)|² ≤ ‖g‖²2 < ∞.    (3.38)

Conversely if the sum on the left is finite, then the series

g := ∑_j λj^{−1/2} (f, φj) φj

converges in L2(F) and the sum g satisfies Qg = f. In other words f is in the range of Q if and only if the sum on the left of (3.38) is finite. The condition f ∈ N(T)⊥ ensures that only the eigenvectors φj of T corresponding to the nonzero eigenvalues are needed in the expansion of f. The isometric property of Q with Qg = f shows that

‖f‖²K = ‖g‖²2 = ∑_j λj⁻¹ |(f, φj)|².    (3.39)

From this we see that the reproducing kernel Hilbert space H is the space

H = { f ∈ N(T)⊥ ⊆ L2(F) : ∑_j λj⁻¹ |(f, φj)|² < ∞ }    (3.40)

endowed with the norm (3.39) and inner product

(f, h)K = ∑_j λj⁻¹ (f, φj)(h, φj),

where the inner products without subscripts are inner products in L2(F ).
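A small numerical sketch of this membership criterion, under illustrative assumptions (the eigenvalues λj = j⁻² and the coefficients (f, φj)2 = j⁻² are hypothetical choices): f belongs to H exactly when the weighted sum in (3.40) converges, and (3.39) then gives its RKHS norm.

```python
import numpy as np

# Hypothetical eigenvalues of T and L2-coefficients of a function f (not from the text).
j = np.arange(1, 200001)
lam = j ** -2.0          # eigenvalues lambda_j of T, in decreasing order
c = j ** -2.0            # coefficients (f, phi_j)_2; try c = j ** -1.0 to see divergence

weighted = c ** 2 / lam  # lambda_j^{-1} |(f, phi_j)|^2, the terms in (3.40)

print("||f||_2^2 ~", (c ** 2).sum())
print("||f||_K^2 ~", weighted.sum())   # finite, so this f lies in H, with norm (3.39)
```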

Remark 3.8.2. Recall that the eigenvalues λj of T are arranged in decreasing order. Thus (3.39) implies

‖f‖2 ≤ √λ0 ‖f‖K,    f ∈ H,

that is, the norm on H dominates the L2-norm.

Since R(T) ⊆ H the operator T maps H into itself. Using (3.33) with g = f we see that the restriction of T to H is positive as an operator on H.


Let C denote the norm of T as an operator on L2(F). Using (3.33) with g = Tf we obtain

‖Tf‖²K = (f, Tf)2 ≤ C ‖f‖²2 ≤ C λ0 ‖f‖²K

and this shows that T is bounded as an operator on H. Indeed H has a basis consisting of eigenvectors of T with eigenvalues λj > 0 satisfying ∑_j λj < ∞. Thus the restriction of T to H is a trace class operator.


Chapter 4

Gaussian Measures

In this chapter we introduce Gaussian measures on separable Hilbert space. These measures are a generalization of multinormal distributions on the finite dimensional Hilbert space Rd.

4.1 Probability measures in Hilbert space

Distribution of a random object. Let (Ω, F, P) be a probability space and (Ψ, G) a measurable space. A measurable map X : (Ω, F, P) → (Ψ, G) is called a random object in Ψ. The distribution PX of X on Ψ is the image X(P) of the measure P under X, that is, the probability measure on (Ψ, G) defined as

[X(P)](G) = P(X⁻¹(G)),    G ∈ G.

The distribution PX of X satisfies

E(f(X)) = ∫_Ω f ∘ X dP = ∫_Ψ f(x) PX(dx)    (4.1)

for each f ∈ L1(PX) (the transformation formula). For indicator functions f = 1A, A ∈ G, this is true by definition of the measure PX and the rest follows by the usual extension procedures.

Mean in finite dimensional space. For a random object X with values in a general measurable space (Ψ, G) the mean E(X) does not make sense. Recall that the mean is a sort of average and to compute averages in Ψ we need addition and scalar multiplication on the space Ψ.

These are defined if Ψ = Rd and G = B(Rd) is the family of Borel subsets of Rd. In this case X is called a random vector and the mean E(X) is defined


as

E(X) = ∫_{Rd} x PX(dx).

If P is any Borel probability measure on Rd, then P is the distribution of the random vector X(t) = t defined on (Ω, F, P) = (Rd, B(Rd), P) and so we define the mean µ of P to be the vector

µ = ∫_{Rd} t P(dt).

The mean is the unique vector µ ∈ Rd satisfying

Λ(µ) = ∫_{Rd} Λ(t) P(dt) = EP(Λ),

for all linear functionals Λ : Rd → R. The distribution P is called centered if its mean µ is zero. These notions will now be generalized from the finite dimensional space Rd to infinite dimensional Hilbert space H.

Cylinder sets. Let H be a Hilbert space. To consider probability measures on H we need to specify the sigma field E of events on which these probabilities are defined. At the very least we want each continuous linear functional Λ on H to be a measurable map

Λ : (H, E) → (R,B(R)),

that is, we want each continuous linear functional Λ on H to be a random variable. It turns out that this requirement will be sufficient for our purposes and so we let E = E(H) be the σ-field generated by the continuous linear functionals on H. Then E is also the σ-field generated by the cylinder sets

Z = Z(Λ1, . . . , Λn, B) = { x ∈ H : (Λ1(x), . . . , Λn(x)) ∈ B },    (4.2)

where n ≥ 1, B ⊆ Rn is a Borel set and the Λj are continuous linear functionals on H. In this case each Λj has the form

Λj(x) = (x, xj), x ∈ H,

for a unique vector xj ∈ H. We identify Λj with xj and write

Z = Z(x1, . . . , xn, B).

Proposition 4.1.1. An H-valued map X is measurable with respect to the σ-field E if and only if the function Λ ∘ X is measurable, for each continuous linear functional Λ on H.


If H is separable then E is the Borel σ-field on H but in general the Borel σ-field is too large to work with.

Note that the family of cylinders contains the space H itself and is closed under intersection. Consequently a probability measure on E is already uniquely determined by its values on the cylinder sets. This is a standard fact of probability theory (key words: π-system, λ-system, π-λ-theorem). In other words, a probability measure P on E is uniquely determined by its images H(P) under all linear maps

H : H → Rd

into the finite dimensional spaces Rd, d ≥ 1.

Probabilities on H, Mean. A probability on H is a probability measure P on the σ-field E generated by the continuous linear functionals on H. Let p > 0. We say that P is of weak order p if each continuous linear functional Λ on H is in Lp(P), that is,

EP(|Λ|^p) = ∫_H |Λ(x)|^p P(dx) < ∞.

Assume now that H is separable and let (fn) be a dense sequence in the unit ball of H. Then the norm on H satisfies

‖x‖ = sup_n |(x, fn)| = sup_n |Λfn(x)|

and is consequently measurable with respect to the sigma field E on H. We say that P is of strong order p if

∫_H ‖x‖^p P(dx) < ∞.

It is easy to see that strong order p implies weak order p. The terms first and second order refer to the cases p = 1 and p = 2.

Proposition 4.1.2 (Mean). Assume that P has weak first order. Then the identity map I : (H, E, P) → H is weakly integrable, that is, there exists µ ∈ H such that

Λ(µ) = EP(Λ),

for all continuous linear functionals Λ on H. The element µ is uniquely determined and called the mean of P.


Remark. In the terminology of Appendix (A) µ is the weak integral

µ = ∫_H x P(dx)

which justifies the terminology "mean of P".

Proof. For x ∈ H let Λx be the bounded linear functional Λx(·) = (·, x) as usual. We have to show only that

Θ : x ∈ H ↦ EP(Λx)

is a bounded linear functional on H since this implies the existence of a unique vector µ ∈ H such that Θ(x) = (x, µ), that is

Λx(µ) = (x, µ) = EP(Λx),

for all x ∈ H. For n ≥ 1 let An = { u ∈ H : ‖u‖ ≤ n } and define Θn : H → R as

Θn(x) = EP(1_{An} Λx),    x ∈ H.

Clearly each Θn is a linear functional on H. Moreover Θn(x) → Θ(x), as n ↑ ∞, for all x ∈ H, since Λx ∈ L1(P) and An ↑ H. By the Uniform Boundedness Principle a pointwise limit of bounded linear functionals is again a bounded linear functional on H. Thus it will suffice to show that each Θn is bounded. Indeed, since |Λx| ≤ n ‖x‖ on An, we have |Θn(x)| ≤ n ‖x‖, for all x ∈ H.

The probability P is called centered if its mean µ is zero, equivalently if every continuous linear functional on H is a random variable with mean zero under P. Let τw be the translation operator

τw : x ∈ H ↦ x − w ∈ H.

Then −w + P := τw(P) (image measure) is a probability on H with mean µ − w. In particular the probability Q := −µ + P is centered with P = µ + Q. This will allow us to reduce many proofs to the case of centered probabilities on H.

Covariance operator. Assume now that the probability P on H has weak second order and consequently weak first order also. Let µ ∈ H be the mean of P. Then

〈x, y〉P = CovP(Λx, Λy)    (4.3)
        = EP(Λx Λy) − EP(Λx) EP(Λy)
        = EP(Λx Λy) − (µ, x)(µ, y),

x, y ∈ H, defines a positive semidefinite bilinear form on H.


Proposition 4.1.3 (Covariance operator). The covariance 〈x, y〉P is a continuous positive semidefinite bilinear form on H. Consequently there exists a unique bounded linear operator Q on H such that

CovP (Λx,Λy) = (Qx, y), x, y ∈ H.

Q is selfadjoint and positive and is called the covariance operator of P .

Proof. As a consequence of the Uniform Boundedness Principle a pointwise limit of continuous bilinear forms on H is again a continuous bilinear form on H. For n ≥ 1 let An = { u ∈ H : ‖u‖ ≤ n } and define 〈x, y〉n as

〈x, y〉n = EP(1_{An} Λx Λy),    x, y ∈ H.

Since |Λx| ≤ n ‖x‖ on An and a similar inequality holds for Λy, we have

|〈x, y〉n| ≤ n² ‖x‖ ‖y‖,

that is, the bilinear forms 〈x, y〉n are continuous. Since Λx Λy ∈ L1(P) we have 〈x, y〉n → 〈x, y〉P, as n ↑ ∞, for all x, y ∈ H. Thus the bilinear form 〈x, y〉P is continuous.

The operator Q now exists and is uniquely determined according to the Lax-Milgram Theorem (2.1.1). From the symmetry of 〈x, y〉P it follows that Q is selfadjoint. The positive semidefiniteness of 〈x, y〉P implies that Q is positive.

Exercise 4.1.1. Let P be a probability on H which has weak second order, mean µ and covariance operator Q. Let τµ be the translation operator x ∈ H ↦ x − µ ∈ H. Then the image measure −µ + P = τµ(P) has weak second order, mean zero and the same covariance operator Q.

4.2 Gaussian measures on Hilbert space

Recall that a random variable Z is called normal if its distribution on R has density

f(t) = (1/(σ√(2π))) exp( −(t − µ)²/(2σ²) ).

In this case µ = E(Z) and σ² = Var(Z). We assume some familiarity with the properties of Gaussian (multinormal) probability distributions on Rd. In particular let us recall that an Rd-valued random vector X = (X1, . . . , Xd) is Gaussian if and only if the random variable

a · X = a1 X1 + · · · + ad Xd


is normal, for every vector a = (a1, . . . , ad) ∈ Rd. In this case the random variables Xj are independent if and only if they are uncorrelated. Here the interesting fact is that zero correlation implies independence for jointly normal variables.

Rephrased in terms of the distribution of X this means that a probability measure P on the Borel sets of Rd is multinormal if and only if the linear functional

Λa : x ∈ Rd ↦ (x, a) ∈ R

is a normally distributed random variable under P, for every a ∈ Rd.

All this makes sense in any infinite dimensional Hilbert space H. Let E be the σ-field generated by the continuous linear functionals on H.

A probability measure P on (H, E) is called Gaussian if the bounded linear functional

Λa : x ∈ H ↦ (x, a) ∈ R

is a normal random variable under P, for each a ∈ H. Assume that P is such a probability. Since a normal random variable has moments of all orders, P has weak order p for all p > 0. In particular P has weak second order and so the mean µ and covariance operator Q of P are well defined.

Proposition 4.2.1. Let x1, . . . , xn ∈ H. Then the random vector X = (Λx1, . . . , Λxn) on (H, P) is multinormal in Rn with mean X(µ) and covariance matrix Cij = (Qxi, xj).

Proof. By the very definition of Q the matrix Cij = (Qxi, xj) is the covariance matrix of X. The mean E(X) is given as

E(X) = (EP(Λx1), . . . , EP(Λxn)) = (Λx1(µ), . . . , Λxn(µ)) = X(µ).

To see that X is multinormal it will suffice to show that the random variable a · X is normal, for each vector a ∈ Rn. Indeed, for such a we have

a ·X = a1Λx1 + · · ·+ anΛxn = Λx,

where x = a1x1 + · · ·+ anxn ∈ H and it follows that a ·X is normal.

Exercise 4.2.1. Let P be any probability on (H, E), m ∈ H and τm the translation operator τm : x ∈ H ↦ x − m ∈ H. For any random vector

W : (H, E, P) → Rk

the distribution of W under −m + P = τm(P) (image measure) is the distribution of W ∘ τm under P. Conclude from this that P is Gaussian if and only if −m + P is a Gaussian measure on H.


Fourier transform. The Fourier transform P̂ of the probability P is the function P̂ : H → C defined as

P̂(u) = ∫_H exp(i(x, u)) P(dx),    u ∈ H.    (4.4)

Recall that P is uniquely determined by its values on cylinder sets, equivalently, by its images H(P) under all linear maps H : H → Rd into the finite dimensional spaces Rd, d ≥ 1.

Using the transformation formula we see that (4.4) determines the Fourier transforms of all these images H(P) uniquely. Consequently the Fourier transform P̂ determines the probability P uniquely. If P has weak first order and u ∈ H we can rewrite (4.4) as

P̂(u) = ∫_H exp(i Λu) dP = ∫_R exp(it) Pu(dt) = P̂u(1),

where Pu = Λu(P) is the image of the measure P under the random variable Λu, that is, the distribution of Λu under P. Assume now that P is a Gaussian measure with mean µ and covariance operator Q. Then Λu is normal with mean Λu(µ) = (µ, u) and variance (Qu, u). Recalling the form of the Fourier transform (characteristic function) of a normal random variable we obtain

P̂(u) = exp( i(µ, u) − ½ (Qu, u) ),    u ∈ H.    (4.5)

Theorem 4.2.1. The covariance operator Q is compact.

Proof. We use the fact that the positive operator Q has a square root, that is, there is an operator S ∈ B(H) with Q = S², see 2.3.6.

It will suffice to show that S is compact. To see this we have to show only that ‖Sen‖ → 0, for each orthonormal sequence (en) ⊆ H (2.4.3). Indeed, if (en) is such a sequence, then

∑_n |(x, en)|² ≤ ‖x‖² < ∞

and hence (x, en) → 0, as n ↑ ∞, for each x ∈ H. Then P̂(en) → 1 by the Dominated Convergence Theorem applied to the definition of the Fourier transform of P. From (4.5) it now follows that

‖Sen‖2 = (Qen, en) → 0.

We will see that Q is in fact a trace class operator. For this we need the following result.


Proposition 4.2.2. Let (Xn) be a sequence of independent normal random variables (all defined on the same probability space) with means E(Xn) = 0 and variances E(Xn²) = σn². Then

∑_n Xn² < ∞ almost surely  ⟹  ∑_n σn² < ∞.

Proof. For x ∈ [0, 1] we have the inequality x/2 ≤ log(1 + x) ≤ x (all functions are equal at x = 0, now consider the derivatives). From this it follows that for any sequence (xn) of nonnegative numbers

∏_{n=1}^∞ (1 + xn) < ∞  ⟺  ∑_n xn < ∞.

Standard manipulations with the normal density yield that

E[exp(−Xn²)] = 1/√(1 + 2σn²).    (4.6)

Now assume that f := ∑_n Xn² < ∞ almost surely. Then

∏_{j=1}^n exp(−Xj²) = exp( −∑_{j=1}^n Xj² ) → g := e^{−f} > 0

almost surely and by the Dominated Convergence Theorem we have

E[ ∏_{j=1}^n exp(−Xj²) ] → E(g) > 0.

The factors are independent and so we can commute the expectation with the product. Using (4.6) we obtain

∏_{j=1}^n 1/√(1 + 2σj²) → E(g) > 0

and so

∏_{j=1}^n (1 + 2σj²) → E(g)⁻² < ∞.

As we have seen above this implies ∑_n σn² < ∞.

Recall that P is a centered Gaussian measure on H with mean µ and covariance operator Q ∈ B(H). Recall also that finitely many jointly normal random variables are independent if and only if they are uncorrelated. With this we can show

Theorem 4.2.2. The covariance operator Q is a trace class operator.


Proof. Let τµ be the translation operator x ∈ H ↦ x − µ ∈ H. Since Q is also the covariance operator of the centered Gaussian measure −µ + P = τµ(P) we may assume that P itself has mean zero.

Q is positive and we have already seen that Q is compact. Let (en) be an orthonormal basis of N(Q)⊥ consisting of eigenvectors of Q with associated eigenvalues (λn). For each n ≥ 0 let Λn(x) = (x, en), x ∈ H. Then the Λn are mean zero normal random variables on (H, P) with variance (Qen, en). For each n the random variables Λj, j < n, are jointly normal (4.2.1) with mean zero and are uncorrelated and hence independent. Moreover we have

∑_n |Λn(x)|² = ∑_n |(x, en)|² ≤ ‖x‖² < ∞,    ∀x ∈ H.

Using (4.2.2) it follows that ∑_n (Qen, en) = ∑_n λn < ∞ and this implies that Q is a trace class operator.

Proposition 4.2.3. If H is separable, then P is concentrated on N(Q)⊥.

Proof. If y ∈ N(Q), then Λy has variance zero and hence its distribution is the point measure at its mean zero, that is, Λy(P) = δ0. Thus P([Λy ≠ 0]) = 0, that is, P is concentrated on [Λy = 0] = y⊥. Since N(Q) is separable it follows that P is concentrated on N(Q)⊥.

Proposition 4.2.4. If H is separable then

∫_H ‖x‖² P(dx) = Tr(Q) < ∞.

Thus P has strong second order.

Proof. Let (en) be an ON-basis for N(Q)⊥ with Qen = λn en. Since P is concentrated on N(Q)⊥ we have

∫_H ‖x‖² P(dx) = ∫_{N(Q)⊥} ‖x‖² P(dx)
              = ∫_{N(Q)⊥} ∑_n |(x, en)|² P(dx)
              ≤ ∑_n EP(Λ²en) = ∑_n (Qen, en) = Tr(Q).

Proposition 4.2.5. The mean µ and covariance operator Q determine the Gaussian measure P uniquely.


Proof. Recall that P is uniquely determined by its values on cylinder sets, equivalently, by its images H(P) under all linear maps H : H → Rd into the finite dimensional spaces Rd, d ≥ 1. From (4.2.1) it follows that the mean and covariance operator determine the distribution H(P) uniquely.

We will usually be able to restrict attention to the case where the mean µ = 0. In this case we write P = NQ, read: P is the mean zero Gaussian measure with covariance operator Q. Bounded linear operators map Gaussian measures to Gaussian measures:

Proposition 4.2.6. Let F be another Hilbert space and U : H → F a bounded linear operator. Then the image U(P) of P is a Gaussian measure on F with mean U(µ) and covariance operator UQU∗.

Proof. Replacing P with −µ + P we may assume that P has mean µ = 0. Set N = U(P), let v, w ∈ F and set x = U∗v, y = U∗w ∈ H. Note that

Λv ∘ U = Λx and Λw ∘ U = Λy

and hence the distribution of Λv under N is the distribution of Λx under P, in particular normal. Moreover

EN[Λv Λw] = EP[(Λv ∘ U)(Λw ∘ U)] = EP[Λx Λy] = (Qx, y) = (UQU∗v, w).

Thus N is Gaussian on F with covariance operator UQU∗.

Theorem 4.2.3. For each positive trace class operator Q on H there exists a unique centered Gaussian measure P on H with covariance operator Q.

Remark. Let µ ∈ H. Then µ + P is the unique Gaussian measure on H with mean µ and covariance operator Q.

Proof. Write Q = ∑_n λn (en ⊗ en), where (en) is a sequence of eigenvectors of Q with associated eigenvalues λn and let (Zn) be a sequence of independent standard normal random variables on some probability space (Ω, F, µ).

Note that the Zn are pairwise orthogonal unit vectors in L2(µ). Thus the series ∑_n an Zn converges in L2(µ) for each sequence (an) ∈ l² (square summable). The sum is again a mean zero normal variable on (Ω, F, µ) since this property is preserved under L2-limits. Define Z : Ω → H as

Z = ∑_n √λn Zn en.

The series converges in H for each ω ∈ Ω satisfying ∑_n λn Zn²(ω) < ∞ and this is satisfied µ-almost surely. Indeed

∫_Ω ∑_n λn Zn² dµ = ∑_n λn E[Zn²] = ∑_n λn < ∞


which implies the finiteness of the integrand µ-almost surely. Let x ∈ H. Then

Λx ∘ Z = ∑_n √λn (x, en) Zn

is a mean zero normal variable (in particular measurable). Recall that the distribution of Λx under P is the distribution of Λx ∘ Z under µ. It follows that Z : (Ω, F) → (H, E) is measurable and the distribution P = Z(µ) (image measure) of Z on H is a Gaussian measure.

Let x, y ∈ H. Using the orthonormality of the sequence (Zn) ⊆ L2(µ) we have

EP[Λx Λy] = Eµ[(Λx ∘ Z)(Λy ∘ Z)] = ∑_n λn (x, en)(y, en) = (Qx, y).

Thus Q is the covariance operator of P .
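The series Z = ∑_n √λn Zn en in this proof is also a practical recipe for sampling from NQ once an eigenbasis of Q is available. The sketch below identifies H with l² via such a basis and checks the covariance and the trace identity empirically; the eigenvalue sequence and truncation level are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Coordinates with respect to an orthonormal eigenbasis (e_n) of Q: a vector of
# coordinates represents sum_n x_n e_n. Eigenvalues and truncation are hypothetical.
rng = np.random.default_rng(1)
N = 2000
lam = 1.0 / np.arange(1, N + 1) ** 2          # trace class: sum_n lam_n is finite

def sample(size):
    Z = rng.standard_normal((size, N))        # independent standard normals Z_n
    return np.sqrt(lam) * Z                   # coordinates of sum_n sqrt(lam_n) Z_n e_n

X = sample(5000)
# Empirical check: Var((X, e_n)) = lam_n and E ||X||^2 = Tr(Q) (Proposition 4.2.4).
print("empirical variances     :", X.var(axis=0)[:3])
print("theoretical eigenvalues :", lam[:3])
print("mean ||X||^2 vs Tr(Q)   :", (X ** 2).sum(axis=1).mean(), lam.sum())
```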

4.3 Cameron-Martin space

Let Q be a trace class operator on H and P = NQ the centered Gaussian measure with covariance operator Q on H. Let (en) be an orthonormal basis of N(Q)⊥ consisting of eigenvectors of Q with associated eigenvalues (λn). Then

Q = ∑_n λn (en ⊗ en).    (4.7)

Each continuous linear functional Λx on H is a normal variable and hence square integrable under P. We thus have a linear map

Λ : x ∈ H ↦ Λx ∈ L2(P).    (4.8)

This map is not an embedding in general (the null space is N(Q)). With the usual identification of an element x ∈ H with the linear functional Λx the L2-norm ‖x‖2 is defined for each element x ∈ H and in fact

‖x‖²2 = ‖Λx‖²2 = (Qx, x) ≤ ‖Q‖ ‖x‖².

Note that ‖ · ‖2 is not a norm on H, only a seminorm. We shall refer to it as the L2(P)-norm anyway. It is weaker than the norm on H.

Let H2 denote the closure of the subspace Λ(H) = {Λx : x ∈ H} in L2(P).

Proposition 4.3.1. If f ∈ H2 then f is a normal random variable under P with mean zero and variance ‖f‖²2.


Proof. This is true for f ∈ Λ(H) and follows for the remaining f since an L2-limit of normal variables is again a normal random variable.

For h ∈ H define

|h|P = sup{ |Λh(u)| : ‖u‖2 ≤ 1 }    (4.9)
     = sup{ |(h, u)| : (Qu, u) ≤ 1 }    (4.10)
     ≥ ‖Q‖^{−1/2} ‖h‖    (4.11)

and let

HP = { x ∈ H : |x|P < ∞ }.

The space HP is called the Cameron-Martin space of P. Note that HP consists exactly of those h ∈ H for which the functional Λh is continuous in the L2(P)-norm, in which case |h|P is the norm of the functional Λh relative to the L2(P)-norm on H.

Every linear functional continuous in the L2(P)-norm on H is continuous in the norm of H and hence of the form Λh, for some h ∈ H. Thus HP is the dual of a seminormed space and hence complete.

Proposition 4.3.2. For each h ∈ HP there exists a unique ĥ ∈ H2 such that

(x, h) = Λh(x) = (Λx, ĥ)2 = EP(Λx ĥ),    (4.12)

for all x ∈ H. The map J : h ∈ HP ↦ ĥ ∈ H2 is an isometric isomorphism of HP with H2.

Proof. If h ∈ HP then the functional Λh vanishes on N(Q) = N(Λ) and hence can be viewed as a linear functional on the subspace Λ(H) ⊆ L2(P). As such it is continuous and hence has a unique extension to a continuous linear functional on H2.

Consequently Λh can be identified with a unique element of H2, that is, there is a unique element ĥ ∈ H2 such that (4.12) holds. Clearly the map J : h ∈ HP ↦ ĥ ∈ H2 is linear and isometric.

Conversely, if f ∈ H2, then the functional x ∈ H ↦ EP(Λx f) is continuous in the L2(P)-norm on H and hence of the form Λh, for some h ∈ HP, and it follows that f = ĥ. Thus J is surjective.

Proposition 4.3.3. HP is the range of the positive square root Q^{1/2} and we have |h|P = ‖Q^{−1/2} h‖, where v = Q^{−1/2} h denotes the unique element v ∈ N(Q)⊥ with Q^{1/2} v = h.


Proof. Recall that for any sequence (an) of numbers we have

( ∑_{n=0}^∞ |an|² )^{1/2} = sup{ ∑_{n=0}^∞ an bn : ∑_{n=0}^∞ |bn|² ≤ 1 }    (4.13)
                        = sup{ ∑_{n=0}^N an bn : ∑_{n=0}^N |bn|² ≤ 1, N ≥ 1 }.

Let h ∈ HP and note that the definition of the norm |h|P can be rewritten as

|h|P = sup{ (h, u) : u ∈ N(Q)⊥, ‖Q^{1/2} u‖ ≤ 1 }.    (4.14)

Recall that h ∈ HP implies that h ∈ N(Q)⊥. For all h, u ∈ N(Q)⊥ we can write

h = ∑_n (h, en) en   and   u = ∑_n (u, en) en.

From (4.7) we have Q^{1/2} = ∑_n √λn (en ⊗ en) and with this we can rewrite (4.14) as

|h|P = sup{ ∑_n (h, en)(u, en) : ∑_n λn |(u, en)|² ≤ 1 }
     = sup{ ∑_n [λn^{−1/2} (h, en)][λn^{1/2} (u, en)] : ∑_n [λn^{1/2} (u, en)]² ≤ 1 }.

Using (4.13) this becomes

|h|²P = ∑_n λn⁻¹ |(h, en)|².

We have already seen in (2.11) that the set of h for which the right hand side is finite is the range of Q^{1/2} and for each such h the vector

v = ∑_n λn^{−1/2} (h, en) en

is the unique solution of Q^{1/2} v = h in the closed linear span of the en, that is, in N(Q)⊥. Obviously |h|P = ‖v‖.

4.4 Regression with Gaussian measures

Let F be any nonempty set. A function f : F → R is observed at finitely many points s1, . . . , sn ∈ F with values

yj = f(sj)    (4.15)

but is otherwise unknown. We want to estimate f. We assume that f is in some Hilbert space H of functions on F. A centered Gaussian measure P


defined on the σ-field E generated by the continuous linear functionals on H is placed on H to serve as a prior probability in a Bayesian computation. The measure P will be specified below. Let

Es : f ∈ H ↦ f(s) ∈ R    (4.16)

denote the evaluation functional at s ∈ F and rewrite the data (4.15) as

Ej(f) = yj,    (4.17)

where Ej is the evaluation functional at the point sj. The evaluation functionals Es, s ∈ F, must be continuous linear functionals on H and consequently H is a reproducing kernel Hilbert space. Let K : F × F → R be the kernel of H. Then

Es(f) = f(s) = (f, Ks), for all f ∈ H.

The Ks ∈ H are the kernel functions Ks(t) = K(s, t). Fix an orthonormal basis {ψj : j ≥ 1} of H.

The evaluation functionals (4.16) are Gaussian random variables on the probability space (H, E, P). With this the regressor f̂ is computed as the conditional mean of P given Ej = yj for j = 1, . . . , n.

In analogy to the finite dimensional case the mean of P is the H-valued integral

∫_H f P(df),    (4.18)

that is, the expectation EP[I], where the "random variable" I : (H, E, P) → H is the identity I(f) = f. This abstract description merely serves to motivate the procedures below. The vector valued integral (4.18) commutes with all continuous linear functionals Λ on H, that is,

Λ(EP[I]) = EP(Λ ∘ I) = ∫_H Λ(f) P(df)

and the same holds true if the ordinary expectation is replaced with a conditional expectation. The regressor f̂ is the conditional expectation

f̂ = EP[I | data] = EP[ I | Ej = yj, j ≤ n ]    (4.19)

and so we have

Λ(f̂) = EP[ Λ | Ej = yj, j ≤ n ]    (4.20)


for each continuous linear functional Λ on H (note that Λ ∘ I = Λ). Thus rather than computing the regressor f̂ globally as in (4.19) we compute Λ(f̂) for enough continuous linear functionals Λ on H to obtain a good view of f̂.

If Λ = Es then Λ(f̂) = f̂(s) is our prediction for the value of f at the point s in light of the data (4.17). However we will be more interested in an expansion

f = ∑_k Ak(f) ψk

with coefficients Ak(f) = (f, ψk) in the basis {ψk} of H. The functionals Λ of interest are then the coefficient functionals Ak. The regressor f̂ is expanded as

f̂ = ∑_k ak ψk    (4.21)

with coefficients

ak = EP[ Ak | Ej = yj, j ≤ n ]    (4.22)

conditioned on the data. The computation of ak involves only the finite dimensional distribution of the random vector

W = (E1, . . . , En, Ak)    (4.23)

on Rn+1 under the probability P. Indeed, since P is a centered Gaussian measure, W is a multinormal random vector with mean zero and thus is completely determined by its covariance matrix C.

C cannot yet be computed since the measure P has not yet been specified. We do this now. Choose any sequence (σj) of positive numbers such that

∑_j σj < ∞

and let Q : H → H be the unique bounded linear operator satisfying

Qψj = σjψj , for all j ≥ 1.

Then Q is a positive trace class operator on H. Let P = N(0, Q) be the Gaussian measure with mean zero and covariance operator Q on H. With this

Cov(Λf, Λg) = (Qf, g)    (4.24)

(inner product in H), for all functions f, g ∈ H. Here Λf denotes the continuous linear functional Λf(h) = (h, f). By the reproducing property of H the evaluation functionals Es are the functionals ΛKs and it follows that

Cov(Es, Et) = (QKs,Kt). (4.25)


Expand Ks in terms of basis functions ψj as

Ks = ∑_j (Ks, ψj) ψj = ∑_j ψj(s) ψj

and note that this implies

QKs = ∑_j σj ψj(s) ψj.

With this we can rewrite (4.25) as

Cov(Es, Et) = ∑_j σj ψj(s) ψj(t),    (4.26)

prompting us to define a new kernel L(s, t) as

L(s, t) = ∑_j σj ψj(s) ψj(t),    s, t ∈ F.

L is a symmetric, positive definite kernel on F but it is not the kernel K, which has the expansion

K(s, t) = ∑_j ψj(s) ψj(t),

see Proposition (3.17). With this

Cov(Ei, Ej) = L(si, sj) (4.27)

Observing that the coefficient functionals Ak are the inner products with ψk we can compute the remaining covariances of W as

Var(Ak) = (Qψk, ψk) = σk
Cov(Ak, Ej) = (Qψk, Ksj) = σk (ψk, Ksj) = σk ψk(sj).

In short the upper half of the (symmetric) covariance matrix has the form

C =
    [ L(s1, s1)  L(s1, s2)  ...  L(s1, sn)  σk ψk(s1) ]
    [            L(s2, s2)  ...  L(s2, sn)  σk ψk(s2) ]
    [                       ...                       ]
    [                            L(sn, sn)  σk ψk(sn) ]
    [                                        σk       ]    (4.28)

Now use the Cholesky factorization C = RR′ to condition Ak on the Ej. See Appendix B.


Recall that the lower triangular Cholesky root R is computed from the top left corner down. The last row and column of C only come into play for the last row of R. If we delete the last row and column of R we obtain the Cholesky root of the L(si, sj) submatrix.

In other words we have to compute the Cholesky root of the L(si, sj) submatrix only once and can then reuse it for every coefficient Ak. This is more efficient than matrix inversion, which is commonly used for conditioning a multinormal vector, but it is of the same order O(n³) of complexity.

More specifically set R = (Rij), 1 ≤ j ≤ i ≤ n+1, and note that the random vector (4.23) has mean zero and so can be written as

E1 = R11Z1

E2 = R21Z1 +R22Z2

. . .

En = Rn1Z1 +Rn2Z2 + · · · +RnnZn

Ak = Rn+1,1Z1 +Rn+1,2Z2 + · · ·+Rn+1,nZn +Rn+1,n+1Zn+1

where the Zj are independent standard normal variables and equality holds in distribution. We are given that Ej = Ej(f) = yj, j ≤ n. Substitute yj in place of Ej in the first n equations, solve these for Z1, . . . , Zn and enter the solutions Z1, . . . , Zn into the last equation to obtain

Ak = (Rn+1,1Z1 + · · · +Rn+1,nZn) +Rn+1,n+1Zn+1.

It follows that the conditional distribution of Ak given Ej = yj is normal with mean and variance

E(Ak) = Rn+1,1 Z1 + · · · + Rn+1,n Zn    (4.29)
Var(Ak) = R²n+1,n+1.    (4.30)
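The following is a minimal numerical sketch of the procedure (4.23)-(4.30). The basis functions ψk, the prior variances σk and the data points are illustrative assumptions, not taken from the text; the Cholesky root of the L(si, sj) block is computed once and reused for every coefficient Ak, as described above.

```python
import numpy as np

def psi(k, s):
    # Hypothetical basis functions on [0,1] (cosines, L2-orthonormal).
    return np.sqrt(2.0) * np.cos(np.pi * k * s)

s_obs = np.array([0.12, 0.37, 0.58, 0.91])       # observation points s_1, ..., s_n
y = np.array([0.8, -0.2, 0.4, 1.1])              # observed values y_j = f(s_j)
n = s_obs.size
N = 30                                           # number of coefficients A_k computed
sigma = 1.0 / np.arange(1, N + 1) ** 2           # prior variances, sum finite

Psi = np.stack([psi(k, s_obs) for k in range(1, N + 1)])   # N x n matrix psi_k(s_j)
L = (sigma[:, None] * Psi).T @ Psi               # L(s_i, s_j) = sum_k sigma_k psi_k(s_i) psi_k(s_j)

# Cholesky root of the L(s_i, s_j) block, computed once and reused for every A_k.
R = np.linalg.cholesky(L)                        # lower triangular, L = R R'
Z = np.linalg.solve(R, y)                        # Z_1, ..., Z_n from the first n equations

coef_mean = np.empty(N)
coef_var = np.empty(N)
for k in range(N):
    cov_AE = sigma[k] * Psi[k]                   # Cov(A_k, E_j) = sigma_k psi_k(s_j)
    r = np.linalg.solve(R, cov_AE)               # first n entries of the last row of the extended root
    coef_mean[k] = r @ Z                         # conditional mean (4.29)
    coef_var[k] = sigma[k] - r @ r               # conditional variance (4.30) = R_{n+1,n+1}^2

f_hat = lambda s: sum(coef_mean[k] * psi(k + 1, s) for k in range(N))
print(f_hat(np.array([0.25, 0.5, 0.75])))
```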

Example 4.4.1 (Finite dimensional case). In practice the following approach is reasonable: we have functions ψj to be used as a basis for expansion and an a priori bound N such that all expansions

f = ∑_j Aj(f) ψj

will be cut off after the N-th term. In this case we can work with the finite dimensional reproducing kernel Hilbert space H with kernel

K(s, t) = ∑_{j=1}^N ψj(s) ψj(t)    (4.31)


in great generality: F can be any nonempty set and ψ1, . . . , ψN any system of linearly independent functions on F. In particular we have λj = 1. The Gaussian measure P on H is introduced with σj = 1. In this case we have L(s, t) = K(s, t) and σk = 1 in the covariance matrix (4.28) above.

The ψj are then automatically an orthonormal basis of the reproducing kernel Hilbert space H = span(ψ1, . . . , ψN). See Example 3.3.1. This does not mean that all kernels (4.31) are equally useful. If F is a closed subset of Rd we will still prefer functions ψj which are orthogonal in L2(F).

Note that the Cholesky factorization needs the matrix (4.28) to be positive definite (rather than only positive semidefinite). From this it follows that we must have N ≥ n, where n is the number of data points. In fact let L be the L(si, sj) submatrix of (4.28) and u = (u1, . . . , un) ∈ Rn. Then

(Lu, u) = u L u′ = ∑_{k=1}^N |∑_{i=1}^n ui ψk(si)|².

Now if N < n, then the system of equations

∑_{i=1}^n ui ψk(si) = 0,    k = 1, . . . , N

has a nontrivial solution u (more variables ui than equations) and the submatrix L of (4.28) is not positive definite.
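A tiny numerical illustration of this point (the monomial basis ψk(s) = s^(k−1) and the data points are hypothetical choices, with σk = 1 as in the example): for N < n the matrix L(si, sj) is rank deficient and the Cholesky factorization would fail.

```python
import numpy as np

def psi(k, s):
    # Hypothetical linearly independent functions (monomials), sigma_k = 1.
    return s ** (k - 1)

s_obs = np.linspace(0.1, 0.9, 5)                               # n = 5 data points
for N in (3, 5):                                               # N < n versus N = n
    Psi = np.stack([psi(k, s_obs) for k in range(1, N + 1)])   # N x n
    L = Psi.T @ Psi                                            # L(s_i, s_j) with sigma_k = 1
    print("N =", N, " rank of L =", np.linalg.matrix_rank(L), " (n =", len(s_obs), ")")
```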

4.4.1 Model choices

The procedure outlined above relies on three ingredients:

1. A reproducing kernel Hilbert space H on the set F characterized by its positive semidefinite kernel K : F × F → R.

2. An orthonormal basis {ψj : j ≥ 1} in H.

3. A sequence of positive numbers σj with ∑_j σj < ∞.

The basic ingredient is the expansion

f = ∑_j Aj(f) ψj   with   Aj(f) = (f, ψj)H    (4.32)

of functions in H in terms of basis functions ψj. Let us now reflect on the significance of the numbers σj. Recall that the covariance operator Q of our probability P on H satisfies Qψj = σj ψj. Fix n ≥ 1 and consider the random vector

W = (A1, A2, . . . , An)


on the probability space (H, E, P). Since P is Gaussian this vector is multinormal. Moreover the covariances are given as

Cov(Ai, Aj) = (Qψi, ψj)H = σi if i = j, and 0 if i ≠ j.

In short, the coefficient functionals are independent normal variables with variances Var(Aj) = σj. Prior knowledge about the Aj (if any) can now be used to specify the σj. At the very least we should reflect upon the speed of convergence σj → 0. The faster this convergence, the more narrowly P is concentrated on the first few basis functions.

Next we should think about the Hilbert space H and the basis functions ψj. There are various scenarios:

(I) F = {sj} is countable (or even finite) and the kernel K and so the Hilbert space H are given. In this case we might apply the Gram-Schmidt orthonormalization procedure to the kernel functions Ksj to obtain the basis functions ψj.

(II) F is a closed subset of Rd and we already have basis functions φj which are orthonormal in L2(F) and which we would like to use for expansion. Now we construct a kernel K which defines H so that the functions

ψj = √λj φj

are an orthonormal basis for H. In this case we choose a sequence of numbers λj > 0 with ∑_j λj < ∞ and set

K(s, t) = ∑_j λj φj(s) φj(t),    s, t ∈ F.    (4.33)

Let us assume that the basis functions φj ∈ L2(F) are continuous, that the λj are chosen so that we have pointwise convergence at all points s, t ∈ F and the sum K(s, t) is continuous and bounded on F. Set s = t in (4.33) and integrate term by term to obtain

∫_F K(s, s) ds = ∑_j λj < ∞.

We are now in the situation of section 3.8. The functions ζj(s, t) = φj(s) φj(t) are orthonormal in L2(F × F). Consequently the series (4.33) converges in the norm of L2(F × F) and so K ∈ L2(F × F) is a square integrable kernel. We have

Ks = ∑_j λj φj(s) φj


and so the associated integral operator T = TK satisfies

(Tφj)(s) = (φj, Ks)2 = λj φj(s),

that is, Tφj = λj φj. Let now H be the reproducing kernel Hilbert space with kernel K. Then the functions

ψj = √λj φj

are an orthonormal basis for H.

Here the significance of the λj is as follows: the reproducing kernel Hilbert space H is the range of the square root T^{1/2} of T. Note that T^{1/2} is the integral operator with square integrable kernel

H(s, t) = ∑_j √λj φj(s) φj(t),    s, t ∈ F.

The series converges in L2(F × F) but no other convergence is known in general. Hence, even if the basis functions φj are smooth we don't even know whether H is continuous. However if the λj → 0 fast enough then smoothness properties of H can be inferred. The faster the λj → 0, the smoother H will be known to be. The smoothness of the kernel H is inherited by the functions g ∈ H which have the form g = TH f, for some f ∈ L2(F).

Basis functions. If F = [−1,+1], then the Legendre polynomials (Appendix C) Pn(s) are an orthonormal basis of L2(F). These polynomials follow a recursion from which they can be computed quite efficiently.

If F = [−1,+1] × [−1,+1], then the polynomials Pn(s)Pm(t), n, m ≥ 0, are an orthonormal basis in L2(F) and this procedure can be iterated for higher dimensional rectangles F.
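A short sketch of this construction, assuming numpy's Legendre routines: the classical Pn are only orthogonal (the integral of Pn² over [−1,1] is 2/(2n+1)), so they are rescaled by √((2n+1)/2) to obtain an orthonormal system, and products then give an orthonormal basis on the square. The quadrature check at the end is an illustration, not part of the text.

```python
import numpy as np
from numpy.polynomial import legendre

# Orthonormal Legendre basis on [-1, 1]: rescale the classical P_n by sqrt((2n+1)/2).
def legendre_basis(n, s):
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0
    return np.sqrt((2 * n + 1) / 2.0) * legendre.legval(s, coeffs)

def tensor_basis(n, m, s, t):
    # Products P_n(s) P_m(t) give an orthonormal basis of L^2 on the square.
    return legendre_basis(n, s) * legendre_basis(m, t)

# Orthonormality check by Gauss-Legendre quadrature (exact for these degrees).
nodes, weights = legendre.leggauss(50)
G = np.array([[np.sum(weights * legendre_basis(i, nodes) * legendre_basis(j, nodes))
               for j in range(5)] for i in range(5)])
print(np.round(G, 6))                       # approximately the 5 x 5 identity matrix
print(tensor_basis(2, 3, 0.1, -0.4))        # a sample tensor-product basis value
```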

(III) F is a closed subset of Rd and we have a kernel K that we have to work with. In this case we have to come up with the eigenfunctions φj and eigenvalues λj to be able to proceed as in (II). In general an analytic solution is a tough proposition.

In actual computation everything is discretized, functions become vectors, integral operators become matrices and the computation of eigenvalues and eigenvectors is feasible if the problem is small enough.


Chapter 5

Square Integrable Processes

Most of this chapter follows [JB02] with some simplifications and [RA].

5.1 Integrable processes

Let F be a nonempty set, (Ω, F, P) a probability space and H a Hilbert space. A random field X on F is a collection of random variables

Xt : (Ω, F, P) → R,    t ∈ F,

all defined on the same probability space (Ω, F, P) which will now be fixed. We will also refer to X as a stochastic process on F. The process X is said to be integrable of order p > 0 if

E(|Xt|^p) < ∞, for all t ∈ F.

X is called integrable if it is integrable of order p = 1 and square integrable if it is integrable of order p = 2. If X is integrable then the mean function m : F → R is defined as

m(t) = E(Xt), t ∈ F.

If X is square integrable the covariance kernel K : F × F → R is defined as

K(s, t) = Cov(Xs, Xt) = E(Xs Xt) − m(s) m(t),    s, t ∈ F.

For each finite subset S ⊆ F and coefficients a : S → R we have

∑_{s,t∈S} a(s) a(t) K(s, t) = Var( ∑_{s∈S} a(s) Xs ) ≥ 0


and it follows that K is a symmetric, positive semidefinite kernel on F .

Loeve isomorphism. Let X be a square integrable process on F with mean function m(t) = 0, K the covariance kernel of X and H the reproducing kernel Hilbert space with kernel K. Since X has second order we have Xt ∈ L2(Ω, F, P), for all t ∈ F. Let now

L2X = span(Xt : t ∈ F) ⊆ L2(Ω, F, P)

be the closed linear span of the Xt, t ∈ F, in the Hilbert space L2(Ω, F, P). We will now see that L2X is isomorphic to the reproducing kernel Hilbert space H and exhibit an isometric isomorphism.

Let W ⊆ L2(Ω, F, P) be the linear span of the Xt, t ∈ F. Since the Xt all have mean zero by assumption, we have K(s, t) = E(Xs Xt), for all s, t ∈ F, and so

‖ ∑_{s∈S} a(s) Xs ‖²2 = ∑_{s,t∈S} a(s) a(t) K(s, t) = ‖ ∑_{s∈S} a(s) Ks ‖²K,    (5.1)

for all finite subsets S ⊆ F and coefficients a : S → R. This means that the linear map

J : ∑_{s∈S} a(s) Xs ∈ W ↦ ∑_{s∈S} a(s) Ks ∈ H

is well defined and isometric. Consequently J extends uniquely to an isometry J : L2X → H. The range of J is dense and closed and hence J is onto and consequently an isometric isomorphism of L2X with H. Note that J(Xt) = Kt and that J preserves inner products. Using the reproducing property of the kernel functions in H we obtain

[J(Y )](t) = (J(Y ),Kt)K = (Y,Xt)L2 = E(Y Xt),

for all Y ∈ L2X and all t ∈ F. The isomorphism J is called the Loeve isomorphism.

5.2 Processes with sample paths in an RKHS

Let X be a random field on the set F with underlying probability space (Ω, F, P) which is assumed to be complete. For each ω ∈ Ω the sample path X(ω) ∈ RF is the function

t ∈ F ↦ Xt(ω) ∈ R.

We should think of this as a random surface defined over the set F. Assume now that R : F × F → R is any symmetric, positive semidefinite kernel on


F and let HR be the reproducing kernel Hilbert space with kernel R. Recall that HR ⊆ RF is a Hilbert space of functions on the set F.

The main question to be studied below will be under which conditions on the covariance kernel K of X the process X has sample paths X(ω), ω ∈ Ω, which are in HR almost surely.

For now let us assume that this is the case. Then we can change X on a null set such that X(ω) ∈ HR, for all ω ∈ Ω. With this we can view X as a map

X : ω ∈ Ω ↦ X(ω) ∈ HR

from (Ω, F, P) to HR. We now claim that this map is automatically measurable for the σ-field E generated by the cylinders, equivalently, by the bounded linear functionals on HR.

Let H be the set of all f ∈ HR such that the function Λf ∘ X : Ω → R is measurable on (Ω, F, P). Here Λf is the bounded linear functional Λf(·) = (·, f)R on HR as usual.

Since linear combinations and pointwise limits of measurable functions are again measurable, H is a closed subspace of HR. H contains all the kernel functions Rt, t ∈ F. By the reproducing property ΛRt is the evaluation functional at t and so

ΛRt ∘ X = Xt

is measurable. It follows that H = HR. In other words Λ ∘ X is measurable, for each bounded linear functional Λ on HR, and hence

X : (Ω,F , P ) → (HR, E)

is measurable. This is one of the reasons we work with reproducing kernel Hilbert spaces HR here. The only other space H for which X is automatically measurable with respect to the σ-field E(H) on H is the product space H = RF, which is not very nice as a topological vector space.

With the measurability of X established, the distribution PX of X on the space HR is a well defined probability measure on the σ-field E = E(HR). According to the transformation formula (4.1) we have

EPX[f] = EP[f ∘ X],

for all PX-integrable functions f : HR → R. If PX has weak first order then E(X) denotes the mean of the distribution PX. By definition it is the unique vector µ ∈ HR which satisfies

Λ(µ) = EPX(Λ) = EP(Λ ∘ X),    (5.2)

for all bounded linear functionals Λ on HR.


Proposition 5.2.1. Assume that the process $X$ has sample paths in the reproducing kernel Hilbert space $H_R$ with kernel $R$ and let $P_X$ be the distribution of $X$ on $H_R$.
(i) If $P_X$ has weak first order, then the process $X$ is integrable and the mean $E(X) \in H_R$ of the distribution $P_X$ is the mean function $m(t)$ of $X$. In particular $m \in H_R$.
(ii) If $P_X$ has weak second order, then $X$ is square integrable, the kernel $R$ dominates the covariance kernel $K$ of $X$ and the covariance operator $\Theta$ of $P_X$ is the domination operator $L : H_R \to H_K$.

Remark. The converse in (ii) is true under the additional assumption that the mean function $m(t)$ of the process $X$ is an element of $H_R$. See (5.2.2) below.

Proof. (i) Assume that $P_X$ has weak first order and let $t \in F$. The evaluation functional $e_t$ at the point $t$ is a bounded linear functional on $H_R$ which satisfies $X_t = e_t \circ X$ and consequently
$$
E_P(|X_t|) = E_{P_X}(|e_t|) < \infty.
$$
Thus the process $X$ is integrable. Let $\mu = E(X)$ be the mean of $P_X$ and apply (5.2) to the evaluation functional $e_t$ to obtain
$$
\mu(t) = E_P[e_t \circ X] = E_P(X_t) = m(t),
$$
for all $t \in F$.
(ii) Assume that $P_X$ has weak second order and let $t \in F$. Then
$$
E_P(X_t^2) = E_{P_X}(e_t^2) < \infty.
$$
Thus $X$ is square integrable. For $s,t \in F$ the functionals $\Lambda_{R_s}$, $\Lambda_{R_t}$ on $H_R$ are the evaluation functionals $e_s$, $e_t$ at the points $s$, $t$ and the definition of the covariance operator $\Theta$ of $P_X$ now yields
$$
\begin{aligned}
(\Theta R_s, R_t)_R &= E_{P_X}[e_s e_t] - E_{P_X}[e_s]\,E_{P_X}[e_t] \\
&= E_P[(e_s \circ X)(e_t \circ X)] - E_P[e_s \circ X]\,E_P[e_t \circ X] \\
&= E_P[X_s X_t] - E_P(X_s)E_P(X_t) = K(s,t).
\end{aligned}
$$
According to (3.26) this implies that $R$ dominates $K$ and that $\Theta$ is the domination operator.

The following example, taken from [JB02], shows that $P_X$ can fail to have weak first order even if the process $X$ is integrable of all orders $p > 0$. The process $X$ can have all its sample paths in the reproducing kernel Hilbert


space $H_R$ and yet the mean function $m(t) = E(X_t)$ need not be in $H_R$. Likewise the kernel $R$ need not dominate the covariance kernel $K$ of $X$.

This is of interest for us since we will investigate to which extent kernel domination $R \gg K$ is related to the existence of a version of $X$ with sample paths in $H_R$.

The example will also show that the domination $R \gg K$ can fail if $P_X$ has only weak first order, that is, if the assumption of weak second order in (ii) is dropped.

Example 5.2.1. Let $F$ be the set of positive integers and $R : F \times F \to \mathbb{R}$ the kernel
$$
R(s,t) = \begin{cases} 1/s^2, & s = t \\ 0, & s \neq t. \end{cases}
$$
Then $R$ is a symmetric, positive semidefinite kernel on $F$. For $s \in F$ let $e_s$ denote the standard unit vector
$$
e_s = (0, \ldots, 0, 1, 0, \ldots) \in \mathbb{R}^F
$$
with zero in all coordinates except coordinate $s$. Then the kernel functions $R_s$ are given as $R_s = s^{-2}e_s$. Consider a finite sequence $f = (f_1, f_2, \ldots, f_N, 0, \ldots) \in H_R$. Then
$$
f = \sum_{j=1}^N j^2 f_j R_j
$$
and consequently
$$
\|f\|_R^2 = \sum_{i,j=1}^N i^2 j^2 f_i f_j R(i,j) = \sum_{j=1}^N j^2 f_j^2.
$$
From this it follows that the reproducing kernel Hilbert space $H_R$ is the space
$$
H_R = \Big\{ f = (f_j) \in \mathbb{R}^F : \sum\nolimits_j j^2 f_j^2 < \infty \Big\}
$$
endowed with the inner product
$$
(f,g)_R = \sum\nolimits_j j^2 f_j g_j, \qquad f = (f_j),\ g = (g_j) \in H_R.
$$
The probability space $(\Omega,\mathcal{F},P)$ is defined as follows: $\Omega = F$ is the set of positive integers, $\mathcal{F}$ the $\sigma$-field of all subsets of $\Omega$ and the probability $P$ is specified on atoms as
$$
P(\{i\}) = q_i,
$$
where $q = (q_i)$ is a sequence of positive numbers summing to one. This sequence will be chosen below to achieve desired effects. The process $X$ with


index set $F$ is defined to be the sequence $X = (X_i)$ of random variables on $(\Omega,\mathcal{F},P)$ defined as
$$
X_i(t) = \begin{cases} 1, & t = i \\ 0, & t \neq i. \end{cases}
$$
For each $i \in \Omega$ the sample path $X(i) : t \in F \mapsto X_i(t) \in \mathbb{R}$ is simply the standard vector $X(i) = e_i$. Consequently the process $X$ has sample paths in the reproducing kernel Hilbert space $H_R$. The distribution $P_X$ of $X$ on $H_R$ is concentrated on the countable subset $\{e_i : i \geq 1\} \subseteq H_R$ with probabilities
$$
P_X(\{e_i\}) = P(\{i\}) = q_i.
$$
Let $s,t \in F$ and note that $|X_t|^p = X_t$ and $X_s X_t = X_s$, if $s = t$, while $X_s X_t = 0$, if $s \neq t$. From this it follows that $E(|X_t|^p) = q_t < \infty$, for all $p > 0$. The mean function $m$ of $X$ is the function
$$
m(t) = q_t, \qquad t \in F.
$$
The covariance kernel $K$ of $X$ is given by
$$
K(s,t) = \begin{cases} q_s - q_s^2, & s = t \\ -q_s q_t, & s \neq t. \end{cases}
$$
Now if $R$ dominates $K$ we must have $K(t,t) \leq C^2 R(t,t)$, $t \in F$, for some constant $C$, that is,
$$
q_t - q_t^2 \leq C^2/t^2, \qquad t \in F. \qquad (5.3)
$$

Let $\Lambda$ be any bounded linear functional on $H_R$ and $f \in H_R$ such that $\Lambda(\cdot) = (\cdot, f)_R$. Recall that $P_X$ is concentrated on the set $\{e_i : i \geq 1\} \subseteq H_R$ and that $e_i = i^2 R_i$, and so $(h, e_i)_R = i^2 h(i) = i^2 h_i$, for all $h \in H_R$. Thus we have
$$
\int_{H_R} |\Lambda(h)|^p\,P_X(dh) = \int_{H_R} |(h,f)_R|^p\,P_X(dh) = \sum\nolimits_i q_i |(e_i, f)_R|^p = \sum\nolimits_i q_i\, i^{2p} |f_i|^p,
$$
for all $p > 0$. To ensure that $P_X$ has weak order $p$ we have to choose the $q_i$ so that this quantity is finite for all $f \in H_R$, that is, for all sequences $f = (f_i)$ satisfying
$$
\sum\nolimits_i i^2 |f_i|^2 < \infty.
$$
(A) Set $q_i = i^{-3/2}$. Then the mean function $m(t)$ is not in $H_R$ and consequently $P_X$ cannot have weak first order (5.2.1 (i)) even though the process


$X$ is integrable of order $p$, for all $p > 0$. The condition (5.3) fails also and thus $R$ does not dominate $K$ even though the process $X$ has sample paths in $H_R$.
(B) Now set $q_i = i^{-5/3}$. Then (5.3) fails again and so $R$ does not dominate $K$. Thus $P_X$ cannot have weak second order (5.2.1 (ii)). We claim that $P_X$ does have weak first order. To see this we must show that
$$
\sum\nolimits_i q_i\, i^2 |f_i| = \sum\nolimits_i i^{1/3} |f_i| < \infty,
$$
for all sequences $f = (f_i)$ satisfying $\sum_i i^2 |f_i|^2 < \infty$. This follows from the Cauchy-Schwarz inequality if we write $i^{1/3}|f_i| = i^{-2/3}(i|f_i|)$.
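The two claims in (A) can also be checked numerically: with $q_i = i^{-3/2}$ the partial sums of $\sum_i i^2 m(i)^2$ grow without bound (so $m \notin H_R$), and the ratio $t^2(q_t - q_t^2)$ appearing in (5.3) is unbounded. A small sketch follows; the truncation level is an arbitrary illustrative choice.

```python
import numpy as np

# Hedged numerical illustration of Example 5.2.1(A) with q_i = i**(-3/2).
N = 10**6                                  # truncation level (assumed)
i = np.arange(1, N + 1, dtype=float)
q = i ** (-1.5)

# m is in H_R iff sum_i i^2 m(i)^2 < infinity; here the partial sums grow
# like log(N), so m is not in H_R.
print((i**2 * q**2).sum())

# (5.3) requires t^2 (q_t - q_t^2) to stay bounded; here it grows like
# sqrt(t), so R does not dominate K.
print((i**2 * (q - q**2)).max())
```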

We now establish the converse of (ii) in 5.2.1 under the additional assumption that the mean function $m(t)$ of the process $X$ satisfies $m \in H_R$.

Proposition 5.2.2. Assume that the integrable process $X$ has sample paths in the reproducing kernel Hilbert space $H_R$ with kernel $R$ and that the mean function $m(t) = E(X_t)$ satisfies $m \in H_R$.

Then the distribution $P_X$ of $X$ on $H_R$ has weak second order if and only if the kernel $R$ dominates the covariance kernel $K$ of $X$.

Proof. For $f \in H_R$ let $\Lambda_f$ be the bounded linear functional $\Lambda_f(\cdot) = (\cdot, f)_R$ as usual. If $S \subseteq F$ is a finite subset, $a : S \to \mathbb{R}$ any coefficients and $f = \sum_{s\in S} a(s) R_s$, then
$$
\Lambda_f = \sum_{s\in S} a(s)\, e_s, \qquad (5.4)
$$
where $e_s : h \in H_R \mapsto h(s) \in \mathbb{R}$ is the evaluation functional at the point $s \in F$.

We have already seen that $R$ dominates $K$ if $P_X$ has weak second order (5.2.1 (ii)). Let us now prove the converse. Assume that $R$ dominates $K$ and let $C$ be a constant satisfying
$$
\sum_{s,t\in S} a(s)a(t)K(s,t) \leq C^2 \sum_{s,t\in S} a(s)a(t)R(s,t),
$$
for all finite subsets $S \subseteq F$ and coefficients $a : S \to \mathbb{R}$ (3.6.2). We must show that the distribution $P_X$ on $H_R$ has weak second order, that is, we have
$$
\Lambda_g \in L^2(P_X), \quad \text{for all } g \in H_R.
$$
The process $Y_t = X_t - m(t)$ has sample paths in $H_R$ and distribution $P_Y = -m + P_X$ with mean zero. Replacing $X$ with the process $Y$ we may assume that $E(X_t) = 0$ and hence $K(s,t) = E(X_s X_t)$, for all $s,t \in F$.


Let now $V$ be the linear span of the kernel functions $R_s$ in $H_R$ and let $f \in V$. Then $f = \sum_{s\in S} a(s)R_s$ for some finite subset $S \subseteq F$ and coefficients $a : S \to \mathbb{R}$. Using (5.4) we have
$$
\Lambda_f \circ X = \sum_{s\in S} a(s) X_s
$$
and it follows that
$$
\begin{aligned}
\|\Lambda_f\|_{L^2(P_X)}^2 &= E_P\big(|\Lambda_f \circ X|^2\big) = E_P\Big(\sum_{s,t\in S} a(s)a(t)X_s X_t\Big) = \sum_{s,t\in S} a(s)a(t)K(s,t) \\
&\leq C^2 \sum_{s,t\in S} a(s)a(t)R(s,t) = C^2\,\|f\|_R^2 < \infty.
\end{aligned}
$$
This shows that the linear map $J_0 : f \in V \mapsto \Lambda_f$ maps into $L^2(P_X)$ and is continuous. Consequently $J_0$ extends to a unique bounded linear map $J : H_R \to L^2(P_X)$ and it will thus suffice to show that $Jg = \Lambda_g$, for all $g \in H_R$.

Let $g \in H_R$ and choose a sequence $(f_n) \subseteq V$ such that $\|f_n - g\|_R \to 0$, as $n \uparrow \infty$. Then $\Lambda_{f_n} \to \Lambda_g$ in norm and hence pointwise on $H_R$.

On the other hand we have $\Lambda_{f_n} = J(f_n) \to Jg$ in $L^2(P_X)$ and thus a subsequence converges pointwise $P_X$-almost surely on $H_R$. It follows that $\Lambda_g = Jg \in L^2(P_X)$.

Recall that $F$ is a set, $R$ a symmetric, positive semidefinite kernel on $F$, $H_R$ the reproducing kernel Hilbert space with kernel $R$ and $X = (X_t)_{t\in F}$ a stochastic process on $F$, based on the probability space $(\Omega,\mathcal{F},P)$, that is, $X_t : (\Omega,\mathcal{F},P) \to \mathbb{R}$ is a random variable, for each $t \in F$. A stochastic process $Y$ on $F$ based on the same probability space $(\Omega,\mathcal{F},P)$ is called a version of $X$ if we have
$$
X_t = Y_t, \quad P\text{-almost surely, for each } t \in F.
$$

Theorem 5.2.1. Assume that $X$ is a square integrable process with mean function $m \in H_R$ and covariance kernel $K$ on $F$. If $R \gg K$, then there is a version $Y$ of $X$ such that all sample paths of $Y$ are in the reproducing kernel Hilbert space $H_R$ and the distribution $P_Y$ of $Y$ on $H_R$ has strong second order.

Proof. Replacing the process $X$ with $X_t - m(t)$ we may assume that the mean function is $m(t) = E(X_t) = 0$, $t \in F$. It follows that the covariance


kernel $K$ of $X$ satisfies
$$
K(s,t) = E(X_s X_t), \qquad s,t \in F.
$$
According to 3.6.4 there exists a symmetric, positive semidefinite kernel $Q$ on $F$ such that $R \gg Q \gg K$ and the reproducing kernel Hilbert space $H_Q$ is separable. Replacing $R$ with $Q$ we may assume that $H_R$ is separable.

Then there is a countable subset $T = \{s_n : n \geq 1\} \subseteq F$ such that the kernel functions $R_s$, $s \in T$, are linearly independent and the span
$$
V_T = \operatorname{span}(R_s : s \in T) \subseteq H_R
$$
is dense in $H_R$. See 3.5.6. Let $H_T$ be the reproducing kernel Hilbert space on the set $T$ with kernel $R$ restricted to $T$. We claim that the sample paths $X(\omega) = (X_s(\omega))_{s\in F}$ satisfy
$$
X(\omega)|_T \in H_T,
$$
for almost every $\omega \in \Omega$. Let $\mathcal{F}$ denote the family of all finite subsets $S \subseteq T$ and let $S \in \mathcal{F}$. Since $S$ is finite and the kernel functions $R_s$, $s \in S$, are linearly independent, it follows that $X(\omega)|_S \in H_S$, see (3.5.4). It will thus suffice to show that

$$
\sup_{S \in \mathcal{F}} \|X(\omega)|_S\|_R < \infty,
$$
for almost every $\omega \in \Omega$. If $S_1, S_2 \in \mathcal{F}$ with $S_1 \subseteq S_2$ then the restriction map $I : f \in H_{S_2} \mapsto f|_{S_1} \in H_{S_1}$ is a contraction. From this it follows that we have to consider only the sets
$$
S_n = \{s_1, s_2, \ldots, s_n\} \in \mathcal{F}
$$
in the computation of the supremum above. Fix $n$. For each $\omega \in \Omega$ we have
$$
Z_n(\omega) := \|X(\omega)|_{S_n}\|_R^2 = \sum_{s,t\in S_n} X_s(\omega) X_t(\omega)\, R_n^{-1}(s,t), \qquad (5.5)
$$
where $R_n$ is the matrix $(R(s,t))_{s,t\in S_n}$, see (3.5.4). The $Z_n$ are nondecreasing and hence $Z_n \uparrow Z$ for some random variable $Z \geq 0$ on $\Omega$. Note that
$$
E(Z_n) = \sum_{s,t\in S_n} K(s,t)\, R_n^{-1}(s,t) = \operatorname{Tr}(K_n R_n^{-1}) \leq \operatorname{Tr}(L),
$$
where $K_n$ is the matrix $(K(s,t))_{s,t\in S_n}$ and $L : H_R \to H_K$ the domination operator, see 3.6.5. By monotone convergence we have
$$
E(Z) = \lim_n E(Z_n) \leq \operatorname{Tr}(L) < \infty
$$


and consequently
$$
\sup_n \|X(\omega)|_{S_n}\|_R^2 = Z(\omega) < \infty,
$$
for almost every $\omega \in \Omega$. Consider such $\omega$. Then the sample path $X(\omega)|_T$ is in $H_T$ and consequently there exists a unique function $f_\omega \in H_R$ such that $f_\omega(t) = X(\omega)(t) = X_t(\omega)$, for all $t \in T$, see 3.5.5.

For $\omega$ in the remaining null set, set $f_\omega = 0$ and define the process $Y$ on $F$ as follows:
$$
Y_s(\omega) = f_\omega(s), \qquad \omega \in \Omega,\ s \in F.
$$
Then we have
$$
Y_t = X_t, \quad \text{almost surely, for each } t \in T. \qquad (5.6)
$$
We claim that $Y$ is a stochastic process on $F$, that is, $Y_s : (\Omega,\mathcal{F},P) \to \mathbb{R}$ is measurable, for each $s \in F$. This is certainly true for $s \in T$, since then $Y_s = X_s$ almost surely and the probability space $(\Omega,\mathcal{F},P)$ is assumed to be complete.

Note that $Y$ is defined as a map $Y : \Omega \to H_R$. The set of all $f \in H_R$ such that $Y_f := (Y, f)_R : \Omega \to \mathbb{R}$ is measurable is a closed linear subspace of $H_R$ which contains all the kernel functions $f = R_t$, $t \in T$, and hence every $f \in H_R$. Applying this to $f = R_s$ shows that $Y_s$ is measurable, for each $s \in F$.

It follows that the distribution $P_Y$ of $Y$ on $H_R$ is well defined. We claim that $P_Y$ has strong second order, that is,
$$
\int_{H_R} \|f\|_R^2\, P_Y(df) = E_P\big(\|Y\|_R^2\big) < \infty.
$$
For each $\omega \in \Omega$ we have $Y|_{S_n} = X|_{S_n}$, for all $n \geq 1$, and so, using 3.5.6,
$$
\|Y(\omega)\|_R^2 = \lim_n \|Y|_{S_n}(\omega)\|_R^2 = \lim_n \|X|_{S_n}(\omega)\|_R^2 = \lim_n Z_n(\omega) = Z(\omega).
$$
By monotone convergence we have
$$
E_P\big(\|Y\|_R^2\big) = \lim_n E(Z_n) \leq \operatorname{Tr}(L) < \infty.
$$
Thus $P_Y$ has strong and hence weak second order. It follows that the process $Y_t$ is square integrable. It remains only to be shown that
$$
Y_s = X_s \quad \text{almost surely, for each } s \in F.
$$


Recall that $L^2_X$ and $L^2_Y$ are the closed linear spans in $L^2(\Omega,\mathcal{F},P)$ of the $X_s$ respectively the $Y_s$, $s \in F$, and that we have the Loève isomorphisms
$$
I_X : L^2_X \to H_K \quad \text{and} \quad I_Y : L^2_Y \to H_K
$$
satisfying $I_X(X_s) = I_Y(Y_s) = K_s$, for all $s \in F$. Now the span of the kernel functions $R_t$, $t \in T$, is dense in $H_R$ and we have the domination operator
$$
L : H_R \to H_K
$$
which is continuous and satisfies $LR_s = K_s$, for all $s \in F$, and consequently has dense range in $H_K$.

It follows that the span of the $K_t$, $t \in T$, is dense in $H_K$ and an application of the Loève isomorphisms shows that $L^2_X$ and $L^2_Y$ are the closed linear spans of the $X_t$ respectively $Y_t$ with $t \in T$. But for $t \in T$ we have $X_t = Y_t$ in $L^2(\Omega,\mathcal{F},P)$. It follows that $L^2_X = L^2_Y$. Moreover, since
$$
I_X(X_t) = K_t = I_Y(X_t), \qquad \forall t \in T,
$$
we have $I_X = I_Y$ and it follows that
$$
Y_s = I_Y^{-1}(K_s) = I_X^{-1}(K_s) = X_s \quad \text{in } L^2(\Omega,\mathcal{F},P),
$$
for all $s \in F$. Thus $Y$ is a version of $X$.

Proposition 5.2.3. Assume that $X$ and $Y$ are stochastic processes on $F$ with sample paths in a reproducing kernel Hilbert space $H$ with kernel $R$ on the set $F$. If $Y$ is a version of $X$, then $P_Y = P_X$.

Remark. If $H$ is a general Hilbert space, the distribution $P_X$ of $X$ on $H$ need not be defined.

Proof. It will suffice to show that $P_X$ and $P_Y$ agree on cylinder sets, equivalently, that the random vector
$$
W = (\Lambda_{f_1}, \ldots, \Lambda_{f_n})
$$
has the same distribution under $P_X$ and $P_Y$, for all $n \geq 1$ and $f_1, \ldots, f_n \in H$. Under $P_X$ the distribution of $W$ is the distribution of the random vector
$$
(\Lambda_{f_1} \circ X, \ldots, \Lambda_{f_n} \circ X)
$$
under the underlying probability $P$. Under $P_Y$ the distribution of $W$ is the distribution of the random vector
$$
(\Lambda_{f_1} \circ Y, \ldots, \Lambda_{f_n} \circ Y)
$$


under $P$. It will thus suffice to show that
$$
\Lambda_f \circ X = \Lambda_f \circ Y, \quad P\text{-almost surely}, \qquad (5.7)
$$
for each $f \in H$. The set $V$ of all $f \in H$ for which this is true is a closed subspace of $H$ which contains all the kernel functions $f = R_t$, $t \in F$. It follows that $V = H$.

Corollary 5.2.1. Let $X$ be a square integrable process with sample paths in the reproducing kernel Hilbert space $H$ with kernel $R$ on the set $F$ and let $K$ be the covariance kernel of $X$ on $F$. If $R \gg K$, then the distribution $P_X$ of $X$ on $H$ has strong second order.

Proof. According to Theorem 5.2.1, $X$ has a version $Y$ with sample paths in $H$ and distribution $P_Y$ on $H$ which has strong second order. But since $Y$ is a version of $X$ we have $P_Y = P_X$.


Chapter 6

Gaussian random fields

This chapter investigates the class of square integrable processes with the property that all finite dimensional marginal distributions are multinormal.

6.1 Definition and construction

Let $F \subseteq \mathbb{R}^n$ be a closed subset. A Gaussian random field $Z$ on $F$ is a collection of random variables $Z(x)$, $x \in F$, on a probability space $(\Omega,\mathcal{F},P)$ with the following property:

For each finite set of points $x_1, \ldots, x_k \in F$ the distribution of the random vector $(Z(x_1), \ldots, Z(x_k))$ is multinormal in $\mathbb{R}^k$. Note that this is equivalent to the requirement that each linear combination
$$
a_1 Z(x_1) + a_2 Z(x_2) + \cdots + a_k Z(x_k)
$$
be a normal random variable. Since normal random variables have moments of all orders the process $Z$ is integrable of order $p$, for all $p > 0$. In particular the mean function $m(x) = E(Z(x))$ and the covariance kernel $K$ of $Z$ on $F$,
$$
K(x,y) = E[Z(x)Z(y)] - m(x)m(y), \qquad x,y \in F,
$$
are well defined and $K$ is a symmetric, positive semidefinite kernel on $F$.

We can view $Z$ as a random variable with values in the vector space $\mathbb{R}^F$ of all functions $f : F \to \mathbb{R}$: for each $\omega \in \Omega$ let $Z(\omega) : F \to \mathbb{R}$ be the function
$$
x \in F \mapsto [Z(x)](\omega) \in \mathbb{R}.
$$
Then $Z(\omega)$ is called the sample path of $Z$ at $\omega$. With this we can view $Z$ as a map
$$
Z : \omega \in \Omega \mapsto Z(\omega) \in \mathbb{R}^F
$$


and this map is measurable with respect to the $\sigma$-field $\mathcal{E}$ generated by the evaluation functionals
$$
E_x : f \in \mathbb{R}^F \mapsto f(x) \in \mathbb{R}, \qquad x \in F.
$$
Then the distribution $P_Z$ of $Z$ on $\mathbb{R}^F$ is defined (on $\mathcal{E}$) and is a Gaussian measure on $\mathbb{R}^F$ with definitions in complete analogy to the case of Gaussian measures on a Hilbert space. Conversely every Gaussian measure $P$ on $\mathbb{R}^F$ is the distribution of the process
$$
Z(x) = E_x, \qquad x \in F,
$$
viewed as a random variable on the probability space $(\mathbb{R}^F, \mathcal{E}, P)$. The sample paths of this process $Z$ are obviously exactly the elements $f \in \mathbb{R}^F$, that is, all functions $f : F \to \mathbb{R}$.

Thus we have a complete correspondence between Gaussian processes on $F$ and Gaussian measures on $\mathbb{R}^F$. Unfortunately this correspondence is not of much use. The product space $\mathbb{R}^F$ is far too large to have interesting properties.

Instead we are interested in Hilbert spaces $H \subseteq \mathbb{R}^F$ such that $Z$ has all its sample paths in $H$. If this is the case and $H$ is a reproducing kernel Hilbert space on $F$, then the map
$$
Z : \omega \in \Omega \mapsto Z(\omega) \in H
$$
is automatically measurable for the $\sigma$-field $\mathcal{E}$ generated by the continuous linear functionals on $H$. Thus the distribution $P_Z$ of $Z$ on $H$ is well defined.

For the remainder of this section assume that $Z$ is a Gaussian process on (indexed by) the set $F$ with covariance kernel $K$ and $H = H_R$ a reproducing kernel Hilbert space on $F$ with kernel $R$.

Proposition 6.1.1. Assume that all sample paths of $Z$ are in $H$. Then the distribution $P_Z$ of $Z$ on $H$ is a Gaussian measure on $H$ and the mean function $m(x) = E(Z(x))$ is the mean of the distribution $P_Z$, so in particular $m \in H$.

Proof. We must show that the random variable $\Lambda_f(\cdot) = (\cdot, f)$ is normal under $P_Z$, equivalently that $\Lambda_f \circ Z$ is normal under the probability $P$ underlying the Gaussian process $Z$, for each $f \in H$.

Let $R$ denote the kernel of $H$ on $F$. Then the kernel functions $R_x$, $x \in F$, have dense linear span in $H$. If $x \in F$ and $f = R_x$, then $\Lambda_f \circ Z = Z(x)$ is normal under $P$, since $Z$ is a Gaussian process. Linear combinations and


pointwise limits of normal random variables are again normal random variables. Thus the set $V$ of $f \in H$ such that $\Lambda_f \circ Z$ is a normal random variable is a closed linear subspace of $H$ which contains all the kernel functions $R_x$. It follows that $V = H$.

Since a Gaussian measure has weak first order, Proposition 5.2.1 implies that the mean function $m(x) = E(Z(x))$, $x \in F$, is the mean of $P_Z$ in $H$.

Proposition 6.1.2. Assume that all sample paths of $Z$ are in $H$. Then we have nuclear domination of $K$ by $R$, the domination operator $L : H_R \to H_K$ is the covariance operator $Q$ of $P_Z$, and $P_Z$ has strong second order.

Proof. The Gaussian measure $P_Z$ has weak second order. According to Proposition 5.2.1, $R$ dominates the covariance kernel $K$ of $Z$ and the domination operator $L$ is the covariance operator of $P_Z$. But we have already seen that the covariance operator of a Gaussian measure is a trace class operator. Thus we have nuclear dominance of $K$ by $R$.

Theorem 6.1.1. Assume that the mean function $m$ of the Gaussian process $Z$ satisfies $m \in H$. Then $Z$ has a version with sample paths in the reproducing kernel Hilbert space $H_R$ if and only if $R$ dominates the covariance kernel $K$ of $Z$ in a nuclear fashion.

Proof. If $R \gg K$, then $Z$ has a version with sample paths in $H$ according to Theorem 5.2.1. This does not make use of the Gaussian assumption for $Z$. The converse follows from the preceding proposition if we note that each version of $Z$ is again a Gaussian process.

Remark 6.1.1. Let $Z$ be a Gaussian process on the set $F$ with mean function $m \in H_R$. Then one can show: if $R$ does not dominate $K$, then the sample paths of $Z$ are in $H_R$ with probability zero. See [JB02], Theorem 7.3, and other interesting results in this paper.

6.1.1 Construction of Gaussian random fields

Given a symmetric, positive semidefinite kernel $K$ on the set $F$ we investigate the construction of Gaussian processes $Z$ on $F$ with mean function $m(x) = E(Z(x)) = 0$ and covariance kernel
$$
E[Z(x)Z(y)] = K(x,y), \qquad x,y \in F.
$$


Karhunen-Loève expansion

The reproducing kernel Hilbert space $H_K$ with kernel $K$ has already entered the picture. Let $\{\psi_j : j \geq 1\}$ be an orthonormal basis of $H_K$ and $(Z_j)$ a sequence of independent, standard normal random variables on some probability space $(\Omega,\mathcal{F},P)$. Recall from (3.17) that we have the bilinear kernel expansion
$$
K(x,y) = \sum\nolimits_j \psi_j(x)\psi_j(y) \qquad (6.1)
$$
convergent at each point $(x,y) \in F \times F$. Let $X$, $Y$ be square integrable random variables on $(\Omega,\mathcal{F},P)$ with mean zero. Then $\mathrm{Var}(X) = \|X\|_2^2$. Moreover independence of $X$ and $Y$ implies orthogonality in $L^2(\Omega,\mathcal{F},P)$:
$$
(X,Y)_{L^2} = \int_\Omega XY\,dP = E(XY) = E(X)E(Y) = 0.
$$
Thus the $Z_j$ are orthonormal in $L^2(\Omega,\mathcal{F},P)$. We claim that the series
$$
Z(x) = \sum\nolimits_j \psi_j(x) Z_j \qquad (6.2)
$$

converges in $L^2(\Omega,\mathcal{F},P)$ for each point $x \in F$ and defines a centered Gaussian random field $Z$ with covariance kernel $K$. Let $x \in F$. From the bilinear kernel expansion (6.1) we have
$$
\sum\nolimits_j \psi_j^2(x) = K(x,x) < \infty
$$
and since the sequence $(Z_j)$ is orthonormal in $L^2(\Omega,\mathcal{F},P)$ the expansion (6.2) converges in the norm of $L^2(\Omega,\mathcal{F},P)$.

Since linear combinations and $L^2$-limits of normal variables with mean zero are again such variables, we see that $Z(x)$ and indeed any finite linear combination of the $Z(x)$ is a normal variable with mean zero.

Thus $Z$ is a Gaussian random field on $F$. Again by the $L^2$-convergence of the expansion (6.2) and the orthonormality of the $Z_j$ in $L^2(\Omega,\mathcal{F},P)$ the covariance kernel of $Z$ is computed as
$$
\mathrm{Cov}[Z(x), Z(y)] = \sum\nolimits_j \psi_j(x)\psi_j(y) = K(x,y).
$$
The expansion (6.2) is called the Karhunen-Loève expansion of $Z$ and is useful if we want to simulate the process $Z$ (the series is then cut off at some point). It does not however present the random field $Z$ as an $H_K$-valued random variable
$$
Z : \omega \in \Omega \mapsto Z(\omega) = \sum\nolimits_j Z_j(\omega)\psi_j \in H_K
$$


since the series on the right diverges in $H_K$ almost surely. Consequently the distribution of $Z$ does not live on $H_K$. Indeed we have no Gaussian measure on $H_K$ which corresponds to the process $Z$. To obtain such a measure we have to pass from the kernel $K$ to its (operator) square root; this is carried out in the construction below.
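Before turning to that construction, here is a small sketch of the simulation use of the truncated expansion (6.2) mentioned above. It uses the Brownian-motion kernel $K(s,t)=\min(s,t)$ on $[0,1]$, whose classical eigenpairs are $\lambda_j = ((j-\tfrac12)\pi)^{-2}$ and $\phi_j(t) = \sqrt{2}\sin((j-\tfrac12)\pi t)$, so that $\psi_j = \sqrt{\lambda_j}\,\phi_j$ is an orthonormal basis of $H_K$; the grid and truncation level are arbitrary illustrative choices.

```python
import numpy as np

# Hedged sketch: simulate a centered Gaussian field via the truncated
# Karhunen-Loeve expansion (6.2) for K(s,t) = min(s,t) on [0,1].
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 501)            # evaluation grid (assumed)
N = 200                                   # truncation point of the series (assumed)

j = np.arange(1, N + 1)
omega = (j - 0.5) * np.pi                 # frequencies (j - 1/2) * pi
psi = np.sqrt(2.0) * np.sin(np.outer(t, omega)) / omega   # psi_j(t) = sqrt(lambda_j) phi_j(t)

one_path = psi @ rng.standard_normal(N)   # one truncated sample path  sum_j psi_j(t) Z_j

# The variance of the truncated field at t is sum_j psi_j(t)^2, which should
# be close to K(t,t) = t; the discrepancy is the truncation error.
print(np.abs((psi**2).sum(axis=1) - t).max())
```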

Gaussian measure construction

If the kernel $K$ is suitably well behaved we can employ another construction which does yield some information about the sample paths of the process $Z$. Assume that $F$ is a closed subset of $\mathbb{R}^d$ and that $K$ is a symmetric, continuous and positive semidefinite kernel on $F$ which satisfies
$$
\int_F K(x,x)\,dx < \infty.
$$
We will now construct a Gaussian random field $Z$ with covariance kernel $K$ from a Gaussian measure on a suitable reproducing kernel Hilbert space $H$. The kernel $L$ of $H$ will be the operator square root of the kernel $K$.

Let $T = T_K : L^2(F) \to L^2(F)$ be the integral operator with kernel $K$ and write
$$
T = \sum\nolimits_i \lambda_i (\phi_i \otimes \phi_i),
$$
where $(\phi_i) \subseteq L^2(F)$ is an orthonormal sequence of eigenvectors of $T$ and the $\lambda_i > 0$ are the corresponding eigenvalues. We then have the bilinear expansion
$$
K(x,y) = \sum\nolimits_i \lambda_i \phi_i(x)\phi_i(y)
$$
convergent uniformly on compact subsets of $F \times F$, see 3.4.2. The operator $T$ is a trace class operator, that is, $\sum_i \lambda_i < \infty$. Consequently the positive square root
$$
T^{1/2} = \sum\nolimits_i \sqrt{\lambda_i}\,(\phi_i \otimes \phi_i)
$$
is a Hilbert-Schmidt operator and so there exists a square integrable kernel $L$ on $F$ such that $T^{1/2} = T_L$. In fact $L$ is given as
$$
L(x,y) = \sum\nolimits_i \sqrt{\lambda_i}\,\phi_i(x)\phi_i(y). \qquad (6.3)
$$
The series converges in $L^2(F \times F)$ since the functions $\zeta_i(x,y) = \phi_i(x)\phi_i(y)$ are orthonormal in $L^2(F \times F)$. Thus $L$ is the operator square root of the kernel $K$. We have
$$
\sum\nolimits_i \lambda_i \phi_i^2(x) = K(x,x) < \infty
$$


and hence the kernel functions $L_x$ satisfy
$$
L_x = \sum\nolimits_i \sqrt{\lambda_i}\,\phi_i(x)\,\phi_i
$$
with convergence in $L^2(F)$. From this it follows that
$$
(L_x, L_y)_{L^2} = \sum\nolimits_i \lambda_i \phi_i(x)\phi_i(y) = K(x,y). \qquad (6.4)
$$
The $L^2$-convergence of the series above does not allow us to conclude that $L$ is another Mercer kernel on $F$. We need to assume that

(i) $\sum_i \sqrt{\lambda_i} < \infty$.
(ii) The expansion (6.3) converges pointwise on $F \times F$.
(iii) The sum $L(x,y)$ is continuous.

The orthonormality of the $\phi_i$ in $L^2(F)$ and (i) imply $\int_F L(x,x)\,dx < \infty$ and so $L$ is a Mercer kernel on $F$.

Now let $H = H_L$ be the reproducing kernel Hilbert space, $T = T_L$ the integral operator with kernel $L$ and $Q$ the restriction of $T_L$ to $H \subseteq L^2(F)$. Then $Q : H \to H$ is a trace class operator and hence there exists a unique Gaussian measure $P$ on $H$ with mean zero and covariance operator $Q$. We claim

Proposition 6.1.3. For $x \in F$ let $Z(x)$ be the evaluation functional at the point $x$ operating on $H = H_L$. Then $Z = (Z(x))_{x\in F}$ is a Gaussian process on the probability space $(H, \mathcal{E}, P)$ with covariance function $K(x,y)$. The sample paths of $Z$ are exactly the functions in $H$.

Proof. Let $x \in F$. The reproducing property of the kernel function $L_x$ implies that
$$
[Z(x)](f) = f(x) = (f, L_x)_L,
$$
that is, $Z(x)$ is the functional $\Lambda_{L_x}$ on $H$. Thus $Z(x)$ is normal with mean zero and we have
$$
E_P[Z(x)Z(y)] = (QL_x, L_y)_H = (L_x, L_y)_{L^2} = K(x,y).
$$
Here we have used (6.4) and the property (3.33) of the inner product in the reproducing kernel Hilbert space $H$. Since the $Z(x)$ are the evaluation functionals, the sample path
$$
x \in F \mapsto [Z(x)](f) = f(x)
$$


at $f \in H$ is exactly the function $f$ itself. Thus it remains to be shown only that $Z$ is a Gaussian process:

Let $x_1, \ldots, x_d \in F$. We have to show that the vector $(Z(x_1), \ldots, Z(x_d))$ is multinormal in $\mathbb{R}^d$. Let $a_1, \ldots, a_d \in \mathbb{R}$ and note that
$$
a_1 Z(x_1) + \cdots + a_d Z(x_d) = \Lambda_g,
$$
where $g = a_1 L_{x_1} + \cdots + a_d L_{x_d} \in H$, and hence $\Lambda_g$ is a normal variable by definition of a Gaussian measure.

Remark. To use this to draw conclusions about the sample paths of $Z$ we need detailed information about the eigenvalues $\lambda_i$ and eigenfunctions $\phi_i$ of the integral operator $T_K$. Since $T_L$ is the operator
$$
T_L = \sum\nolimits_i \sqrt{\lambda_i}\,(\phi_i \otimes \phi_i),
$$
$H$ consists of all functions $f \in L^2(F)$ such that
$$
\sum\nolimits_j \lambda_j^{-1/2}\, |(f, \phi_j)_2|^2 < \infty,
$$
see the second description (3.38) of the norm on a reproducing kernel Hilbert space. Using the expansion
$$
f = \sum\nolimits_j (f, \phi_j)_2\, \phi_j
$$
in $L^2(F)$ we need rapid convergence $\lambda_j \to 0$ to be able to inherit smoothness properties from the $\phi_j$ to all functions $f \in H$.
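The passage from $K$ to its operator square root $L$ can also be carried out numerically by discretizing the integral operator $T_K$ on a grid and using its eigendecomposition, in the spirit of (6.3) and (6.4). The following sketch does this for a Gaussian kernel on $[0,1]$; the kernel, grid size and bandwidth are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Hedged sketch: a crude Nystrom-type approximation of the square-root
# kernel L of a Mercer kernel K on F = [0,1].
n = 300
x = np.linspace(0.0, 1.0, n)
w = 1.0 / n                                                  # quadrature weight dx

K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.1**2)     # K(x_i, x_j), assumed kernel

# Discretized eigenproblem of T_K: (K * w) u = lambda u, with eigenfunctions
# phi_i(x_k) = U[k, i] / sqrt(w) orthonormal in L^2(F).
lam, U = np.linalg.eigh(K * w)
lam = np.clip(lam, 0.0, None)                                # drop tiny negative round-off

# L(x_k, x_l) = sum_i sqrt(lambda_i) phi_i(x_k) phi_i(x_l), as in (6.3).
L = (U * np.sqrt(lam)) @ U.T / w

# Check (6.4): (L_x, L_y)_{L^2} = K(x, y), i.e. composing the discretized T_L
# with itself reproduces K.
print(np.abs(L @ (w * L) - K).max())                         # close to machine precision
```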


Appendix A

Vector Valued Integration

We need only the very simplest facts about Hilbert space valued integrals. Let $(\Omega,\mathcal{F},P)$ be a finite measure space and $X$ a Banach space. A function $f : \Omega \to X$ is called Borel measurable if it is measurable with respect to the Borel $\sigma$-field on $X$. It then follows that $\Lambda \circ f : \Omega \to \mathbb{R}$ is Borel measurable for each continuous function $\Lambda : X \to \mathbb{R}$.

A Borel measurable function $f : \Omega \to X$ is called weakly integrable if there exists an element $a \in X$ such that
$$
\Lambda(a) = \int_\Omega \Lambda(f(\omega))\,P(d\omega)
$$
for all continuous linear functionals $\Lambda$ on $X$. In this case the element $a \in X$ is uniquely determined and called the integral of $f$ in $X$, denoted
$$
a = \int_\Omega f\,dP \in X.
$$
The defining property of this integral is that it commutes with all continuous linear functionals $\Lambda$ on $X$:
$$
\Lambda\Big(\int_\Omega f\,dP\Big) = \int_\Omega (\Lambda \circ f)\,dP. \qquad (A.1)
$$
The Borel measurable function $f : \Omega \to X$ is called Bochner integrable if
$$
\int_\Omega \|f(\omega)\|\,P(d\omega) < \infty,
$$
where $\|\cdot\|$ denotes the norm on $X$. In this case the $X$-valued integral is called norm convergent. It then follows that $f$ is weakly integrable, in particular


there is a unique element $x \in X$ which is the integral of $f$ in the sense above. The proof is particularly simple if $X$ is a Hilbert space and this is the only case which we need:

Proposition A.0.4. Let $X$ be a Hilbert space and $f : \Omega \to X$ Bochner integrable. Then $f$ is weakly integrable and
$$
\Big\| \int_\Omega f\,dP \Big\| \leq \int_\Omega \|f\|\,dP.
$$

Proof. Assume that
$$
C = \int_\Omega \|f\|\,dP < \infty.
$$
Let $(\cdot,\cdot)$ denote the inner product on $X$. For each $x \in X$, the function $\omega \in \Omega \mapsto (f(\omega), x) \in \mathbb{R}$ is absolutely integrable: using the Cauchy-Schwarz inequality we have
$$
\int_\Omega |(f(\omega), x)|\,P(d\omega) \leq C\,\|x\| < \infty. \qquad (A.2)
$$
Now define the functional $\Psi : X \to \mathbb{R}$ as
$$
\Psi(x) = \int_\Omega (f(\omega), x)\,P(d\omega), \qquad x \in X.
$$
It is clear that $\Psi$ is a linear functional on $X$ and from (A.2) it follows that $\Psi$ is continuous. By the Riesz representation theorem for continuous linear functionals on a Hilbert space there exists an element $a \in X$ such that $\Psi(x) = (a,x)$, for all $x \in X$, that is,
$$
(a,x) = \int_\Omega (f(\omega), x)\,P(d\omega), \qquad x \in X.
$$
By the Riesz representation theorem this is exactly (A.1) in the case of a Hilbert space $X$. Using (A.2) we have $|(a,x)| \leq C\,\|x\|$ and letting $x = a$ it follows that
$$
\Big\| \int_\Omega f\,dP \Big\| = \|a\| \leq C = \int_\Omega \|f\|\,dP.
$$


Appendix B

Conditioning of Multinormal Random Vectors

The distribution of a multinormal vector $Y$ is completely determined by its mean and covariance matrix $C$, which is symmetric and positive semidefinite. This section treats the conditioning of $Y$ on one or several of its components. We assume that the covariance matrix has full rank and hence is positive definite.

First we derive a representation of $Y$ in terms of a standard normal vector $Z = (Z_1, Z_2, \ldots, Z_n)$, that is, a vector $Z$ with independent standard normal components $Z_j$:
$$
E(Z_j^2) = 1, \qquad E(Z_i Z_j) = 0,\ i \neq j. \qquad (B.1)
$$
All vectors $Y$, $Z$ are viewed as column vectors. We only need the case where the mean is zero. The generalization to arbitrary means is trivial.

Proposition B.0.5. Let $C = RR'$ be the Cholesky factorization of the covariance matrix $C$, that is, $R$ is lower triangular and $R'$ denotes the transpose of $R$. If $Z$ is a standard normal vector, then the vector
$$
Y = RZ
$$
is multinormal with mean zero and covariance matrix $C$.

Remark. Recall that the covariance matrix $C_{ij} = E[Y_i Y_j]$ of a mean zero random vector $Y$ can be written as $C = E[YY']$, where the vector $Y$ is viewed as a column vector.


Proof. Since $Y$ is a linear transform of $Z$, $Y$ is again multinormal. Obviously $E(Y) = 0$. Note that $Z$ has the covariance matrix $E[ZZ'] = I$. The covariance matrix of $Y$ is then given as
$$
E[YY'] = E[(RZ)(RZ)'] = R\,E[ZZ']\,R' = RR' = C.
$$

Note that the vector $Y$ in B.0.5 has the form
$$
\begin{aligned}
Y_1 &= R_{11} Z_1 \\
Y_2 &= R_{21} Z_1 + R_{22} Z_2 \\
&\ \ \vdots \\
Y_n &= R_{n1} Z_1 + R_{n2} Z_2 + \cdots + R_{nn} Z_n.
\end{aligned}
\qquad (B.2)
$$
This allows us to condition $Y$ on the first $n-1$ components $Y_j$, $j < n$. Thus we are given the values $Y_j = y_j$ for $j < n$ and seek the distribution of $Y_n$ given this information.

Note that conditioning on $Y_j$, $j < n$, is equivalent to conditioning on $Z_j$, $j < n$, and this has no influence on $Z_n$, since the $Z_j$ are independent. Thus all we need to do is substitute $y_j$ in place of $Y_j$ in the first $n-1$ equations, solve these equations for the values $z_1, \ldots, z_{n-1}$ of $Z_1, \ldots, Z_{n-1}$ and enter these solutions into the last equation to obtain
$$
Y_n = (R_{n1} z_1 + \cdots + R_{n,n-1} z_{n-1}) + R_{nn} Z_n,
$$
and it follows that $Y_n$ is conditionally normal with conditional mean and variance
$$
E(Y_n \mid Y_j = y_j,\ j < n) = R_{n1} z_1 + \cdots + R_{n,n-1} z_{n-1}
\quad\text{and}\quad
\mathrm{Var}(Y_n \mid Y_j = y_j,\ j < n) = R_{nn}^2.
$$
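The following sketch carries out this conditioning recipe numerically and cross-checks it against the standard Schur-complement formulas for conditioning a multinormal vector; the covariance matrix and observed values are arbitrary illustrative choices.

```python
import numpy as np

# Hedged sketch of Appendix B: condition the last component of a mean-zero
# multinormal vector Y = R Z on the first n-1 components, using the
# Cholesky factor R of the covariance matrix C (C and y_obs are assumed).
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
C = A @ A.T + 0.5 * np.eye(4)            # a positive definite covariance matrix
R = np.linalg.cholesky(C)                # C = R R', R lower triangular

y_obs = np.array([0.3, -1.2, 0.7])       # observed values y_j of Y_1, ..., Y_{n-1}

# Solve the first n-1 equations of (B.2) for z_1, ..., z_{n-1}.
z = np.linalg.solve(R[:3, :3], y_obs)

cond_mean = R[3, :3] @ z                 # R_{n1} z_1 + ... + R_{n,n-1} z_{n-1}
cond_var = R[3, 3] ** 2                  # R_{nn}^2

# Cross-check with the usual Schur-complement conditioning formulas.
mean_check = C[3, :3] @ np.linalg.solve(C[:3, :3], y_obs)
var_check = C[3, 3] - C[3, :3] @ np.linalg.solve(C[:3, :3], C[:3, 3])
print(np.isclose(cond_mean, mean_check), np.isclose(cond_var, var_check))
```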


Appendix C

Orthogonal polynomials

A kernel of the form
$$
K(x,y) = \sum_{j=1}^N \psi_j(x)\psi_j(y), \qquad x,y \in X,
$$
leads to a finite dimensional reproducing kernel Hilbert space $H$ and can be used for regression with a Gaussian measure $P$ on $H$. The idea is that we consider only functions $f$ which have an expansion of the form
$$
f = \sum_{j=1}^N A_j(f)\,\psi_j
$$
and then determine the coefficients $A_j$ by conditioning on the data. This works for all sets $X$ and the only condition on the functions $\psi_j$ on $X$ is that they must be linearly independent. In this case they are automatically an orthonormal basis in $H$.

Assume now that $X$ is some closed rectangle in $\mathbb{R}^d$. We then prefer functions $\psi_j$ which are orthonormal in $L^2(X)$. Thus it is useful to review the basics of orthogonal polynomials.

Let $\mathcal{P}$ be the vector space of polynomials in one variable with real coefficients and $(\cdot,\cdot)$ an inner product on $\mathcal{P}$. We assume that the inner product satisfies
$$
(xP(x), Q(x)) = (P(x), xQ(x)), \qquad P,Q \in \mathcal{P}. \qquad (C.1)
$$
This is the case for all inner products of the form
$$
(P(x), Q(x)) = \int_X P(x)Q(x)\,\mu(dx),
$$
where $\mu$ is a positive measure on $X$. A standard orthonormal basis of $\mathcal{P}$ is an orthonormal sequence $(P_n)$ of polynomials such that
$$
\deg(P_k) = k,
$$


for all $k \geq 0$. It then follows that
$$
\operatorname{span}(P_0, P_1, \ldots, P_n) = \operatorname{span}(1, x, x^2, \ldots, x^n),
$$
for all $n \geq 0$. We assume that the norm $\|\cdot\|$ on $\mathcal{P}$ is normalized so that $\|1\| = 1$. We claim

Proposition C.0.6. A standard orthonormal basis $(P_n)$ of $\mathcal{P}$ is uniquely determined by the inner product on $\mathcal{P}$ and the polynomials $P_n$ satisfy a recursion of the form
$$
\begin{aligned}
P_0 &= 1 \\
xP_0(x) &= \lambda_0 P_1(x) + \mu_0 P_0(x) \\
xP_n(x) &= \lambda_n P_{n+1}(x) + \mu_n P_n(x) + \gamma_n P_{n-1}(x)
\end{aligned}
\qquad (C.2)
$$
with scalars $\lambda_n$, $\mu_n$, $\gamma_n$.

Proof. $P_0 = 1$ by normalization of the norm on $\mathcal{P}$. The second equation is handled like the third. Note that $xP_n(x) \in \operatorname{span}(P_0, P_1, \ldots, P_{n+1})$ and so
$$
xP_n(x) = \lambda_n P_{n+1}(x) + \mu_n P_n(x) + \cdots + \zeta_n P_0(x).
$$
However for $k < n-1$ we have $xP_k(x) \in \operatorname{span}(P_0, P_1, \ldots, P_{n-1})$ and so
$$
(xP_n(x), P_k(x)) = (P_n(x), xP_k(x)) = 0.
$$
Thus only the first three coefficients are nonzero, yielding (C.2). Taking the inner product with $P_{n-1}$ and $P_n$ we find that $\mu_n = (xP_n, P_n)$ and $\gamma_n = (xP_n, P_{n-1})$. Thus these coefficients are uniquely determined once the polynomials $P_n$ and $P_{n-1}$ are at hand. The requirement $\|P_{n+1}\| = 1$ then determines $\lambda_n$ uniquely. The uniqueness of the sequence $(P_n)$ now follows by induction.

C.0.2 Legendre polynomials

We now investigate the sequence $(P_n)$ which is orthonormal on the interval $X = [-1,1]$ with inner product
$$
(P,Q) = \frac{1}{2}\int_X P(x)Q(x)\,dx.
$$


Because of the normalization with the factor $1/2$ the polynomial $P_0 = 1$ has unit norm. It is not difficult to compute the first few of the polynomials $P_n$:
$$
\begin{aligned}
P_0(x) &= 1 \\
P_1(x) &= \sqrt{3}\,x \\
P_2(x) &= \frac{\sqrt{5}}{2}\,(3x^2 - 1) \\
P_3(x) &= \frac{\sqrt{7}}{2}\,(5x^3 - 3x).
\end{aligned}
$$

Let us now investigate what form the recursion (C.2) assumes in the case of these polynomials. The terms even and odd, when applied to a function $f$ on $X$, have the usual meaning $f(-x) = f(x)$ and $f(-x) = -f(x)$. Note that a product of two odd functions and a product of two even functions is even, while a product of an odd and an even function is odd. Because of the symmetry of the interval $X$ the integral of an odd function is zero, while the integral of an even function is twice the integral over the interval $[0,1]$.

Note that the sequence $P_n$ is orthonormal by assumption and satisfies the recursion (C.2). We claim that
(a) $P_n(x)$ is even for even $n$ and odd for odd $n$, and we have $\mu_n = 0$ for all $n$.

Proof. Induction on $n$. By inspection the claim is true for $n = 1,2,3$. Assume it is true for all $k \leq n$ and consider the polynomial $P_{n+1}$. Taking the inner product with $P_n$ in (C.2) we obtain
$$
2\mu_n = 2(xP_n, P_n) = \int_X xP_n^2(x)\,dx = 0,
$$
since the function $P_n^2$ is even in all cases. From this
$$
\lambda_n P_{n+1} = xP_n - \lambda_{n-1} P_{n-1}.
$$
If $n$ is even, then $xP_n(x)$ is odd and so is $P_{n-1}(x)$, and consequently $P_{n+1}(x)$ is odd. Likewise, if $n$ is odd, then $xP_n(x)$ is even and so is $P_{n-1}(x)$, and it follows that $P_{n+1}(x)$ is even.

Next we note that
$$
\gamma_n = (xP_n, P_{n-1}) = (P_n, xP_{n-1}) = (P_n, \lambda_{n-1}P_n + \gamma_{n-1}P_{n-2}) = \lambda_{n-1}
$$
and so we have the simpler recursion
$$
xP_n = \lambda_n P_{n+1} + \lambda_{n-1} P_{n-1}. \qquad (C.3)
$$


Note now that $P_n \perp \operatorname{span}(P_0, P_1, \ldots, P_{n-1}) = \operatorname{span}(1, x, \ldots, x^{n-1})$ while $(P_n, x^n) \neq 0$ (or else $(P_n, P_n) = 0$), and set
$$
f_n = P_n(1).
$$
We now have to use the fact that our inner product is based on Lebesgue measure. We do this by using integration by parts. Note that the expression
$$
f(x)\Big|_{-1}^{+1} := f(1) - f(-1)
$$
vanishes if $f$ is even, while it is $2f(1)$ if $f$ is odd. Note also that $(P_n, Q) = 0$, if the polynomial $Q$ has degree less than $n$, while $(P_n, Q) \neq 0$, if the polynomial $Q$ has degree $n$. With this we compute
$$
2(P'_{n+1}, P_n) = \int_X P'_{n+1} P_n = P_{n+1}P_n\Big|_{-1}^{+1} - \int_X P_{n+1} P'_n \qquad (C.4)
$$
$$
= 2 f_n f_{n+1}, \qquad (C.5)
$$
since $P_n P_{n+1}$ is odd and the integral on the right is zero. Similarly

$$
\begin{aligned}
2(xP'_n, P_n) &= 2(xP'_n + P_n, P_n) - 2 = 2((xP_n)', P_n) - 2 \\
&= \int_X (xP_n)' P_n - 2 = xP_n^2(x)\Big|_{-1}^{+1} - \int_X xP_nP'_n - 2 \\
&= -2 + 2f_n^2 - 2(xP'_n, P_n).
\end{aligned}
$$
Moving the inner product from the right to the left it follows that
$$
(xP'_n, P_n) = \tfrac{1}{2}(f_n^2 - 1). \qquad (C.6)
$$
But on the other hand (using (C.4))
$$
(xP'_n, P_n) = (P'_n, xP_n) = (P'_n, \lambda_n P_{n+1} + \lambda_{n-1} P_{n-1}) = \lambda_{n-1}(P'_n, P_{n-1}) = \lambda_{n-1} f_{n-1} f_n
$$
and so
$$
\lambda_{n-1} f_{n-1} f_n = \tfrac{1}{2}(f_n^2 - 1). \qquad (C.7)
$$
From (C.4) we have $(\lambda_n P'_{n+1}, P_n) = \lambda_n f_n f_{n+1}$. On the other hand, differentiating (C.3) gives $\lambda_n P'_{n+1} = (xP_n)' - \lambda_{n-1}P'_{n-1} = P_n + xP'_n - \lambda_{n-1}P'_{n-1}$, thus
$$
(\lambda_n P'_{n+1}, P_n) = 1 + (xP'_n, P_n) - \lambda_{n-1}(P'_{n-1}, P_n) = 1 + \tfrac{1}{2}(f_n^2 - 1) = \tfrac{1}{2}(1 + f_n^2)
$$


and so
$$
\lambda_n f_n f_{n+1} = \tfrac{1}{2}(1 + f_n^2). \qquad (C.8)
$$
Putting (C.7) and (C.8) together we see that
$$
\tfrac{1}{2}(1 + f_n^2) = \lambda_n f_n f_{n+1} = \tfrac{1}{2}(f_{n+1}^2 - 1).
$$
Thus $f_{n+1}^2 = f_n^2 + 2$ and since $f_1^2 = 3$ we have
$$
f_n^2 = 2n + 1.
$$
From this we get
$$
\lambda_n = \frac{1}{2}\,\frac{f_{n+1}^2 - 1}{f_n f_{n+1}} = \frac{n+1}{\sqrt{2n+1}\,\sqrt{2n+3}},
$$
which allows us to rewrite the recursion $P_{n+1} = \lambda_n^{-1} x P_n - \lambda_n^{-1}\lambda_{n-1} P_{n-1}$ as
$$
P_{n+1}(x) = \frac{\sqrt{(2n+1)(2n+3)}}{n+1}\, x P_n(x) - \frac{n\sqrt{2n+3}}{(n+1)\sqrt{2n-1}}\, P_{n-1}(x). \qquad (C.9)
$$
Note that this differs from the usual simpler recursion since our Legendre polynomials are orthonormal, while the usual recursion yields orthogonal polynomials which are not normalized.
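The recursion (C.9) is easy to implement and to check numerically against the orthonormality requirement with respect to the inner product $(P,Q) = \tfrac12\int_{-1}^{1}P(x)Q(x)\,dx$. The following sketch does exactly that; the degree bound and the quadrature order are illustrative choices.

```python
import numpy as np

# Hedged sketch: evaluate the orthonormal Legendre polynomials via (C.9) and
# verify orthonormality for the inner product (P,Q) = (1/2) * int_{-1}^{1} P Q dx.
def orthonormal_legendre(x, n_max):
    """Return array of shape (n_max+1, len(x)) with P_0, ..., P_{n_max} at x."""
    x = np.asarray(x, dtype=float)
    P = np.zeros((n_max + 1, x.size))
    P[0] = 1.0
    if n_max >= 1:
        P[1] = np.sqrt(3.0) * x
    for n in range(1, n_max):
        a = np.sqrt((2 * n + 1) * (2 * n + 3)) / (n + 1)
        b = n * np.sqrt(2 * n + 3) / ((n + 1) * np.sqrt(2 * n - 1))
        P[n + 1] = a * x * P[n] - b * P[n - 1]
    return P

# Gauss-Legendre quadrature integrates these polynomial products exactly.
nodes, weights = np.polynomial.legendre.leggauss(50)
P = orthonormal_legendre(nodes, 6)
G = 0.5 * (P * weights) @ P.T         # Gram matrix (P_m, P_n)
print(np.abs(G - np.eye(7)).max())    # should be close to machine precision
```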


Bibliography

[Bog98] V.I. Bogachev. Gaussian Measures. AMS, 1998.

[Dri73] Michael F. Driscoll. The reproducing kernel Hilbert space structure of the sample paths of a Gaussian process. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 26:309–316, 1973.

[GK71] G.S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applications, 33:82–95, 1971.

[JB02] M.N. Lukic and J.H. Beder. Stochastic processes with sample paths in reproducing kernel Hilbert spaces. Transactions AMS, 2002.

[Mac99] D.J.C. MacKay. Introduction to Gaussian processes, 1999.

[MM02] V. Matache and M.T. Matache. Hilbert spaces induced by Toeplitz covariance kernels. Stochastic Theory and Control, 280:319–333, 2002.

[RA] R.J. Adler and J.E. Taylor. Random Fields and Their Geometry.

[Sch] B. Schölkopf, R. Herbrich, and A.J. Smola. A generalized representer theorem.


Index

K_S, 53

approximation number, 20, 21

basis
  orthonormal in rkhs, 62
Bochner integrable, 109

Cameron-Martin space, 79
Cholesky factorization
  conditioning with, 84, 111
coefficients
  regression, 83
compact operator, 10
  Hilbert space, 17
  selfadjoint, 14
covariance kernel, 89
covariance operator, 72
cylinder set, 70
cylinder sets, 70

Dini's theorem, 51
domination operator, 56

evaluation functional, 82

Gaussian
  measure, 73
  random field, 101
  vectors, conditioning, 111
Gaussian measure
  Fourier transform, 75

Hilbert-Schmidt operator, 21

integrable of order p > 0, 89
integral operator, 29, 31
integration
  vector valued, 109
inverse problem, 25

Karhunen-Loève expansion, 103
kernel, 29, 37
  L2-bounded, 36
  bilinear expansion, 49
  convolution, 47
  degenerate, 32
  domination, 55
  Gaussian, 41
  Mercer, 62
  nuclear domination, 58
  one dimensional, 32
  positive semidefinite, 37
  symmetric, 34
  truncated, 32
kernel matrix
  invertibility, 53

least squares
  polynomial interpolation, 27
Legendre polynomials, 88
Loève isomorphism, 90

mean
  in Hilbert space, 71
mean function, 89
Mercer kernel, 62
Mercer's theorem, 66
multinormal vectors
  conditioning, 111

normal equation, 27

orthogonal polynomials, 113

probability measure in Hilbert space, 69
process
  integrable, 89
  square integrable, 89

regression, 81
regularization, 25
Representer theorem, 62
reproducing kernel Hilbert space, 37, 41
  convolution kernel, 47
  finite dimensional, 45

singular system, 17
strong order p, 71

trace, 23
Trace class operator, 22

weak order p, 71
weakly integrable, 109