
Kernel methods and regularization techniques for nonparametric regression: Minimax optimality and adaptation

Lee H. Dicker    Dean P. Foster    Daniel Hsu

Department of Statistics and Biostatistics, Rutgers University, Piscataway, NJ 08854
e-mail: [email protected]

Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
e-mail: [email protected]

Department of Computer Science, Columbia University, New York, NY 10027, USA
e-mail: [email protected]

Abstract: Regularization is an essential element of virtually all kernel methods for nonparametric regression problems. A critical factor in the effectiveness of a given kernel method is the type of regularization that is employed. This article compares and contrasts members from a general class of regularization techniques, which notably includes ridge regression and principal component regression. We first derive risk bounds for these techniques that match the minimax rates in several settings, using recent large deviations machinery and a natural bias-variance decomposition. We then show that certain regularization techniques are more adaptable than others to favorable regularity properties that the true regression function may possess. This, in particular, demonstrates a striking difference between kernel ridge regression and kernel principal component regression.

Keywords and phrases: Learning theory, Principal component regression, Reproducing kernel Hilbert space, Ridge regression.

1. Introduction

Suppose that the observed data consists of zi = (yi, xi), i = 1, . . . , n, where yi ∈ Y ⊆ R and xi ∈ X ⊆ Rd. Suppose further that z1, . . . , zn ∼ ρ are iid from some probability distribution ρ on Y × X. Let ρ(·|x) denote the conditional distribution of yi given xi = x ∈ X and let ρX


denote the marginal distribution of xi. Our goal is to use the available data to estimate the regression function of y on x,
\[
f^{\dagger}(x) = \int_{\mathcal{Y}} y \, d\rho(y \mid x),
\]
which minimizes the mean-squared prediction error
\[
\int_{\mathcal{Y} \times \mathcal{X}} \{y - f(x)\}^2 \, d\rho(y, x)
\]
over ρX-measurable functions f : X → R. More specifically, for an estimator f̂ define the risk
\[
R_{\rho}(\hat{f}) = E\left[\int \{f^{\dagger}(x) - \hat{f}(x)\}^2 \, d\rho_{\mathcal{X}}(x)\right] = E\big(\|f^{\dagger} - \hat{f}\|_{\rho_{\mathcal{X}}}^2\big), \tag{1}
\]
where the expectation is computed over z1, . . . , zn and || · ||ρX denotes the norm on L2(ρX); we seek estimators f̂ that minimize Rρ(f̂).

This is a version of the random design nonparametric regression problem. There is a vast

literature on nonparametric regression, along with a huge variety of corresponding methods (e.g., Gyorfi et al., 2002; Wasserman, 2006). In this paper, we focus on regularization and kernel methods for estimating f† (here, we mean "kernel" as in reproducing kernel Hilbert space, rather than kernel smoothing, which is another popular approach to nonparametric regression). Most of our results apply to general regularization operators. However, our motivating examples are two well-known regularization techniques: kernel ridge regression (which we refer to as "KRR"; KRR is also known as Tikhonov regularization) and kernel principal component regression ("KPCR"; also known as spectral cut-off regularization).

Our main contribution is two-fold. First, we derive new risk bounds for a general class of kernel-regularization methods, which includes both KRR and KPCR. These bounds imply that the corresponding regularization methods achieve the minimax rate for estimating f† with respect to a variety of kernels and settings; this is illustrated by example in Corollaries 1–4 of Section 6.1. In each example, the minimax rate is determined by the eigenvalues of the kernel operator, computed with respect to ρX. Second, we show precisely how some types of regularization operators are able to adapt to the regularity of f†, while others are known to have more limitations in this regard and suffer from the saturation effect (Bauer et al., 2007; Mathe, 2005; Neubauer, 1997). More specifically, we show that if f† has an especially simplified expansion in the eigenbasis of the kernel operator, then some regularization estimators are able to leverage this to obtain even faster (minimax) convergence rates. As a consequence of our adaptation results, we conclude that KPCR is fully minimax adaptive in a range of settings, while KRR is known to saturate. This illustrates a striking advantage that KPCR may have over KRR in these settings.


Related Work

Kernel ridge regression and other regularization methods have been widely studied. Indeed, it is well-known that KRR is minimax in many of the settings we consider in this paper, such as those described in Corollaries 1–4 below (Caponnetto and De Vito, 2007; Zhang, 2005; Zhang et al., 2013). However, our risk bounds and corresponding minimax results apply to more general regularization operators (not just KRR/Tikhonov regularization), which appears to be new. Existing risk bounds for other regularization operators tend to be looser (Bauer et al., 2007) (see the comments after Theorem 1 below) or require auxiliary information (e.g., the bounds in (Caponnetto and Yao, 2010) apply to semi-supervised settings where an additional pool of unlabeled data is available).

Others have also observed that KPCR may have advantages over KRR, as discussed above. Indeed, others have even observed that Tikhonov regularization (KRR) saturates, while spectral cut-off regularization (KPCR) does not (Bauer et al., 2007; Lo Gerfo et al., 2008; Mathe, 2005). Our results (Propositions 4–5) extend beyond this observation to precisely quantify the advantages of unsaturated regularization operators in terms of adaptation and minimaxity. In other related work, Dhillon et al. (2013) have illustrated the potential advantages of KPCR over KRR in finite-dimensional problems with linear kernels; though their work is not framed in terms of saturation and general regularization operators, it essentially relies on similar concepts.

The main engine behind the technical results in this paper is a collection of large-deviation results for Hilbert-Schmidt operators. The required machinery is developed in Appendix A (especially Lemmas 3–5 and Corollary 5); these results build on straightforward extensions of results from (Tropp, 2015). Additionally, our Proposition 5, on finite-rank adaptation, relies on well-known eigenvalue perturbation results that have been adapted to handle Hilbert-Schmidt operators (e.g., the Davis-Kahan sin Θ theorem (Davis and Kahan, 1970)).

2. Statistical Setting and Assumptions

Our basic assumption on the distribution of z = (y, x) ∼ ρ is that the residual variance is bounded; more specifically, we assume that there exists a constant σ2 > 0 such that
\[
\int_{\mathcal{Y}} \{y - f^{\dagger}(x)\}^2 \, d\rho(y \mid x) \le \sigma^2 \tag{2}
\]
for almost all x ∈ X. Zhang et al. (2013) also assume (2); this assumption is slightly weaker than the analogous assumption in (Bauer et al., 2007) (equation (1) in their paper). Note that (2) holds if y is bounded almost surely.


Let K : X × X → R be a symmetric positive-definite kernel function. We assume that K is bounded, i.e., that there exists κ2 > 0 such that
\[
\sup_{x \in \mathcal{X}} K(x, x) \le \kappa^2.
\]
Additionally, we assume that there is a countable basis of eigenfunctions {ψj}∞j=1 ⊆ L2(ρX) and a sequence of corresponding eigenvalues t21 ≥ t22 ≥ · · · ≥ 0 such that
\[
K(x, \tilde{x}) = \sum_{j=1}^{\infty} t_j^2\, \psi_j(x)\psi_j(\tilde{x}), \qquad x, \tilde{x} \in \mathcal{X}, \tag{3}
\]
and the convergence is absolute. Mercer's theorem and various generalizations give conditions under which representations like (3) are known to hold (Carmeli et al., 2006); one of the simplest examples is when X is a compact Hausdorff space, ρX is a probability measure on the Borel sets of X, and K is continuous. Observe that
\[
\sum_{j=1}^{\infty} t_j^2 = \sum_{j=1}^{\infty} t_j^2 \int_{\mathcal{X}} \psi_j(x)^2 \, d\rho_{\mathcal{X}}(x) = \int_{\mathcal{X}} K(x, x)\, d\rho_{\mathcal{X}}(x) \le \kappa^2;
\]
in particular, {t2j} ∈ ℓ1(N).

Let H ⊆ L2(ρX) be the reproducing kernel Hilbert space (RKHS) corresponding to K (Aronszajn, 1950) and let
\[
\phi_j = t_j \psi_j, \qquad j = 1, 2, \ldots. \tag{4}
\]
It follows from basic facts about RKHSs that {φj}∞j=1 is an orthonormal basis for H (if t2J > t2J+1 = 0, then {φj}Jj=1 is an orthonormal basis for H). Our main assumption on the relationship between y, x, and the kernel K is that
\[
f^{\dagger} \in \mathcal{H}. \tag{5}
\]
This is a regularity or smoothness assumption on f†. We represent f† in the bases {ψj}∞j=1 and {φj}∞j=1 as
\[
f^{\dagger} = \sum_{j=1}^{\infty} \gamma_j \psi_j = \sum_{j=1}^{\infty} \beta_j \phi_j, \tag{6}
\]
where βj = γj/tj. Then the assumption (5) is equivalent to
\[
\sum_{j=1}^{\infty} \beta_j^2 = \sum_{j=1}^{\infty} \frac{\gamma_j^2}{t_j^2} < \infty.
\]
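As a concrete illustration of the setup above (a worked example added here for orientation; it is not part of the original text), take X ⊆ Rd bounded, the linear kernel K(x, x̃) = xᵀx̃, and write the second-moment matrix of ρX as Σ = E(xxᵀ) = Σdj=1 t2j uj ujᵀ, assumed positive definite. Setting ψj(x) = t−1j ujᵀx gives
\[
\int_{\mathcal{X}} \psi_j(x)\psi_k(x)\, d\rho_{\mathcal{X}}(x) = t_j^{-1}t_k^{-1}\, u_j^{\top}\Sigma\, u_k = \delta_{jk},
\qquad
K(x,\tilde{x}) = \sum_{j=1}^{d} (u_j^{\top}x)(u_j^{\top}\tilde{x}) = \sum_{j=1}^{d} t_j^2\, \psi_j(x)\psi_j(\tilde{x}),
\]
so (3) holds with finite rank d and the kernel eigenvalues t2j are exactly the eigenvalues of Σ. Moreover, φj(x) = tjψj(x) = ujᵀx, the RKHS H consists of the linear functions x ↦ wᵀx with ||wᵀ(·)||H = ||w||, and assumption (5) simply says that the regression function is linear; this is the finite-rank setting of Corollary 4 below.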


Many of the results in this paper can be modified, so that they apply to settings where f† ∉ H, by replacing f† with an appropriate projection of f† onto H and including an approximation error term in the corresponding bounds. This approach leads to the study of oracle inequalities (Hsu et al., 2014; Koltchinskii, 2006; Steinwart et al., 2009; Zhang, 2005; Zhang et al., 2013), which we do not pursue in detail here. However, investigating oracle inequalities for general regularization operators may be of interest for future research, as most existing work focuses on ridge regularization.

3. Regularization

As discussed in Section 1, our goal is to find estimators f̂ that minimize the risk (1). In this paper, we focus on regularization-based estimators for f†. In order to precisely describe these estimators, we require some additional notation for various operators that will be of interest, and some basic definitions from regularization theory.

3.1. Finite-Rank Operators of Interest

For x ∈ X, define Kx ∈ H by Kx(x̃) = K(x, x̃), x̃ ∈ X, and let X = (x1, . . . , xn)ᵀ ∈ Rn×d and y = (y1, . . . , yn)ᵀ ∈ Rn. Additionally, define the finite-rank linear operators SX : H → Rn and TX : H → H (both depending on X) by
\[
S_X \phi = \big(\langle \phi, K_{x_1}\rangle_{\mathcal{H}}, \ldots, \langle \phi, K_{x_n}\rangle_{\mathcal{H}}\big)^{\top} = \big(\phi(x_1), \ldots, \phi(x_n)\big)^{\top},
\qquad
T_X \phi = \frac{1}{n}\sum_{i=1}^{n} \langle \phi, K_{x_i}\rangle_{\mathcal{H}}\, K_{x_i} = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\, K_{x_i},
\]
where ⟨·, ·⟩H is the inner-product on H and φ ∈ H. Let ⟨·, ·⟩Rn denote the normalized inner-product on Rn, defined by ⟨v, ṽ⟩Rn = n−1vᵀṽ for v = (v1, . . . , vn)ᵀ, ṽ = (ṽ1, . . . , ṽn)ᵀ ∈ Rn. Then the adjoint of SX with respect to ⟨·, ·⟩H and ⟨·, ·⟩Rn, S∗X : Rn → H, is given by
\[
S_X^{*} v = \frac{1}{n}\sum_{i=1}^{n} v_i K_{x_i}.
\]
Additionally, TX = S∗X SX. Finally, observe that SX S∗X : Rn → Rn is given by the n × n matrix SX S∗X = n−1K, where K = (K(xi, xj))1≤i,j≤n; K is the kernel matrix, which is ubiquitous in kernel methods and enables finite computation.


3.2. Basic Definitions

A family of functions gλ : [0,∞) → [0,∞) indexed by λ > 0 is called a regularization family if it satisfies the following three conditions:

(R1) sup0<t≤κ2 |t gλ(t)| ≤ 1.
(R2) sup0<t≤κ2 |1 − t gλ(t)| ≤ 1.
(R3) sup0<t≤κ2 |gλ(t)| ≤ λ−1.

(This definition follows (Bauer et al., 2007; Engl et al., 1996), but is slightly more restrictive.) The main idea behind a regularization family is that it "looks" similar to g(t) = 1/t, but is better-behaved near t = 0, i.e., it is bounded by λ−1. An important quantity that is related to the adaptability and saturation point of a regularization family (mentioned in Section 1) is the qualification of the regularization. The qualification of the regularization family {gλ}λ>0 is defined to be the maximal ξ ≥ 0 such that
\[
\sup_{0 < t \le \kappa^2} |1 - t g_{\lambda}(t)|\, t^{\xi} \le \lambda^{\xi}. \tag{7}
\]
The two primary examples of regularization considered in this paper are ridge (Tikhonov) regularization, where
\[
g_{\lambda}(t) = r_{\lambda}(t) = \frac{1}{t + \lambda},
\]
and principal component (spectral cut-off) regularization, where
\[
g_{\lambda}(t) = s_{\lambda}(t) = \frac{1}{t}\, I\{t \ge \lambda\}. \tag{8}
\]
Observe that ridge regularization has qualification 1 and principal component regularization has qualification ∞. Another example of a regularization family is the Landweber iteration, which can be viewed as a special case of gradient descent (see, e.g., Bauer et al., 2007; Lo Gerfo et al., 2008; Rosasco et al., 2005). Throughout the paper, all regularization families are assumed to have qualification at least 1.
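The qualification values just stated can be checked directly from (7); the short calculation below is added here for completeness and is not part of the original text. For ridge regularization, 1 − t rλ(t) = λ/(t + λ), and for any 0 ≤ ξ ≤ 1,
\[
\sup_{t > 0}\, \frac{\lambda\, t^{\xi}}{t + \lambda}
= \lambda^{\xi}\, \sup_{t > 0} \Big(\frac{\lambda}{t + \lambda}\Big)^{1-\xi}\Big(\frac{t}{t + \lambda}\Big)^{\xi}
\le \lambda^{\xi},
\]
whereas for ξ > 1, taking t = κ2 shows that the left-hand side exceeds λξ once λ is small enough; hence the qualification of rλ is exactly 1. For spectral cut-off, 1 − t sλ(t) = I{t < λ}, so
\[
\sup_{0 < t \le \kappa^2} |1 - t s_{\lambda}(t)|\, t^{\xi} = \sup_{0 < t < \lambda} t^{\xi} \le \lambda^{\xi} \quad\text{for every } \xi \ge 0,
\]
and the qualification of sλ is ∞.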

3.3. Estimators

Given a regularization family {gλ}λ>0, we define the gλ-regularized estimators for f†,
\[
f_{\lambda} = g_{\lambda}(T_X)\, S_X^{*} y. \tag{9}
\]
Here, gλ acts on the spectrum (eigenvalues) of the finite-rank operator TX. The dependence of fλ on the regularization family is implicit; our results hold for any regularization family


except where explicitly stated (in particular, Section 6.2). The estimators fλ are the main focus of this paper.

To provide some intuition behind fλ, define the positive self-adjoint operator T : H → H by
\[
T\phi = \int_{\mathcal{X}} \langle \phi, K_x \rangle_{\mathcal{H}}\, K_x\, d\rho_{\mathcal{X}}(x) = \sum_{j=1}^{\infty} t_j^2 \langle \phi, \phi_j \rangle_{\mathcal{H}}\, \phi_j, \qquad \phi \in \mathcal{H}.
\]
Observe that T is a "population" version of the operator TX. Unlike TX, T often has infinite rank; however, we still might expect that
\[
T \approx T_X \tag{10}
\]
for large n (where the approximation holds in some suitable sense). We also have a large-n approximation for S∗Xy. For φ ∈ H,
\[
\langle \phi, S_X^{*} y \rangle_{\mathcal{H}} = \frac{1}{n}\sum_{i=1}^{n} y_i \langle \phi, K_{x_i}\rangle_{\mathcal{H}} \approx \int_{\mathcal{Y}\times\mathcal{X}} y\, \phi(x)\, d\rho(y, x) = \int_{\mathcal{X}} f^{\dagger}(x)\phi(x)\, d\rho_{\mathcal{X}}(x) = \langle \phi, f^{\dagger}\rangle_{L^2(\rho_{\mathcal{X}})} = \langle \phi, T f^{\dagger}\rangle_{\mathcal{H}},
\]
where ⟨·, ·⟩L2(ρX) denotes the inner-product on L2(ρX) and we have used (4) to obtain the last equality. It follows that S∗Xy ≈ Tf†. Hence, to recover f† from y, it would be natural to apply the inverse of T to S∗Xy. However, T is not invertible whenever it has infinite rank, and regularization becomes necessary. We thus arrive at the chain of approximations which help motivate fλ:
\[
f_{\lambda} = g_{\lambda}(T_X)\, S_X^{*} y \approx g_{\lambda}(T)\, T f^{\dagger} \approx f^{\dagger},
\]
where gλ(T) may be viewed as an approximate inverse for a suitably chosen regularization parameter λ. The rest of the paper is devoted to deriving bounds on the risk
\[
R_{\rho}(f_{\lambda}) = E\big(\|f^{\dagger} - f_{\lambda}\|_{\rho_{\mathcal{X}}}^2\big) = E\big\{\|f^{\dagger} - g_{\lambda}(T_X) S_X^{*} y\|_{\rho_{\mathcal{X}}}^2\big\} \tag{11}
\]
and investigating the effect of specific regularization families {gλ}λ>0 on the accuracy of fλ.
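Since SX S∗X = n−1K, the estimator (9) can be computed from the kernel matrix alone: using the standard identity gλ(S∗X SX)S∗X = S∗X gλ(SX S∗X), one gets fλ(x) = n−1 Σi ci K(x, xi) with c = gλ(n−1K)y, where gλ acts on the eigenvalues of n−1K. The following Python sketch (ours, not part of the original text; the function names are illustrative) implements this for a generic filter gλ, with KRR and KPCR as special cases.

```python
import numpy as np

def fit_spectral_filter(K, y, g_lambda):
    """Return coefficients c = g_lambda(K/n) y, so that
    f_lambda(x) = (1/n) * sum_i c[i] * K(x, x_i)."""
    n = K.shape[0]
    evals, evecs = np.linalg.eigh(K / n)            # spectrum of S_X S_X^* = K/n
    filtered = g_lambda(np.clip(evals, 0.0, None))  # apply the filter to the eigenvalues
    return evecs @ (filtered * (evecs.T @ y))

def predict(c, K_new):
    """K_new[i, j] = K(x_new_i, x_train_j); returns f_lambda at the new points."""
    return K_new @ c / K_new.shape[1]

# Ridge (Tikhonov) and principal component (spectral cut-off) filters:
ridge = lambda lam: (lambda t: 1.0 / (t + lam))
cutoff = lambda lam: (lambda t: np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0))
```

For the ridge filter this reduces to the familiar formula fλ(x) = k(x)ᵀ(K + nλI)−1 y, while the cut-off filter inverts K/n on the eigenspaces with eigenvalue at least λ and discards the rest.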

4. Preliminary Simplifications

The main results in this paper involve bounding (11). This section is devoted to deriving some preliminary simplifications of (11).


By (4), the L2(ρX)-norm in (11) can be converted into a norm on H; indeed, we have Rρ(fλ) = E{||T 1/2(f† − fλ)||2H}. Next, define the residuals εi = yi − f†(xi), i = 1, . . . , n, and let ε = (ε1, . . . , εn)ᵀ. Then fλ = gλ(TX)TX f† + gλ(TX)S∗X ε and it follows that
\[
R_{\rho}(f_{\lambda}) = E\Big[\big\|T^{1/2}\{I - g_{\lambda}(T_X) T_X\} f^{\dagger} - T^{1/2} g_{\lambda}(T_X) S_X^{*}\varepsilon\big\|_{\mathcal{H}}^2\Big]
= E\Big[\big\|T^{1/2}\{I - g_{\lambda}(T_X) T_X\} f^{\dagger}\big\|_{\mathcal{H}}^2\Big] + E\Big\{\big\|T^{1/2} g_{\lambda}(T_X) S_X^{*}\varepsilon\big\|_{\mathcal{H}}^2\Big\} \tag{12}
\]
(the cross term vanishes because E(ε | X) = 0).

This is a bias/variance decomposition of the risk Rρ(fλ); the first term in (12) represents the bias of fλ and the second term represents the variance.

In order to further simplify (12), we first note that the Hilbert space H is isometric to ℓ2(N) via the isometry ι : H → ℓ2(N), given by
\[
\iota : \sum_{j=1}^{\infty} \alpha_j \phi_j \mapsto (\alpha_1, \alpha_2, \ldots)^{\top} \tag{13}
\]
(if H is finite dimensional and t2J > t2J+1 = 0, then take 0 = αJ+1 = αJ+2 = · · · ; we take all elements of ℓ2(N) to be infinite-dimensional column vectors). Using this equivalence, we can convert elements of H and linear operators on H appearing in (12) into (infinite-dimensional) vectors and matrices, respectively, which we find simpler to analyze in the sequel.

Before proceeding, we introduce some more notation. Define the infinite-dimensional diagonal matrix T = diag(t21, t22, . . . ) and let β = (β1, β2, . . . )ᵀ ∈ ℓ2(N). Next define the n × ∞ matrices Ψ = (ψj(xi))1≤i≤n; 1≤j<∞ and Φ = ΨT 1/2 = (φj(xi))1≤i≤n; 1≤j<∞. Identifying H with ℓ2(N) via ι, observe that
\[
S_X = \Phi \circ \iota, \tag{14}
\]
\[
S_X^{*} = \iota^{-1} \circ \Big(\frac{1}{n}\Phi^{\top}\Big), \tag{15}
\]
\[
T = \iota^{-1} \circ T \circ \iota, \tag{16}
\]
where the T on the right-hand side of (16) denotes the diagonal matrix defined above. Finally, let || · || = || · ||ℓ2(N) denote the norm on ℓ2(N).

Combining (12)–(16) yields the next proposition, which is the starting point for obtaining risk bounds in the sections that follow.

Proposition 1. Suppose that fλ is the estimator defined in (9). Then

Rρ(fλ) = Bρ(gλ) + Vρ(gλ),


where
\[
B_{\rho}(g_{\lambda}) = B_{\rho}^{(n)}(g_{\lambda}) = E\left[\left\|T^{1/2}\left\{I - g_{\lambda}\Big(\tfrac{1}{n}\Phi^{\top}\Phi\Big)\tfrac{1}{n}\Phi^{\top}\Phi\right\}\beta\right\|^2\right],
\]
\[
V_{\rho}(g_{\lambda}) = V_{\rho}^{(n)}(g_{\lambda}) = \frac{1}{n^2}\, E\left\{\left\|T^{1/2}\, g_{\lambda}\Big(\tfrac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\varepsilon\right\|^2\right\}.
\]
Proposition 1 is a further simplified version of the bias/variance decomposition (12); Bρ(gλ) and Vρ(gλ) are the bias and variance terms, respectively.

5. Main Result

In this section, we give separate bounds on the bias and variance terms from Proposition 1, which are then combined to yield our main result bounding Rρ(fλ). Three major sources contribute to our upper bound on the bias in Proposition 2 below (and similarly to our bound on the variance in Proposition 3): two types of approximation error, and what we refer to as the "intrinsic bias" (or "intrinsic variance").

The two types of approximation error correspond to finite- and infinite-dimensional components, which arise from the approximation (10). The intrinsic bias and variance are determined by the regularization family {gλ}λ>0 and the specific regularization parameter λ used to compute fλ. A schematic for our risk bound decomposition may be found in Figure 1. Note, however, that while the bias/variance decomposition of Rρ(fλ) is an unambiguous additive identity, attribution of the various sources of error in Propositions 2–3 is somewhat more subtle. Still, we view this as a useful heuristic.

[Figure 1: a two-level tree. Rρ(fλ) splits into the bias Bρ(gλ) and the variance Vρ(gλ); each of these is further attributed to approximation error (finite-dimensional and infinite-dimensional) and to the intrinsic bias or intrinsic variance.]

Fig 1. Schematic for decomposition of risk bound.

In vector-matrix notation (i.e., working in ℓ2(N) under the isometry (13)), the approximation (10) can be rewritten as
\[
\frac{1}{n}\Phi^{\top}\Phi \approx T. \tag{17}
\]


The approximation error in (17) is essentially the source of the approximation error terms in the schematic, Figure 1. Controlling this error is a key piece of our strategy for proving Propositions 2–3; it relies on large deviations results for random matrices and Hilbert-Schmidt operators, which have been developed in (Minsker, 2011; Tropp, 2015). The necessary technical lemmas for our bounds are derived in Appendix A.

Before proceeding, we aim to provide more intuition behind the finite-dimensional and infinite-dimensional approximation error cited in Figure 1. Note that the most direct approach to bounding the overall approximation error in (17) is to bound
\[
\delta = \left\|\frac{1}{n}\Phi^{\top}\Phi - T\right\|, \tag{18}
\]
where || · || denotes the operator norm (this is straightforward using the results of Minsker (2011) and Tropp (2015)). However, in order to obtain improved bounds on Bρ(gλ) and Vρ(gλ), we decompose n−1ΦᵀΦ − T into a finite-dimensional block and an infinite-dimensional "tail," and consider these terms separately. More specifically, let J ∈ N be a fixed positive integer and define the block decompositions
\[
\Phi = (\Phi_0 \;\; \Phi_1), \qquad T = \begin{pmatrix} T_0 & 0 \\ 0 & T_1 \end{pmatrix},
\]
where Φ0 = (φj(xi))1≤i≤n; 1≤j≤J, Φ1 = (φj(xi))1≤i≤n; J+1≤j<∞, T0 = diag(t21, . . . , t2J), and T1 = diag(t2J+1, t2J+2, . . . ). Additionally, define
\[
\delta_0 = \left\|\frac{1}{n}\Phi_0^{\top}\Phi_0 - T_0\right\|, \qquad
\delta_1 = \left\|\begin{pmatrix} 0 & \frac{1}{n}\Phi_0^{\top}\Phi_1 \\ \frac{1}{n}\Phi_1^{\top}\Phi_0 & \frac{1}{n}\Phi_1^{\top}\Phi_1 - T_1 \end{pmatrix}\right\|. \tag{19}
\]
Observe that
\[
\frac{1}{n}\Phi^{\top}\Phi - T = \begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 - T_0 & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} 0 & \frac{1}{n}\Phi_0^{\top}\Phi_1 \\ \frac{1}{n}\Phi_1^{\top}\Phi_0 & \frac{1}{n}\Phi_1^{\top}\Phi_1 - T_1 \end{pmatrix} \tag{20}
\]
and δ ≤ δ0 + δ1. The matrix n−1Φ0ᵀΦ0 − T0 has dimension J × J and contributes to the finite-dimensional approximation error in (17); δ1 reflects the infinite-dimensional approximation error.

Explicit bounds on the bias and variance terms Bρ(gλ) and Vρ(gλ) are given in Propositions 2–3 below. While the connection between these bounds and δ0, δ1 may not be immediately transparent from the statement of the results, it is developed more fully in the corresponding proofs (found in Appendix B).


Proposition 2. Let Bρ(gλ) be the bias term given in Proposition 1 and suppose that {τ2j}∞j=1 is a sequence of non-negative real numbers satisfying
\[
\sup_{J \in \mathbb{N},\, x \in \mathcal{X}}\; \tau_J^2 \sum_{j=1}^{J} \psi_j(x)^2 \le 1. \tag{21}
\]
(Observe that τ2j = t2j/κ2 always satisfies this condition.) Then
\[
B_{\rho}(g_{\lambda}) \le \left\{\frac{12}{t_J^2}\lambda^2 + \frac{408\kappa^2}{n t_J^2}\mathrm{tr}(T_1) + \frac{180\kappa^4}{n^2 t_J^2} + 13 t_{J+1}^2 + \kappa^2 J\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}\right\}\|f^{\dagger}\|_{\mathcal{H}}^2.
\]

Proposition 3. Let Vρ(gλ) be the variance term given in Proposition 1 and suppose that {τ2j}∞j=1 is a sequence of non-negative real numbers satisfying (21). Then
\[
V_{\rho}(g_{\lambda}) \le \left[2J + \frac{2J t_{J+1}^2}{\lambda} + \frac{1}{\lambda}\mathrm{tr}(T_1) + \frac{32J}{\lambda}\left\{\frac{\kappa^2}{n}\mathrm{tr}(T_1)\right\}^{1/2} + \frac{12\kappa^2 J}{\lambda n} + \frac{\kappa^2 J}{\lambda}\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}\right]\frac{\sigma^2}{n}.
\]

As mentioned above, it is a fairly subtle task to attribute each term in the upper bounds of Propositions 2–3 to a certain type of approximation error or the intrinsic bias or variance. However, a simple rule of thumb is that each appearance of λ in the upper bounds is related to the intrinsic bias or variance; t2J and J are related to the finite-dimensional approximation error; and t2J+1 and tr(T1) = Σj>J t2j are related to the infinite-dimensional approximation error. Propositions 1–3 are easily combined to obtain our main theorem.

Theorem 1. Suppose that fλ is the estimator defined in (9) and suppose that {τ2j}∞j=1 is a sequence of non-negative real numbers satisfying (21). Then for all positive integers J ∈ N,
\[
\begin{aligned}
R_{\rho}(f_{\lambda}) \le{}& \left[2J + \frac{2J t_{J+1}^2}{\lambda} + \frac{1}{\lambda}\mathrm{tr}(T_1) + \frac{32J}{\lambda}\left\{\frac{\kappa^2}{n}\mathrm{tr}(T_1)\right\}^{1/2} + \frac{12\kappa^2 J}{\lambda n} + \frac{\kappa^2 J}{\lambda}\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}\right]\frac{\sigma^2}{n} \\
&+ \left\{\frac{12}{t_J^2}\lambda^2 + \frac{408\kappa^2}{n t_J^2}\mathrm{tr}(T_1) + \frac{180\kappa^4}{n^2 t_J^2} + 13 t_{J+1}^2 + \kappa^2 J\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}\right\}\|f^{\dagger}\|_{\mathcal{H}}^2.
\end{aligned}
\]

The constants in the bounds from Propositions 2–3 (and hence Theorem 1) have not been optimized and can likely be improved. However, we will see in the next section that Theorem 1 implies that the risk of fλ achieves the minimax rate in a variety of important examples. Bauer


et al. (2007) have previously obtained risk bounds on general regularization estimators similar to fλ. However, their bounds (e.g., Theorem 10 in (Bauer et al., 2007)) are independent of the ambient RKHS H, i.e., they do not depend on the eigenvalues {t2j}. Our bounds are tighter than those in (Bauer et al., 2007) because we take advantage of the structure of H. In contrast with our Theorem 1, Bauer et al.'s (2007) results do not give minimax bounds (not easily, at least), because minimax rates must depend on the t2j (see Corollaries 1–4 below).
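To get a feel for how the terms in Theorem 1 trade off against each other, the bound can be evaluated numerically for a given eigenvalue sequence and minimized over the free truncation level J. The sketch below is ours (not part of the original text); it takes τ2J = t2J/κ2, which always satisfies (21), and all numerical choices are purely illustrative.

```python
import numpy as np

def theorem1_bound(t2, lam, J, n, kappa2, sigma2, f_norm2):
    """Evaluate the upper bound of Theorem 1 with tau_J^2 = t_J^2 / kappa^2.
    t2: array of eigenvalues t_1^2 >= t_2^2 >= ...; J is 1-indexed."""
    tJ2 = t2[J - 1]
    tJ1 = t2[J] if J < len(t2) else 0.0          # t_{J+1}^2
    trT1 = t2[J:].sum()                           # tr(T_1)
    tail = (2.0 / np.e) ** (tJ2 / kappa2 * n / 2.0)
    variance = (2*J + 2*J*tJ1/lam + trT1/lam
                + 32*J/lam * np.sqrt(kappa2 * trT1 / n)
                + 12*kappa2*J/(lam*n) + kappa2*J/lam * tail) * sigma2 / n
    bias = (12*lam**2/tJ2 + 408*kappa2*trT1/(n*tJ2)
            + 180*kappa2**2/(n**2*tJ2) + 13*tJ1 + kappa2*J*tail) * f_norm2
    return bias + variance

# Illustration: polynomial decay t_j^2 = j^{-2} and lambda = n^{-2/3} (cf. Corollary 1),
# minimizing the bound over the free truncation level J.
n = 10_000
t2 = 1.0 / np.arange(1, 2001, dtype=float) ** 2
best = min(theorem1_bound(t2, n ** (-2.0 / 3.0), J, n, 2.0, 1.0, 1.0) for J in range(1, 200))
```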

6. Consequences

In this section, we use Theorem 1 to derive minimax optimal rates for all regularization families, and to demonstrate the adaptation properties for regularization families with sufficient qualification.

6.1. Minimax Optimality

For 0 < r < ∞, define
\[
B_{\mathcal{H}}(r) = \left\{f = \sum_{j=1}^{\infty}\alpha_j\psi_j \in L^2(\rho_{\mathcal{X}})\,;\; \sum_{j=1}^{\infty}\frac{\alpha_j^2}{t_j^2} \le r^2\right\} = \{f \in \mathcal{H}\,;\; \|f\|_{\mathcal{H}}^2 \le r^2\} \subseteq \mathcal{H}.
\]
The set BH(r) can be interpreted as an ellipsoid in L2(ρX) or a ball in H. There is an extensive literature in statistics on nonparametric regression problems and minimax rates for estimating f† over ellipsoids of the form BH(r) (e.g., Nemirovskii, 1985; Nemirovskii et al., 1983; Tsybakov, 2004). These rates are typically determined by some condition on the rate of decay of t21 ≥ t22 ≥ · · · ≥ 0. In the classical statistics literature, the basis {ψj}∞j=1 and sequence {tj}∞j=1 are given directly, while in the present setting both are derived from the interaction between the kernel K and the distribution ρX. In this section, we show that for several commonly studied decay conditions on {t2j}∞j=1 and appropriate choice of the regularization parameter λ, the estimator fλ achieves the minimax rate for estimating f† ∈ BH(r), as n → ∞. Our results are summarized in the four corollaries to Theorem 1 given below; proofs may be found in Appendix B.

Corollary 1 (Polynomial-decay kernels). Suppose there are constants C > 0 and ν > 1/2 such that 0 < t2j ≤ Cj−2ν for all j = 1, 2, . . . . Let λ = n−2ν/(2ν+1). Then
\[
R_{\rho}(f_{\lambda}) = O\Big\{\big(\|f^{\dagger}\|_{\mathcal{H}}^2 + \sigma^2\big)\, n^{-\frac{2\nu}{2\nu+1}}\Big\}.
\]
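For orientation, plugging ν = 1 into Corollary 1 (a special case written out here, not an additional result) gives t2j ≤ Cj−2, the choice λ = n−2/3, and
\[
R_{\rho}(f_{\lambda}) = O\Big\{\big(\|f^{\dagger}\|_{\mathcal{H}}^2 + \sigma^2\big)\, n^{-2/3}\Big\},
\]
the classical rate associated with first-order Sobolev-type smoothness in one dimension.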


Corollary 2 (Exponential-decay kernels). Suppose there are constants C, α > 0 such that 0 < t2j ≤ Ce−αj for all j = 1, 2, . . . . Let λ = n−1 log(n). Then
\[
R_{\rho}(f_{\lambda}) = O\left\{\big(\|f^{\dagger}\|_{\mathcal{H}}^2 + \sigma^2\big)\, \frac{\log(n)}{n}\right\}.
\]

Corollary 3 (Gaussian-decay kernels). Suppose there are constants C, α > 0 such that 0 < t2j ≤ Ce−αj² for all j = 1, 2, . . . . Additionally, assume that
\[
\sup_{J \in \mathbb{N},\, x \in \mathcal{X}}\; C^{-1} e^{-\alpha J}\sum_{j=1}^{J}\psi_j(x)^2 \le 1. \tag{22}
\]
Let λ = n−1 log(n)1/2. Then
\[
R_{\rho}(f_{\lambda}) = O\left\{\big(\|f^{\dagger}\|_{\mathcal{H}}^2 + \sigma^2\big)\, \frac{\log(n)^{1/2}}{n}\right\}.
\]

Corollary 4 (Finite-rank kernels). Suppose that 0 = t2J+1 = t2J+2 = · · · . Let λ = n−1. Then
\[
R_{\rho}(f_{\lambda}) = O\left\{\big(\|f^{\dagger}\|_{\mathcal{H}}^2 + \sigma^2\big)\, \frac{J}{n}\right\}.
\]

In Corollaries 1–4, the implicit constants in the big-O bounds may depend on K and ρX, but not on f†, σ2, or n. In fact, Theorem 1 can be used to derive explicit bounds on the risk Rρ(fλ) in Corollaries 1–4 that elucidate the dependence on the kernel; however, such results are not reported here for the sake of simplicity. The upper bounds in Corollaries 1–4 immediately yield minimax optimality results over ellipsoids BH(r) ⊆ L2(ρX) for the corresponding kernels and eigenvalue-decay conditions (Belitser and Levit, 1995; Caponnetto and De Vito, 2007; Golubev et al., 1996; Ibragimov and Khas'minskii, 1983). It is noteworthy that the polynomial-decay kernels and corresponding RKHSs in Corollary 1 correspond to Sobolev spaces, which are of fundamental importance in nonparametric statistics and applied mathematics (Tsybakov, 2004). Additionally, Corollaries 2 and 3 are particularly relevant for the popular Gaussian kernels of the form K(x, x̃) = exp(−α2||x − x̃||2), α > 0 (Fasshauer and McCourt, 2012; Guo et al., 2002). Note that the additional condition (22) in Corollary 3 is required to ensure that the finite-dimensional approximation error (discussed in Section 5) decays fast enough. Condition (22) holds if, for instance, the eigenfunctions {ψj} are uniformly bounded. If (22) does not hold, one may obtain slightly weaker optimality results for Gaussian-decay kernels where ||f† − fλ||2ρX converges at the minimax rate in probability, rather than in expectation (Caponnetto and De Vito, 2007).


6.2. Adaptation

Our results in the previous sections apply to all regularization families satisfying conditions (R1)–(R3). In this section, we derive results that depend on the qualification (7) of the regularization and the specific regularization family.

A family of estimators {fλ}λ>0 is adaptive if it can achieve the minimax rate for estimating f† over a wide range of regions or subsets of H. The qualification of the regularization is related to how much a given regularization family can adapt to the regularity of the signal f†; generally, higher qualification corresponds to better adaptation (recall that ridge regularization has qualification 1 and principal component regularization has qualification ∞). In Section 6.2.1, we show that regularization families with sufficiently high qualification are minimax over balls BH′(r) in Hilbert spaces H′ ⊆ H, which consist of functions that are "even more regular" (e.g., smoother) than those in H. In Section 6.2.2, we restrict our attention to principal component regularization and show that KPCR can effectively adapt to finite-rank signals. On the other hand, regularization families with lower qualification (e.g., ridge regularization) are known to saturate, and the corresponding estimators are sub-optimal over subsets corresponding to highly-regular signals in many instances (Dhillon et al., 2013; Dicker, 2015; Mathe, 2005).

6.2.1. Adaptation in Sobolev Spaces (Polynomial-Decay Eigenvalues)

In this section, we assume that
\[
t_j^2 = j^{-2\nu}, \qquad j = 1, 2, \ldots \tag{23}
\]
for some ν > 1/2. Let f = Σ∞j=1 θjψj ∈ L2(ρX). Then f ∈ H if and only if Σ∞j=1 j2νθ2j < ∞. For µ ≥ 0, define
\[
\mathcal{H}_{\mu} = \left\{f = \sum_{j=1}^{\infty}\theta_j\psi_j \in L^2(\rho_{\mathcal{X}})\,;\; \sum_{j=1}^{\infty} j^{2\nu(\mu+1)}\theta_j^2 < \infty\right\} \subseteq \mathcal{H}.
\]
The space Hµ is itself a Hilbert space with norm ||f||2Hµ = Σ∞j=1 j2ν(µ+1)θ2j. Observe that H = H0 and || · ||H = || · ||H0. Corollary 1 implies that fλ is minimax over BH(r) for any regularization family {gλ}λ>0. The next result, which is proved in Appendix B, implies that fλ is minimax over BHµ(r) ⊆ Hµ ⊆ H, provided the qualification of {gλ}λ>0 is at least ⌈µ/2 + 1⌉.

Proposition 4. Assume that (23) holds and that the qualification of {gλ}λ>0 is at least ⌈µ/2 + 1⌉. Let µ > 0. If f† ∈ Hµ and λ = n−2ν/{2ν(µ+1)+1}, then
\[
R_{\rho}(f_{\lambda}) = O\Big\{\big(\|f^{\dagger}\|_{\mathcal{H}_{\mu}}^2 + \sigma^2\big)\, n^{-\frac{2\nu(\mu+1)}{2\nu(\mu+1)+1}}\Big\}.
\]


As in Corollaries 1–4, the constants in the big-O bounds in this proposition and the next may depend on the kernel K. Since Hµ2 ⊆ Hµ1 for µ1 ≤ µ2, we see that greater adaptation is possible via Proposition 4 for regularization families with higher qualification. Observe that Proposition 4 does not apply to KRR for any µ > 0, while it applies to KPCR for all µ ≥ 0.
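To make the gap concrete, consider the worked instance ν = 1 and µ = 2 (a plug-in of Proposition 4, not an additional result):
\[
t_j^2 = j^{-2}, \qquad f^{\dagger} \in \mathcal{H}_2, \qquad \lambda = n^{-2/7}
\quad\Longrightarrow\quad
R_{\rho}(f_{\lambda}) = O\Big\{\big(\|f^{\dagger}\|_{\mathcal{H}_2}^2 + \sigma^2\big)\, n^{-6/7}\Big\},
\]
provided the qualification of {gλ}λ>0 is at least ⌈µ/2 + 1⌉ = 2. KPCR (qualification ∞) therefore attains the faster rate n−6/7 in this instance, whereas Proposition 4 gives KRR (qualification 1) no improvement over the n−2/3 rate of Corollary 1.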

6.2.2. Adaptation to Finite-Dimensional Signals

In this section, we focus on KPCR and signals (6) with 0 = βJ+1 = βJ+2 = · · · .

Proposition 5. Let fλ be the KPCR estimator with principal component regularization (8) and assume that f† = ΣJj=1 βjφj ∈ H. Fix 0 < r < 1. If t−2J = O(n1/2), (Jt4J)−1 = O(1), and λ = (1 − r)t2J, then
\[
R_{\rho}(f_{\lambda}) = O\left\{\big(\|f^{\dagger}\|_{\mathcal{H}}^2 + \sigma^2\big)\, \frac{J}{n}\right\}.
\]

The upper bound in Proposition 5 matches the bound in Corollary 4. However, Corollary 4 relies on the structure of the kernel (i.e., the decay of the eigenvalues {t2j}∞j=1), while Proposition 5 takes advantage of the structure of the signal f†. It follows from Proposition 5 that the KPCR estimator is minimax rate-optimal over J-dimensional signals in H for a very broad class of kernels. On the other hand, it is known that KRR may perform dramatically worse than KPCR in problems where f† lies in low-dimensional subspaces of the ambient space H, due to the saturation effect (see, for example, (Dhillon et al., 2013)).
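The following small simulation is ours and is not part of the paper; every choice (kernel, bandwidth, noise level, grid) is an illustrative assumption. It sketches how the finite-rank adaptation of Proposition 5 can be probed numerically: a rank-J "signal" is built from the top empirical kernel eigenvectors, KPCR uses the cut-off λ = (1 − r)t̂2J suggested by Proposition 5, and KRR is given the best λ on a grid for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, r, sigma = 400, 3, 0.5, 0.1

x = rng.uniform(0.0, 1.0, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)     # Gaussian kernel (illustrative)
evals, evecs = np.linalg.eigh(K / n)
evals, evecs = evals[::-1], evecs[:, ::-1]            # sort eigenvalues descending

f_true = evecs[:, :J] @ rng.normal(size=J)            # stand-in for a rank-J signal at the x_i
y = f_true + sigma * rng.normal(size=n)

def fitted(filter_fn):
    c = evecs @ (filter_fn(evals) * (evecs.T @ y))    # c = g_lambda(K/n) y
    return (K / n) @ c                                # f_lambda at the training points

def mse(fhat):
    return np.mean((fhat - f_true) ** 2)

kpcr_lam = (1 - r) * evals[J - 1]                     # cut-off level from Proposition 5
kpcr_err = mse(fitted(lambda t: np.where(t >= kpcr_lam, 1 / np.maximum(t, kpcr_lam), 0.0)))
krr_err = min(mse(fitted(lambda t, lam=lam: 1 / (t + lam)))
              for lam in np.logspace(-8, 0, 30))      # oracle-tuned ridge for comparison
print(f"KPCR: {kpcr_err:.2e}   KRR (best lambda): {krr_err:.2e}")
```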

7. Discussion

Our unified analysis for a general class of regularization families in nonparametric regression highlights two important statistical properties. First, the results show minimax optimality for this general class in several commonly studied settings, which was only previously established for specific regularization methods. Second, the results demonstrate the adaptivity of certain regularization families to sub-ellipsoids of the RKHS, showing that these techniques may take advantage of additional smoothness properties that the signal may possess. It is notable that the most well-studied family, KRR/Tikhonov regularization, does not possess this adaptation property.

The minimax optimality results were obtained using a priori settings of the regularization parameter λ that depend on the rate of decay of the eigenvalues corresponding to the kernel K and the distribution ρX. A data-driven estimator that chooses λ based on cross validation was proposed for a different class of regularized estimators by Caponnetto and Yao (2010) in a semi-supervised setting. Adapting a similar technique to the estimator fλ studied in our work


may be needed to fully take advantage of the adaptivity of sufficiently qualified regularizationfamilies.

Kernel methods were popularized as a computational "trick" for coping with high- and even infinite-dimensional feature spaces. However, they are now often regarded as computationally prohibitive for many very large data problems on account of their standard implementations' worst-case O(n3) time and O(n2) space complexity. Approximation methods for coping with this computational complexity include Nystrom approximation (Drineas and Mahoney, 2005; Williams and Seeger, 2001), matrix sketching (Gittens and Mahoney, 2013), and random features (Lopez-Paz et al., 2014; Rahimi and Recht, 2008). Recent work has considered the effect of these approximations on the statistical behavior of KRR (Bach, 2013; Yang et al., 2015). It is natural to also consider these approximations in the context of other regularization families, especially KPCR, given its connection to low-rank approximations.

Appendix A: Lemmas Required for Results in the Main Text

As in Propositions 2–3, let J ∈ N be a fixed positive integer. Define the n × J matrix Ψ0 = (ψj(xi))1≤i≤n; 1≤j≤J and let λmin(n−1Ψ0ᵀΨ0) be the smallest eigenvalue of n−1Ψ0ᵀΨ0. Additionally, define the event
\[
\mathcal{A}(\delta) = \left\{\lambda_{\min}\Big(\frac{1}{n}\Psi_0^{\top}\Psi_0\Big) \ge \delta\right\}.
\]

The next two lemmas are key to proving Propositions 2 and 3, respectively.

Lemma 1. On the event A(1/2),
\[
\left\|T^{1/2}\left\{I - g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi\right\}\beta\right\|^2 \le \left(\frac{12}{t_J^2}\lambda^2 + \frac{12}{t_J^2}\delta_1^2 + 13 t_{J+1}^2\right)\|f^{\dagger}\|_{\mathcal{H}}^2,
\]
where δ1 is defined in (19).

Proof. Let
\[
I_B = \left\|T^{1/2}\left\{I - g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi\right\}\beta\right\|^2.
\]
Then
\[
\begin{aligned}
I_B &= \left\|\begin{pmatrix} T_0^{1/2} & 0\\ 0 & 0\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 + \left\|\begin{pmatrix} 0 & 0\\ 0 & T_1^{1/2}\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 \\
&\le I_{B0} + t_{J+1}^2\left\|g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\|^2\|f^{\dagger}\|_{\mathcal{H}}^2 \\
&\le I_{B0} + t_{J+1}^2\,\|f^{\dagger}\|_{\mathcal{H}}^2, \tag{24}
\end{aligned}
\]
where the second-to-last inequality uses the fact that ||β||2 = ||f†||2H, the last bound follows from (R2), and
\[
I_{B0} = \left\|\begin{pmatrix} T_0^{1/2} & 0\\ 0 & 0\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2.
\]
Since we are on the event A(1/2),
\[
\begin{aligned}
I_{B0} &= \left\|\begin{pmatrix} T_0^{1/2}\big(\frac{1}{n}\Phi_0^{\top}\Phi_0\big)^{-1} & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0\\ 0 & 0\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 \\
&\le \left\|T_0^{1/2}\Big(\frac{1}{n}\Phi_0^{\top}\Phi_0\Big)^{-1}\right\|^2\left\|\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0\\ 0 & 0\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 \\
&\le \frac{4}{t_J^2}\left\|\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0\\ 0 & 0\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 \\
&\le \frac{12}{t_J^2}\left\|\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 \\
&\quad + \frac{12}{t_J^2}\left\|\begin{pmatrix} 0 & \frac{1}{n}\Phi_0^{\top}\Phi_1\\ \frac{1}{n}\Phi_1^{\top}\Phi_0 & \frac{1}{n}\Phi_1^{\top}\Phi_1 - T_1\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 \\
&\quad + \frac{12}{t_J^2}\left\|\begin{pmatrix} 0 & 0\\ 0 & T_1\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 \\
&\le \left\{\frac{12}{t_J^2}\big(\lambda^2 + \delta_1^2\big) + 12\, t_{J+1}^2\right\}\|f^{\dagger}\|_{\mathcal{H}}^2.
\end{aligned}
\]
The lemma follows by combining this bound on IB0 with (24).


Lemma 2. On the event A(1/2),
\[
\frac{1}{n^2} E\left\{\left\|T^{1/2} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\varepsilon\right\|^2 \,\middle|\, X\right\} \le \left\{2J + \frac{2J}{\lambda}\big(\delta_1 + t_{J+1}^2\big) + \frac{1}{\lambda}\mathrm{tr}(T_1)\right\}\frac{\sigma^2}{n}.
\]

Proof. Suppose we are on the event A(1/2) and write
\[
\frac{1}{n^2} E\left\{\left\|T^{1/2} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\varepsilon\right\|^2 \,\middle|\, X\right\} = I_{V0} + I_{V1}, \tag{25}
\]
where
\[
I_{V0} = \frac{1}{n^2} E\left\{\left\|\begin{pmatrix} T_0^{1/2} & 0\\ 0 & 0\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\varepsilon\right\|^2 \,\middle|\, X\right\},
\qquad
I_{V1} = \frac{1}{n^2} E\left\{\left\|\begin{pmatrix} 0 & 0\\ 0 & T_1^{1/2}\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\varepsilon\right\|^2 \,\middle|\, X\right\}.
\]
We bound IV0 and IV1 separately.

To bound IV0, first we use (2) to obtain
\[
I_{V0} \le \frac{\sigma^2}{n^2}\,\mathrm{tr}\left\{\Phi\, g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} T_0 & 0\\ 0 & 0\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\right\}. \tag{26}
\]
If A, B are trace class operators, then tr(AB) = tr(BA); if, furthermore, they are positive and self-adjoint, then tr(AB) ≤ ||A|| tr(B) (see, for example, Theorem 18.11 of (Conway, 1999)). Thus,
\[
\begin{aligned}
\frac{\sigma^2}{n^2}\,\mathrm{tr}&\left\{\Phi\, g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} T_0 & 0\\ 0 & 0\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\right\}
= \frac{\sigma^2}{n}\,\mathrm{tr}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)^2\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} T_0 & 0\\ 0 & 0\end{pmatrix}\right\} \\
&= \frac{\sigma^2}{n}\,\mathrm{tr}\left\{\begin{pmatrix} \big(\frac{1}{n}\Phi_0^{\top}\Phi_0\big)^{1/2} & 0\\ 0 & 0\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)^2\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} \big(\frac{1}{n}\Phi_0^{\top}\Phi_0\big)^{1/2} & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix} \big(\frac{1}{n}\Phi_0^{\top}\Phi_0\big)^{-1/2} T_0 \big(\frac{1}{n}\Phi_0^{\top}\Phi_0\big)^{-1/2} & 0\\ 0 & 0\end{pmatrix}\right\} \\
&\le \frac{\sigma^2}{n}\left\|\begin{pmatrix} \big(\frac{1}{n}\Phi_0^{\top}\Phi_0\big)^{1/2} & 0\\ 0 & 0\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)^2\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} \big(\frac{1}{n}\Phi_0^{\top}\Phi_0\big)^{1/2} & 0\\ 0 & 0\end{pmatrix}\right\|\,\mathrm{tr}\left\{T_0\Big(\frac{1}{n}\Phi_0^{\top}\Phi_0\Big)^{-1}\right\} \\
&= \frac{\sigma^2}{n}\left\|\frac{1}{\sqrt{n}}\Phi\, g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0\\ 0 & 0\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{\sqrt{n}}\Phi^{\top}\right\|\,\mathrm{tr}\left\{T_0\Big(\frac{1}{n}\Phi_0^{\top}\Phi_0\Big)^{-1}\right\} \\
&\le \frac{2J\sigma^2}{n}\left\|\frac{1}{\sqrt{n}}\Phi\, g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0\\ 0 & 0\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{\sqrt{n}}\Phi^{\top}\right\|,
\end{aligned}
\]
where the last inequality follows from the fact that we are on the event A(1/2). Combining this with (26) yields
\[
I_{V0} \le \frac{2J\sigma^2}{n}\left\|\frac{1}{\sqrt{n}}\Phi\, g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0\\ 0 & 0\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{\sqrt{n}}\Phi^{\top}\right\|.
\]
Next, we use the decomposition (20) and the regularization conditions (R1) and (R3):
\[
\begin{aligned}
I_{V0} &\le \frac{2J\sigma^2}{n}\left\|g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)^2\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)^2\right\| \\
&\quad + \frac{2J\sigma^2}{n}\left\|\frac{1}{\sqrt{n}}\Phi\, g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} 0 & \frac{1}{n}\Phi_0^{\top}\Phi_1\\ \frac{1}{n}\Phi_1^{\top}\Phi_0 & \frac{1}{n}\Phi_1^{\top}\Phi_1 - T_1\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{\sqrt{n}}\Phi^{\top}\right\| \\
&\quad + \frac{2J\sigma^2}{n}\left\|\frac{1}{\sqrt{n}}\Phi\, g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\begin{pmatrix} 0 & 0\\ 0 & T_1\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{\sqrt{n}}\Phi^{\top}\right\| \\
&\le \left\{2J + \frac{2J}{\lambda}\big(\delta_1 + t_{J+1}^2\big)\right\}\frac{\sigma^2}{n}. \tag{27}
\end{aligned}
\]
Finally, to bound IV1,
\[
\begin{aligned}
I_{V1} &\le \frac{\sigma^2}{n}\,\mathrm{tr}\left\{\begin{pmatrix} 0 & 0\\ 0 & T_1\end{pmatrix} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)^2\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\right\}
\le \frac{\sigma^2}{n}\,\mathrm{tr}\left\{\begin{pmatrix} 0 & 0\\ 0 & T_1\end{pmatrix}\right\}\left\|g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)^2\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\right\| \\
&\le \frac{\sigma^2}{\lambda n}\,\mathrm{tr}(T_1). \tag{28}
\end{aligned}
\]
The result follows by combining (25) and (27)–(28).

In order to prove Theorem 1 and other results in the main text we require large deviation and moment bounds on quantities related to n−1ΦᵀΦ − T. These bounds are collected in Lemmas 3–5 and Corollary 5 below. Lemma 3 is a general large deviation bound for sums of self-adjoint Hilbert-Schmidt operators; the other results are more tailored to the setting studied in the main text.

Lemma 3. Let R > 0 be a positive real constant and consider a finite sequence of self-adjoint Hilbert-Schmidt operators {Xi}ni=1 satisfying E(Xi) = 0 and ||Xi|| ≤ R almost surely. Define Y = Σni=1 Xi and suppose there are constants C, d > 0 satisfying ||E(Y2)|| ≤ C and tr{E(Y2)} ≤ dC. For all t ≥ C1/2 + R/3,
\[
P(\|Y\| \ge t) \le 4d\,\exp\left(\frac{-t^2/2}{C + Rt/3}\right).
\]

Proof. This is a straightforward generalization of Theorem 7.7.1 of (Tropp, 2015), using the arguments from Section 4 of (Minsker, 2011) to extend from self-adjoint matrices to self-adjoint Hilbert-Schmidt operators.


Corollary 5. Suppose the conditions of Lemma 3 are satisfied and that q > 0 is a constant. Then
\[
E(\|Y\|) \le (1 + 15d)\,C^{1/2} + \Big(\frac{1 + 16d}{3}\Big)R,
\qquad
E(\|Y\|^2) \le (2 + 32d)\,C + \Big(\frac{2 + 128d}{9}\Big)R^2,
\qquad
E(\|Y\|^q) = O\big\{(1 + d)\big(C^{q/2} + R^q\big)\big\},
\]
where the implicit constant in the big-O bound may depend on q, but not on d, C, or R.

Proof. We bound E(||Y||) first. Indeed,
\[
\begin{aligned}
E(\|Y\|) &= \int_0^{\infty} P(\|Y\| \ge t)\, dt \\
&\le C^{1/2} + \frac{R}{3} + \int_{C^{1/2}+R/3}^{\infty} 4d\,\exp\left(\frac{-t^2/2}{C + Rt/3}\right) dt \\
&\le C^{1/2} + \frac{R}{3} + \int_0^{3C/R} 4d\,\exp\Big(-\frac{t^2}{4C}\Big)\, dt + \int_{3C/R}^{\infty} 4d\,\exp\Big(-\frac{3t}{4R}\Big)\, dt \\
&= C^{1/2} + \Big(\frac{1}{3} + \frac{16}{3}\,d\, e^{-\frac{9C}{4R^2}}\Big)R + \int_0^{3C/R} 4d\,\exp\Big(-\frac{t^2}{4C}\Big)\, dt \\
&\le \big(1 + 8d\pi^{1/2}\big)C^{1/2} + \Big(1 + 16d\, e^{-\frac{9C}{4R^2}}\Big)\frac{R}{3} \\
&\le (1 + 15d)\,C^{1/2} + \Big(\frac{1 + 16d}{3}\Big)R.
\end{aligned}
\]
For the second moment bound,
\[
\begin{aligned}
E(\|Y\|^2) &= \int_0^{\infty} P\big(\|Y\| \ge t^{1/2}\big)\, dt \\
&\le \Big(C^{1/2} + \frac{R}{3}\Big)^2 + \int_{(C^{1/2}+R/3)^2}^{\infty} 4d\,\exp\left(\frac{-t/2}{C + Rt^{1/2}/3}\right) dt \\
&\le \Big(C^{1/2} + \frac{R}{3}\Big)^2 + \int_0^{9C^2/R^2} 4d\,\exp\Big(-\frac{t}{4C}\Big)\, dt + \int_{9C^2/R^2}^{\infty} 4d\,\exp\Big(-\frac{3t^{1/2}}{4R}\Big)\, dt \\
&= \Big(C^{1/2} + \frac{R}{3}\Big)^2 + 16Cd\left\{1 - \exp\Big(-\frac{9C}{4R^2}\Big)\right\} + \frac{128R^2 d}{9}\left\{\frac{9C}{4R^2} + 1\right\}\exp\Big(-\frac{9C}{4R^2}\Big) \\
&= \Big(C^{1/2} + \frac{R}{3}\Big)^2 + 16Cd\left\{1 + \exp\Big(-\frac{9C}{4R^2}\Big)\right\} + \frac{128R^2 d}{9}\exp\Big(-\frac{9C}{4R^2}\Big) \\
&\le \Big(C^{1/2} + \frac{R}{3}\Big)^2 + 32Cd + \frac{128R^2 d}{9} \\
&\le (2 + 32d)\,C + \Big(\frac{2 + 128d}{9}\Big)R^2.
\end{aligned}
\]
Finally, to bound E(||Y||q),
\[
\begin{aligned}
E(\|Y\|^q) &= \int_0^{\infty} P\big(\|Y\| \ge t^{1/q}\big)\, dt \\
&\le \Big(C^{1/2} + \frac{R}{3}\Big)^q + \int_{(C^{1/2}+R/3)^q}^{\infty} 4d\,\exp\left(\frac{-t^{2/q}/2}{C + Rt^{1/q}/3}\right) dt \\
&\le \Big(C^{1/2} + \frac{R}{3}\Big)^q + \int_0^{(3C/R)^q} 4d\,\exp\Big(-\frac{t^{2/q}}{4C}\Big)\, dt + \int_{(3C/R)^q}^{\infty} 4d\,\exp\Big(-\frac{3t^{1/q}}{4R}\Big)\, dt \\
&= O\big\{(1 + d)\big(C^{q/2} + R^q\big)\big\},
\end{aligned}
\]
as was to be shown.

Lemma 4. Let δ, δ1 be as defined in (18)–(19) and let q > 0 be a constant. Then
\[
E(\delta) \le \frac{16\kappa^2}{n^{1/2}} + \frac{6\kappa^2}{n}, \tag{29}
\]
\[
E(\delta^2) \le \frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2}, \tag{30}
\]
\[
E(\delta^q) = O\left\{\Big(\frac{\kappa^4}{n}\Big)^{q/2}\right\} \tag{31}
\]
and
\[
E(\delta_1) \le 16\left\{\frac{\kappa^2}{n}\mathrm{tr}(T_1)\right\}^{1/2} + \frac{6\kappa^2}{n}, \tag{32}
\]
\[
E(\delta_1^2) \le \frac{34\kappa^2}{n}\mathrm{tr}(T_1) + \frac{15\kappa^4}{n^2}, \tag{33}
\]
\[
E(\delta_1^q) = O\left[\left\{\frac{\kappa^2}{n}\mathrm{tr}(T_1)\right\}^{q/2} + \Big(\frac{\kappa^2}{n}\Big)^q\right]. \tag{34}
\]

Proof. These bounds follow from Corollary 5. To prove the bounds on δ, we take Y = n−1ΦᵀΦ − T = Σni=1 Xi and Xi = n−1(φiφiᵀ − T), where φi = (φ1(xi), φ2(xi), . . . )ᵀ. Clearly E(Xi) = 0 and ||Xi|| ≤ κ2/n. Additionally, since
\[
E(Y^2) = \frac{1}{n} E\big\{(\phi_i\phi_i^{\top} - T)^2\big\} = \frac{1}{n} E\big(\|\phi_i\|^2 \phi_i\phi_i^{\top}\big) - \frac{1}{n} T^2,
\]
it follows that ||E(Y2)|| ≤ κ4/n and tr{E(Y2)} ≤ κ4/n. The bounds (29)–(31) now follow from a direct application of Corollary 5 with C = κ4/n, d = 1, and R = κ2/n.

To prove (32)–(34), take
\[
Y = \begin{pmatrix} 0 & \frac{1}{n}\Phi_0^{\top}\Phi_1 \\ \frac{1}{n}\Phi_1^{\top}\Phi_0 & \frac{1}{n}\Phi_1^{\top}\Phi_1 - T_1 \end{pmatrix} = \sum_{i=1}^{n} X_i,
\qquad
X_i = \frac{1}{n}\begin{pmatrix} 0 & \phi_{0,i}\phi_{1,i}^{\top} \\ \phi_{1,i}\phi_{0,i}^{\top} & \phi_{1,i}\phi_{1,i}^{\top} - T_1 \end{pmatrix},
\]
where φ0,i = (φ1(xi), . . . , φJ(xi))ᵀ and φ1,i = (φJ+1(xi), φJ+2(xi), . . . )ᵀ. Observe that E(Xi) = 0 and ||Xi|| ≤ κ2/n. Additionally, since
\[
E(Y^2) = n E(X_i^2)
= \frac{1}{n} E\left\{\begin{pmatrix} \|\phi_{1,i}\|^2 \phi_{0,i}\phi_{0,i}^{\top} & \phi_{0,i}\phi_{1,i}^{\top}\big(\phi_{1,i}\phi_{1,i}^{\top} - T_1\big) \\ \big(\phi_{1,i}\phi_{1,i}^{\top} - T_1\big)\phi_{1,i}\phi_{0,i}^{\top} & \big(\phi_{1,i}\phi_{1,i}^{\top} - T_1\big)^2 \end{pmatrix}\right\}
= \frac{1}{n} E\big(\|\phi_{1,i}\|^2 \phi_i\phi_i^{\top}\big) - \frac{1}{n}\begin{pmatrix} 0 & 0 \\ 0 & T_1^2 \end{pmatrix},
\]
it follows that ||E(Y2)|| ≤ κ2 t2J+1/n ≤ κ2 tr(T1)/n and tr{E(Y2)} ≤ κ2 tr(T1)/n. Bounds (32)–(34) follow from Corollary 5, with C = κ2 tr(T1)/n, d = 1, and R = κ2/n.


Lemma 5. Suppose that τ2J ≥ 0 and
\[
\sup_{x \in \mathcal{X}} \sum_{j=1}^{J}\psi_j(x)^2 \le \frac{1}{\tau_J^2}.
\]
If 0 ≤ δ < 1, then
\[
\Pr\{\mathcal{A}(\delta)\} \ge 1 - J\left\{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right\}^{\tau_J^2 n}.
\]

Proof. If τ2J = 0, the result is trivial. For τ2J > 0, this is a direct application of Theorem 5.1.1 from (Tropp, 2015). Indeed, consider the decomposition
\[
\frac{1}{n}\Psi_0^{\top}\Psi_0 = \sum_{i=1}^{n} X_i, \qquad X_i = \frac{1}{n}\psi_{0,i}\psi_{0,i}^{\top},
\]
where ψ0,i = (ψ1(xi), . . . , ψJ(xi))ᵀ. The lemma follows by Theorem 5.1.1 of (Tropp, 2015), upon noticing that
\[
\|X_i\| = \left\|\frac{1}{n}\psi_{0,i}\psi_{0,i}^{\top}\right\| = \frac{1}{n}\sum_{j=1}^{J}\psi_j(x_i)^2 \le \frac{1}{\tau_J^2 n}
\quad\text{and}\quad
E\Big(\frac{1}{n}\Psi_0^{\top}\Psi_0\Big) = I.
\]
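As a quick numerical sanity check of the event A(1/2) (our own illustration, not part of the original text; the cosine basis on [0, 1] with uniform ρX is an assumed example in which the ψj are orthonormal and uniformly bounded), one can estimate Pr{A(1/2)} by simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, J, reps = 500, 10, 200

def psi(x):
    # psi_1 = 1, psi_j(x) = sqrt(2) cos((j-1) pi x): orthonormal in L2(Uniform[0,1])
    j = np.arange(J)
    return np.where(j == 0, 1.0, np.sqrt(2.0) * np.cos(j * np.pi * x[:, None]))

hits = 0
for _ in range(reps):
    x = rng.uniform(0.0, 1.0, n)
    P = psi(x)                                   # the n x J matrix Psi_0
    lam_min = np.linalg.eigvalsh(P.T @ P / n)[0]
    hits += lam_min >= 0.5                       # the event A(1/2)
print("estimated Pr{A(1/2)} =", hits / reps)
```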

Appendix B: Proofs of Results from the Main Text

Proof of Proposition 2. We decompose Bρ(gλ) according to whether we are on the event A(1/2) or not. Let
\[
B_{\rho}(g_{\lambda}) = B_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)\} + B_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)^c\}, \tag{35}
\]
where
\[
B_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)\} = E\left[\left\|T^{1/2}\left\{I - g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi\right\}\beta\right\|^2 ;\, \mathcal{A}(1/2)\right],
\qquad
B_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)^c\} = E\left[\left\|T^{1/2}\left\{I - g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi\right\}\beta\right\|^2 ;\, \mathcal{A}(1/2)^c\right].
\]
By Lemmas 1 and 4,
\[
\begin{aligned}
B_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)\} &\le E\left\{\Big(\frac{12}{t_J^2}\lambda^2 + \frac{12}{t_J^2}\delta_1^2 + 13 t_{J+1}^2\Big)\|f^{\dagger}\|_{\mathcal{H}}^2 ;\, \mathcal{A}(1/2)\right\} \\
&\le \left\{\frac{12}{t_J^2}\lambda^2 + \frac{12}{t_J^2}E(\delta_1^2) + 13 t_{J+1}^2\right\}\|f^{\dagger}\|_{\mathcal{H}}^2 \\
&\le \left\{\frac{12}{t_J^2}\lambda^2 + \frac{408\kappa^2}{n t_J^2}\mathrm{tr}(T_1) + \frac{180\kappa^4}{n^2 t_J^2} + 13 t_{J+1}^2\right\}\|f^{\dagger}\|_{\mathcal{H}}^2. \tag{36}
\end{aligned}
\]
On the other hand, since
\[
\left\|T^{1/2}\left\{I - g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi\right\}\beta\right\|^2 \le \kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2,
\]
it follows from Lemma 5 that
\[
B_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)^c\} \le \kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2\,\Pr\{\mathcal{A}(1/2)^c\} \le \kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2\, J\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}. \tag{37}
\]
The proposition follows by combining (35)–(37). □

Proof of Proposition 3. The steps of the proof are similar to those in the proof of Proposition 2. Let
\[
V_{\rho}(g_{\lambda}) = V_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)\} + V_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)^c\}, \tag{38}
\]
where
\[
V_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)\} = \frac{1}{n^2} E\left\{\left\|T^{1/2} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\varepsilon\right\|^2 ;\, \mathcal{A}(1/2)\right\},
\qquad
V_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)^c\} = \frac{1}{n^2} E\left\{\left\|T^{1/2} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\varepsilon\right\|^2 ;\, \mathcal{A}(1/2)^c\right\}.
\]
By Lemmas 2 and 4,
\[
\begin{aligned}
V_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)\} &\le E\left[\left\{2J + \frac{2J}{\lambda}\big(\delta_1 + t_{J+1}^2\big) + \frac{1}{\lambda}\mathrm{tr}(T_1)\right\}\frac{\sigma^2}{n} ;\, \mathcal{A}(1/2)\right] \\
&\le \left[2J + \frac{2J}{\lambda}\big\{E(\delta_1) + t_{J+1}^2\big\} + \frac{1}{\lambda}\mathrm{tr}(T_1)\right]\frac{\sigma^2}{n} \\
&\le \left[2J + \frac{32J}{\lambda}\left\{\frac{\kappa^2}{n}\mathrm{tr}(T_1)\right\}^{1/2} + \frac{12\kappa^2 J}{\lambda n} + \frac{2J t_{J+1}^2}{\lambda} + \frac{1}{\lambda}\mathrm{tr}(T_1)\right]\frac{\sigma^2}{n}. \tag{39}
\end{aligned}
\]
On the other hand, since
\[
\frac{1}{n^2} E\left\{\left\|T^{1/2} g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Phi^{\top}\varepsilon\right\|^2 \,\middle|\, X\right\} \le \frac{\kappa^2\sigma^2}{\lambda n},
\]
it follows from Lemma 5 that
\[
V_{\rho}\{g_{\lambda}; \mathcal{A}(1/2)^c\} \le \frac{\kappa^2\sigma^2}{\lambda n}\Pr\{\mathcal{A}(1/2)^c\} \le \frac{\kappa^2\sigma^2 J}{\lambda n}\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}. \tag{40}
\]
The proposition follows by combining (38)–(40). □

Proof of Corollary 1. Choose J in Theorem 1 so that t2J+1 < λ ≤ t2J. Then J ≤ C1/(2ν)λ−1/(2ν), n = λ−(1+1/(2ν)) and, applying Theorem 1 with τ2j = t2j/κ2,
\[
R_{\rho}(f_{\lambda}) \le \left\{20 C^{\frac{1}{2\nu}}\lambda + \big(1 + 16 C^{\frac{1}{2\nu}}\kappa^2\big)\lambda^{\frac{1}{2\nu}}\sum_{j=J+1}^{\infty} t_j^2\right\}\sigma^2
+ \left\{25\lambda + 408\kappa^2 \lambda^{\frac{1}{2\nu}}\sum_{j=J+1}^{\infty} t_j^2\right\}\|f^{\dagger}\|_{\mathcal{H}}^2
+ o\big\{\big(\sigma^2 + \|f^{\dagger}\|_{\mathcal{H}}^2\big)\lambda\big\}.
\]
Since λ = n−2ν/(2ν+1), the corollary will follow if we can show that
\[
\sum_{j=J+1}^{\infty} t_j^2 = O\big(\lambda^{1 - \frac{1}{2\nu}}\big). \tag{41}
\]
Let J0 = ⌊C1/(2ν)λ−1/(2ν)⌋ + 1. Then
\[
\begin{aligned}
\sum_{j=J+1}^{\infty} t_j^2 &\le \sum_{j=J+1}^{J_0} t_j^2 + \sum_{j=J_0+1}^{\infty} t_j^2 \\
&\le J_0\lambda + C\int_{J_0}^{\infty} t^{-2\nu}\, dt \\
&\le J_0\lambda + \frac{C}{2\nu - 1}J_0^{1-2\nu} \\
&\le \Big(1 + \frac{2\nu}{2\nu - 1}C^{\frac{1}{2\nu}}\Big)\lambda^{1-\frac{1}{2\nu}}, \tag{42}
\end{aligned}
\]
which yields (41). This completes the proof of Corollary 1. □

Proof of Corollary 2. Choose J in Theorem 1 so that
\[
t_{J+1}^2 < \frac{2\kappa^2}{\log(e/2)}\lambda \le t_J^2.
\]
Then J = O{log(n)} and it follows from Theorem 1 (with τ2j = t2j/κ2) that
\[
R_{\rho}(f_{\lambda}) \le O\left[\big(\sigma^2 + \|f^{\dagger}\|_{\mathcal{H}}^2\big)\left\{\lambda + \frac{1}{\log(n)}\sum_{j=J+1}^{\infty} t_j^2\right\}\right].
\]
The corollary will follow if we can show that
\[
\sum_{j=J+1}^{\infty} t_j^2 = O\left\{\frac{\log(n)^2}{n}\right\}.
\]
Let J0 = ⌊α−1 log(n)⌋ + 1. Then
\[
\sum_{j=J+1}^{\infty} t_j^2 \le J_0 t_{J+1}^2 + \sum_{j=J_0+1}^{\infty} t_j^2 \le O\left\{\frac{\log(n)^2}{n}\right\} + C\sum_{j=J_0+1}^{\infty} e^{-\alpha j} = O\left\{\frac{\log(n)^2}{n}\right\},
\]
as was to be shown. □

Proof of Corollary 3. Choose J in Theorem 1 so that t2J+1 < λ ≤ t2J. Then J = O{log(n)1/2} and, by Theorem 1 with τ2j = C−1e−αj,
\[
R_{\rho}(f_{\lambda}) = O\left[\big(\sigma^2 + \|f^{\dagger}\|_{\mathcal{H}}^2\big)\left\{\lambda + \frac{1}{\log(n)^{1/2}}\sum_{j=J+1}^{\infty} t_j^2\right\}\right].
\]
The corollary will follow if we can show that
\[
\sum_{j=J+1}^{\infty} t_j^2 = O\left\{\frac{\log(n)}{n}\right\}.
\]
Let
\[
J_0 = \left\lceil\left\{\frac{1}{\alpha}\log\Big(\frac{C}{2\alpha\lambda}\Big)\right\}^{1/2}\right\rceil.
\]
Then
\[
\sum_{j=J+1}^{\infty} t_j^2 \le J_0\lambda + C\int_{J_0}^{\infty} e^{-\alpha t^2}\, dt \le J_0\lambda + \frac{C}{2\alpha J_0}e^{-\alpha J_0^2} = O\left\{\frac{\log(n)}{n}\right\},
\]
as was to be shown. □

Proof of Corollary 4. To prove this corollary, the strategy is the same as in Corollaries 1–3. However, we must modify the bounds in Lemmas 1–2. First note that T1 = 0. Then, following the notation in the proofs of Lemmas 1–2, on the event A(1/2),
\[
\begin{aligned}
I_B = I_{B0} &\le \left\|T_0^{1/2}\Big(\frac{1}{n}\Phi_0^{\top}\Phi_0\Big)^{-1/2}\right\|^2\left\|\begin{pmatrix}\big(\frac{1}{n}\Phi_0^{\top}\Phi_0\big)^{1/2} & 0\\ 0 & 0\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 \\
&\le \mathrm{tr}\left\{T_0\Big(\frac{1}{n}\Phi_0^{\top}\Phi_0\Big)^{-1}\right\}\left\|\begin{pmatrix}\big(\frac{1}{n}\Phi_0^{\top}\Phi_0\big)^{1/2} & 0\\ 0 & 0\end{pmatrix}\left\{g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta\right\|^2 \\
&\le 2J\lambda\,\|f^{\dagger}\|_{\mathcal{H}}^2.
\end{aligned}
\]
It follows that
\[
B_{\rho}(g_{\lambda}) \le \left\{2J\lambda + \kappa^2 J\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}\right\}\|f^{\dagger}\|_{\mathcal{H}}^2. \tag{43}
\]
Additionally, on the event A(1/2),
\[
I_V = I_{V0} \le \frac{\sigma^2}{n}\,\mathrm{tr}\left\{T_0\Big(\frac{1}{n}\Phi_0^{\top}\Phi_0\Big)^{-1}\right\}\left\|g_{\lambda}\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\Big(\frac{1}{n}\Phi^{\top}\Phi\Big)\right\|^2 \le \frac{2J\sigma^2}{n}.
\]
Thus,
\[
V_{\rho}(g_{\lambda}) \le \left\{2J + \frac{\kappa^2 J}{\lambda}\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}\right\}\frac{\sigma^2}{n}. \tag{44}
\]
Combining Proposition 1 and (43)–(44), we obtain
\[
R_{\rho}(f_{\lambda}) \le \left\{2J\lambda + \kappa^2 J\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}\right\}\|f^{\dagger}\|_{\mathcal{H}}^2 + \left\{2J + \frac{\kappa^2 J}{\lambda}\Big(\frac{2}{e}\Big)^{\frac{\tau_J^2 n}{2}}\right\}\frac{\sigma^2}{n}.
\]
Corollary 4 follows immediately. □


Proof of Proposition 4 Pick J so that t2J+1 < λ ≤ t2J . For real numbers µ1, µ2 ≥ 0 andfunctions h : [0,∞)→ [0,∞), define

Kµ1,µ2(h) =

∣∣∣∣∣∣∣∣T µ1h( 1

nΦ>Φ

)T µ2

∣∣∣∣∣∣∣∣ .Also define the operator S by Sh(t) = th(t). Let hλ(t) = 1− gλ(t)t. Finally, suppose that

f † =∞∑j=1

βjφj =∞∑j=1

αjtµj φj ∈ Hµ.

Then ||f †||2Hµ = ||α||2, where α = (α1, α2, ...)>.

To prove the proposition, we follow the same strategy as in the proofs of Propositions 2–3;however, several modifications are required along the way. Following the notation of Lemmas1–2, our first task is to bound IB. We have

IB =

∣∣∣∣∣∣∣∣T 1/2

{I − gλ

(1

nΦ>Φ

)(1

nΦ>Φ

)}β

∣∣∣∣∣∣∣∣2=

∣∣∣∣∣∣∣∣T 1/2

{I − gλ

(1

nΦ>Φ

)(1

nΦ>Φ

)}T µ/2α

∣∣∣∣∣∣∣∣2≤ K1/2,µ/2(hλ)

2||f †||2Hµ . (45)

Furthermore, on A(1/2),

K1/2,µ/2(hλ) =

∣∣∣∣∣∣∣∣( T 1/20 00 0

)hλ

(1

nΦ>Φ

)T µ/2

∣∣∣∣∣∣∣∣+

∣∣∣∣∣∣∣∣( 0 0

0 T 1/21

)hλ

(1

nΦ>Φ

)T µ/2

∣∣∣∣∣∣∣∣≤

∣∣∣∣∣∣∣∣∣∣T 1/2

0

(1

nΦ>

0 Φ0

)−1∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣( 1

nΦ>

0 Φ0 00 0

)hλ

(1

nΦ>Φ

)T µ/2

∣∣∣∣∣∣∣∣+ tJ+1K0,µ/2(hλ)

≤ 2

tJ

{K0,µ/2(Shλ) + (δ1 + t2J+1)K0,µ/2(hλ)

}+ tJ+1K0,µ/2(hλ)

≤ 2

tJK0,µ/2(Shλ) +

(2δ1tJ

+ 3tJ+1

)K0,µ/2(hλ). (46)

Next we bound K0,µ/2(hλ) and K0,µ/2(Shλ). Suppose that µ/2 = m + γ, where m ≥ 0 isan integer and 0 ≤ g < 1. Recall the definition of δ from (18). By repeated application of the

Page 30: Kernel methods and regularization techniques for ...As discussed in Section1, our goal is nd estimators f^ that minimize the risk (1). In this paper, we focus on regularization-based

L.H. Dicker et al./Kernel methods for nonparametric regression 30

triangle inequality,

K0,µ/2(h) = Km+γ,0(h)

=

∣∣∣∣∣∣∣∣T m+γh

(1

nΦ>Φ

)∣∣∣∣∣∣∣∣≤∣∣∣∣∣∣∣∣T m−1+γSh( 1

nΦ>Φ

)∣∣∣∣∣∣∣∣+ δκ2(m−1+γ)∣∣∣∣∣∣∣∣h( 1

nΦ>Φ

)∣∣∣∣∣∣∣∣...

≤∣∣∣∣∣∣∣∣T γSmh( 1

nΦ>Φ

)∣∣∣∣∣∣∣∣+ δ

m∑j=1

κ2(m−j+γ)∣∣∣∣∣∣∣∣Sj−1h( 1

nΦ>Φ

)∣∣∣∣∣∣∣∣= K0,γ(S

mh) + δm∑j=1

κ2(m−j+γ)∣∣∣∣∣∣∣∣Sj−1h( 1

nΦ>Φ

)∣∣∣∣∣∣∣∣ . (47)

Bounding K0,γ(Smh) will require some additional manipulations. However, before proceeding

in this direction, we gather our bounds so far. By (46)–(47) and because {gλ} has qualificationat least dµ/2e,

K1/2,µ/2(hλ) = O

{1

λ1/2K0,γ(S

m+1hλ) +

(δ1λ1/2

+ λ1/2)K0,γ(S

mhλ) + δλ1/2 +δ1δ

λ1/2

}on A. Combining this with (45) yields

Bρ(gλ) = E{IB;A(1/2)}+ E{IB;A(1/2)c}

≤ O

[E{

1

λK0,γ(S

m+1hλ)2 +

(δ21λ

+ λ

)K0,γ(S

mhλ)2 + δ2λ+

δ21δ2

λ

}||f †||2Hµ

]+ κ2(µ+1)||f †||2Hµ Pr {A(1/2)c} .

By Lemma 4,

E(δ2) = O

(1

n

),

E(δ21δ2) = O

{tr(T1)n2

+1

n3

}.

(Note that in the current proof, we allow the big-O constants to depend on the kernel K,while this is not allowed in Lemma 4; see the comment following the statement of Proposition

Page 31: Kernel methods and regularization techniques for ...As discussed in Section1, our goal is nd estimators f^ that minimize the risk (1). In this paper, we focus on regularization-based

L.H. Dicker et al./Kernel methods for nonparametric regression 31

4 and the statement of Corollary 5.) Thus,

Bρ(gλ) ≤ O

[E{

1

λK0,γ(S

m+1hλ)2 +

(δ21λ

+ λ

)K0,γ(S

mhλ)2

}||f †||2Hµ

]+O

n||f †||2Hµ +

1

λ

{tr(T1)n2

+1

n3

}||f †||2Hµ

]+ κ2(µ+1)||f †||2Hµ Pr {A(1/2)c}

= O

[E{

1

λK0,γ(S

m+1hλ)2 +

(δ21λ

+ λ

)K0,γ(S

mhλ)2

}||f †||2Hµ

]+ o

(||f †||2Hλµ+1

), (48)

where we have also made use of Lemma 5 and (42) to bound Pr{A(1/2)c} and tr(T1), respec-tively.

To complete our analysis of the bias term Bρ(gλ), it remains to bound the expectationsinvolving K0,γ above. We consider separately the cases where 0 ≤ γ ≤ 1/2 and 1/2 < γ < 1.If 1/2 < γ < 1 and h : [0,∞) → [0,∞) is a nonnegative function, then, since z 7→ zγ isoperator monotone (Mathe and Pereverzev, 2002),

K0,γ(h) =

∣∣∣∣∣∣∣∣T γh( 1

nΦ>Φ

)∣∣∣∣∣∣∣∣≤∣∣∣∣∣∣∣∣T γ − ( 1

nΦ>Φ

)γ∣∣∣∣∣∣∣∣K0,0(h) +

∣∣∣∣∣∣∣∣( 1

nΦ>Φ

)γh

(1

nΦ>Φ

)∣∣∣∣∣∣∣∣≤ δγK0,0(h) +

∣∣∣∣∣∣∣∣( 1

nΦ>Φ

)γh

(1

nΦ>Φ

)∣∣∣∣∣∣∣∣ .Hence, taking h = Smhλ and h = Sm+1hλ, we conclude that if 1/2 < γ < 1, then

K0,γ(Smhλ) ≤ δγλm + λµ/2,

K0,γ(Sm+1hλ) ≤ δγλm+1 + λµ/2+1.

Thus, by Lemmas 4 and , if q > 0 is constant and 1/2 < γ < 1, then

\[
\begin{aligned}
E\{K_{0,\gamma}(S^{m}h_\lambda)^{q}\} &= O\big\{n^{-q\gamma/2}\lambda^{qm} + \lambda^{q\mu/2}\big\} = O(\lambda^{q\mu/2}), &\qquad (49)\\
E\{K_{0,\gamma}(S^{m+1}h_\lambda)^{q}\} &= O\big\{n^{-q\gamma/2}\lambda^{q(m+1)} + \lambda^{q(\mu/2+1)}\big\} = O\{\lambda^{q(\mu/2+1)}\}. &\qquad (50)
\end{aligned}
\]
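The matrix-analytic fact behind the case just treated, namely that operator monotonicity of $z \mapsto z^{\gamma}$ yields $\|T^{\gamma} - (\tfrac{1}{n}\Phi^\top\Phi)^{\gamma}\| \le \|T - \tfrac{1}{n}\Phi^\top\Phi\|^{\gamma}$ for $0 < \gamma < 1$, can be spot-checked numerically. The sketch below is ours (random finite-dimensional positive semidefinite matrices stand in for the operators; helper names are illustrative):

```python
import numpy as np

def psd_power(M, gamma):
    """Fractional power of a symmetric PSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)              # guard against tiny negative eigenvalues
    return (V * w**gamma) @ V.T

rng = np.random.default_rng(0)
for gamma in (0.25, 0.5, 0.75):
    worst = 0.0
    for _ in range(200):
        X = rng.standard_normal((40, 10))
        Y = X + 0.3 * rng.standard_normal((40, 10))
        A, B = X.T @ X / 40, Y.T @ Y / 40  # two nearby random PSD matrices
        lhs = np.linalg.norm(psd_power(A, gamma) - psd_power(B, gamma), 2)
        rhs = np.linalg.norm(A - B, 2) ** gamma
        worst = max(worst, lhs / rhs)
    print(f"gamma = {gamma}: max of ||A^g - B^g|| / ||A - B||^g over trials = {worst:.3f} (should be <= 1)")
```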


On the other hand, if 0 ≤ γ ≤ 1/2, then, on A(1/2),

\[
\begin{aligned}
K_{0,\gamma}(h) = \bigg\|T^{\gamma}h\Big(\frac{1}{n}\Phi^\top\Phi\Big)\bigg\| &\le \bigg\|\begin{pmatrix} T_0^{\gamma} & 0\\ 0 & 0\end{pmatrix}h\Big(\frac{1}{n}\Phi^\top\Phi\Big)\bigg\| + t_{J+1}^{2\gamma}K_{0,0}(h) \\
&\le \bigg\|T_0^{\gamma}\Big(\frac{1}{n}\Phi_0^\top\Phi_0\Big)^{-1}\bigg\|\,\bigg\|\begin{pmatrix} \frac{1}{n}\Phi_0^\top\Phi_0 & 0\\ 0 & 0\end{pmatrix}h\Big(\frac{1}{n}\Phi^\top\Phi\Big)\bigg\| + t_{J+1}^{2\gamma}K_{0,0}(h) \\
&\le 2t_J^{2(\gamma-1)}\Big\{K_{0,0}(Sh) + (\delta_1 + t_{J+1}^2)K_{0,0}(h)\Big\} + t_{J+1}^{2\gamma}K_{0,0}(h) \\
&\le 2\lambda^{\gamma-1}K_{0,0}(Sh) + 2\lambda^{\gamma-1}\delta_1 K_{0,0}(h) + 3\lambda^{\gamma}K_{0,0}(h).
\end{aligned}
\]

Again taking $h = S^{m}h_\lambda$ and $h = S^{m+1}h_\lambda$, if $0 \le \gamma \le 1/2$ and we are on $A(1/2)$, then

\[
K_{0,\gamma}(S^{m}h_\lambda) \le 2\delta_1\lambda^{\mu/2-1} + 5\lambda^{\mu/2}, \qquad K_{0,\gamma}(S^{m+1}h_\lambda) \le 2\delta_1\lambda^{\mu/2} + 5\lambda^{\mu/2+1}.
\]

Taking expectations, Lemmas 4–5 imply that if $0 \le \gamma \le 1/2$, then

\[
\begin{aligned}
E\{K_{0,\gamma}(S^{m}h_\lambda)^{q}\} &= O\bigg[\bigg\{\frac{\operatorname{tr}(T_1)}{n}\bigg\}^{q/2}\lambda^{q(\mu/2-1)} + \frac{1}{n^{q}}\lambda^{q(\mu/2-1)} + \lambda^{q\mu/2}\bigg] = O(\lambda^{q\mu/2}), &\qquad (51)\\
E\{K_{0,\gamma}(S^{m+1}h_\lambda)^{q}\} &= O\bigg[\bigg\{\frac{\operatorname{tr}(T_1)}{n}\bigg\}^{q/2}\lambda^{q\mu/2} + \frac{1}{n^{q}}\lambda^{q\mu/2} + \lambda^{q(\mu/2+1)}\bigg] = O\{\lambda^{q(\mu/2+1)}\}. &\qquad (52)
\end{aligned}
\]

Thus, combining (49)–(52), we see that

\[
\begin{aligned}
E\{K_{0,\gamma}(S^{m}h_\lambda)^{q}\} &= O(\lambda^{q\mu/2}), &\qquad (53)\\
E\{K_{0,\gamma}(S^{m+1}h_\lambda)^{q}\} &= O\{\lambda^{q(\mu/2+1)}\} &\qquad (54)
\end{aligned}
\]

for all $0 \le \gamma < 1$. Finally, using (53)–(54) in (48), along with Lemma 4, yields

\[
\begin{aligned}
B_\rho(g_\lambda) &= O\bigg[E\bigg\{\frac{1}{\lambda}K_{0,\gamma}(S^{m+1}h_\lambda)^2 + \Big(\frac{\delta_1^2}{\lambda} + \lambda\Big)K_{0,\gamma}(S^{m}h_\lambda)^2\bigg\}\|f^\dagger\|_{\mathcal{H}^\mu}^2\bigg] + o\big(\|f^\dagger\|_{\mathcal{H}}^2\lambda^{\mu+1}\big) \\
&= O\big(\lambda^{\mu+1}\|f^\dagger\|_{\mathcal{H}}^2\big). \qquad (55)
\end{aligned}
\]

To complete the proof, we must bound $V_\rho(g_\lambda)$, but this is a direct application of Proposition 3: one easily checks that

\[
V_\rho(g_\lambda) = O(\lambda^{\mu+1}\sigma^2). \qquad (56)
\]


The proposition follows by combining (55)–(56) with Proposition 1. □

Proof of Proposition 5. Let $\hat t_1^2 \ge \hat t_2^2 \ge \cdots \ge 0$ denote the eigenvalues of $n^{-1}\Phi^\top\Phi$ and define $\hat T = \operatorname{diag}(\hat t_1^2, \hat t_2^2, \ldots)$. Let $\hat U$ be an orthogonal transformation satisfying $n^{-1}\Phi^\top\Phi = \hat U\hat T\hat U^\top$. Additionally, let $h_\lambda(t) = I\{t \le \lambda\}$, define $\hat J = \hat J_\lambda = \sup\{j;\ \hat t_j^2 > \lambda\}$, and write $\hat U = (\hat U_{\hat J}\ \ \hat U_{\hat J^c})$, where $\hat U_{\hat J}$ is the $\infty \times \hat J$ matrix comprised of the first $\hat J$ columns of $\hat U$ and $\hat U_{\hat J^c}$ consists of the remaining columns of $\hat U$. Finally, define $\hat T_1 = \operatorname{diag}(\hat t_{\hat J+1}^2, \hat t_{\hat J+2}^2, \ldots)$ and define the $\infty \times J$ matrix $U_J = (I_J\ \ 0)^\top$.
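Before bounding $I_B$, here is a small self-contained sketch (ours; a finite-dimensional surrogate with illustrative names and constants, not code from the paper) of the estimator that Proposition 5 analyzes: spectral cutoff, i.e. kernel principal component regression, whose filter $g_\lambda(t) = t^{-1}\mathbf{1}\{t > \lambda\}$ corresponds, via the standard relation $h_\lambda(t) = 1 - t\,g_\lambda(t)$, to the residual $h_\lambda(t) = \mathbf{1}\{t \le \lambda\}$ defined above, and which keeps the $\hat J$ empirical eigendirections of $n^{-1}\Phi^\top\Phi$ with eigenvalue above $\lambda$.

```python
import numpy as np

def pcr_fit(Phi, y, lam):
    """Spectral cutoff / PCR: beta_hat = g_lam(Phi'Phi/n) Phi'y/n with g_lam(t) = (1/t) 1{t > lam}."""
    n = Phi.shape[0]
    A = Phi.T @ Phi / n                        # empirical second-moment matrix
    t2_hat, U_hat = np.linalg.eigh(A)          # eigenvalues (ascending) and eigenvectors
    keep = t2_hat > lam                        # J_hat = keep.sum() retained components
    g = np.zeros_like(t2_hat)
    g[keep] = 1.0 / t2_hat[keep]               # invert the spectrum only above the cutoff
    return U_hat @ (g * (U_hat.T @ (Phi.T @ y / n)))

# Toy usage: decaying spectrum, signal concentrated on the leading components.
rng = np.random.default_rng(1)
n, p = 500, 20
t2 = 1.0 / (1 + np.arange(p)) ** 2
Phi = rng.standard_normal((n, p)) * np.sqrt(t2)
beta = np.zeros(p); beta[:3] = 1.0
y = Phi @ beta + 0.1 * rng.standard_normal(n)
beta_hat = pcr_fit(Phi, y, lam=0.5 * t2[2])    # lambda = (1 - r) t_J^2 with r = 1/2, J = 3
print(np.linalg.norm(beta_hat - beta))
```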

Following the notation from the proof of Lemma 1, consider the following bound on $I_B$:
\[
I_B = \bigg\|T^{1/2}h_\lambda\Big(\frac{1}{n}\Phi^\top\Phi\Big)\beta\bigg\|^2 = \Big\|T^{1/2}\hat U_{\hat J^c}\hat U_{\hat J^c}^\top U_J U_J^\top\beta\Big\|^2 \le \kappa^2\big\|\hat U_{\hat J^c}^\top U_J\big\|^2\,\|f^\dagger\|_{\mathcal{H}}^2.
\]
It follows that

\[
B_\rho(g_\lambda) = E(I_B) \le \kappa^2\|f^\dagger\|_{\mathcal{H}}^2\,E\Big(\big\|\hat U_{\hat J^c}^\top U_J\big\|^2;\ \hat J \ge J\Big) + \kappa^2\|f^\dagger\|_{\mathcal{H}}^2\,\Pr(\hat J < J). \qquad (57)
\]

Now we bound $\|\hat U_{\hat J^c}^\top U_J\|$ on the event $\{\hat J \ge J\}$. We derive this bound from basic principles, but it is essentially the Davis–Kahan inequality (Davis and Kahan, 1970). Let $D = n^{-1}\Phi^\top\Phi - T$. Then

\[
DU_J = \frac{1}{n}\Phi^\top\Phi U_J - TU_J = \frac{1}{n}\Phi^\top\Phi U_J - U_JT_0
\]

and

\[
U_J^\top D\hat U_{\hat J^c} = U_J^\top\Big(\frac{1}{n}\Phi^\top\Phi\Big)\hat U_{\hat J^c} - T_0U_J^\top\hat U_{\hat J^c} = U_J^\top\hat U_{\hat J^c}\hat T_1 - T_0U_J^\top\hat U_{\hat J^c}.
\]

Now observe that
\[
\bigg\|\frac{1}{n}\Phi^\top\Phi - T\bigg\| \ge \big\|U_J^\top D\hat U_{\hat J^c}\big\| \ge \big\|T_0U_J^\top\hat U_{\hat J^c}\big\| - \big\|U_J^\top\hat U_{\hat J^c}\hat T_1\big\| \ge t_J^2\big\|U_J^\top\hat U_{\hat J^c}\big\| - (1-r)t_J^2\big\|U_J^\top\hat U_{\hat J^c}\big\| = rt_J^2\big\|U_J^\top\hat U_{\hat J^c}\big\|.
\]

Thus,
\[
\big\|U_J^\top\hat U_{\hat J^c}\big\| \le \frac{1}{rt_J^2}\bigg\|\frac{1}{n}\Phi^\top\Phi - T\bigg\| \qquad (58)
\]
on $\{\hat J \ge J\}$.
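The two perturbation facts at work here, the bound (58) just derived and the Weyl-inequality step used immediately below, can be spot-checked numerically in a finite-dimensional surrogate. The following sketch is ours (the dimensions, the spectrum, and $r$ are illustrative; $T$ is taken diagonal so that $U_J$ consists of the first $J$ coordinate vectors):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, J, r = 15, 2000, 4, 0.5
t2 = 1.0 / (1 + np.arange(p)) ** 2                 # population eigenvalues t_1^2 >= t_2^2 >= ...
T = np.diag(t2)
lam = (1 - r) * t2[J - 1]                          # lambda = (1 - r) t_J^2

Phi = rng.standard_normal((n, p)) * np.sqrt(t2)    # rows ~ N(0, T)
A = Phi.T @ Phi / n
delta = np.linalg.norm(A - T, 2)                   # delta = ||Phi'Phi/n - T||

t2_hat, U_hat = np.linalg.eigh(A)
t2_hat, U_hat = t2_hat[::-1], U_hat[:, ::-1]       # sort eigenpairs in decreasing order

# Weyl's inequality: every empirical eigenvalue is within delta of its population counterpart.
assert np.max(np.abs(t2_hat - t2)) <= delta + 1e-12

# Davis-Kahan-type bound (58): ||U_J' U_hat_Jc|| <= delta / (r t_J^2).
J_hat = int(np.sum(t2_hat > lam))
if J_hat >= J:
    U_J = np.eye(p)[:, :J]                         # population eigenvectors: first J coordinates
    U_hat_Jc = U_hat[:, J_hat:]                    # empirical eigenvectors with eigenvalue <= lambda
    lhs = np.linalg.norm(U_J.T @ U_hat_Jc, 2)
    print(f"||U_J' U_hat_Jc|| = {lhs:.4f}  <=  delta / (r t_J^2) = {delta / (r * t2[J - 1]):.4f}")
```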


Next we bound
\[
\Pr(\hat J < J) = \Pr(\hat t_J^2 \le \lambda) = \Pr\big\{\hat t_J^2 \le (1-r)t_J^2\big\}.
\]

By Weyl’s inequality,
\[
|\hat t_J^2 - t_J^2| \le \delta = \bigg\|\frac{1}{n}\Phi^\top\Phi - T\bigg\|.
\]

Additionally, by Lemma 3,
\[
\Pr(\delta \ge rt_J^2) \le 4\exp\bigg(-\frac{nr^2t_J^4}{2\kappa^4 + 2\kappa^2rt_J^2/3}\bigg).
\]
Thus,

\[
\Pr(\hat J < J) = \Pr\big\{\hat t_J^2 \le (1-r)t_J^2\big\} \le \Pr(\delta \ge rt_J^2) \le 4\exp\bigg(-\frac{nr^2t_J^4}{2\kappa^4 + 2\kappa^2rt_J^2/3}\bigg). \qquad (59)
\]

Combining (57)–(59) and using Lemma 4 gives
\[
\begin{aligned}
B_\rho(g_\lambda) &\le \frac{\kappa^2}{r^2t_J^4}\|f^\dagger\|_{\mathcal{H}}^2\,E(\delta^2) + 4\kappa^2\|f^\dagger\|_{\mathcal{H}}^2\exp\bigg(-\frac{nr^2t_J^4}{2\kappa^4 + 2\kappa^2rt_J^2/3}\bigg) \\
&\le \frac{\kappa^2}{r^2t_J^4}\|f^\dagger\|_{\mathcal{H}}^2\bigg(\frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2}\bigg) + 4\kappa^2\|f^\dagger\|_{\mathcal{H}}^2\exp\bigg(-\frac{nr^2t_J^4}{2\kappa^4 + 2\kappa^2rt_J^2/3}\bigg).
\end{aligned}
\]

Next, we combine this bound on $B_\rho(g_\lambda)$ with Proposition 3 to obtain
\[
\begin{aligned}
R_\rho(\hat f_\lambda) &= B_\rho(g_\lambda) + V_\rho(g_\lambda) \\
&\le \frac{\kappa^2}{r^2t_J^4}\|f^\dagger\|_{\mathcal{H}}^2\bigg(\frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2}\bigg) + 4\kappa^2\|f^\dagger\|_{\mathcal{H}}^2\exp\bigg(-\frac{nr^2t_J^4}{2\kappa^4 + 2\kappa^2rt_J^2/3}\bigg) \\
&\qquad + \bigg[2J + \frac{2Jt_{J+1}^2}{\lambda} + \frac{1}{\lambda}\operatorname{tr}(T_1) + \frac{32J}{\lambda}\bigg\{\frac{\kappa^2}{n}\operatorname{tr}(T_1)\bigg\}^{1/2} + \frac{12\kappa^2J}{\lambda n} + \frac{\kappa^2J}{\lambda}\Big(\frac{2}{e}\Big)^{t_J^2n/(2\kappa^2)}\bigg]\frac{\sigma^2}{n} \\
&= O\bigg\{\big(\|f^\dagger\|_{\mathcal{H}}^2 + \sigma^2\big)\frac{J}{n}\bigg\},
\end{aligned}
\]

as was to be shown. □
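To close the section, a rough Monte Carlo sketch (ours; the data-generating model, dimensions, and all constants are illustrative, and the right-hand column only indicates the order of the bound, not a sharp constant) of the $O\{(\|f^\dagger\|_{\mathcal{H}}^2 + \sigma^2)J/n\}$ behaviour just established for spectral cutoff with $\lambda = (1-r)t_J^2$: at fixed $J$ the simulated risk should decay roughly like $1/n$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, J, r, sigma = 30, 5, 0.5, 0.5
t2 = 1.0 / (1 + np.arange(p)) ** 2          # population spectrum
beta = np.zeros(p); beta[:J] = 1.0          # signal supported on the first J components
lam = (1 - r) * t2[J - 1]                   # lambda = (1 - r) t_J^2

def pcr_risk(n, reps=200):
    """Average prediction risk of spectral-cutoff PCR over `reps` simulated data sets."""
    risks = []
    for _ in range(reps):
        Phi = rng.standard_normal((n, p)) * np.sqrt(t2)
        y = Phi @ beta + sigma * rng.standard_normal(n)
        A = Phi.T @ Phi / n
        w, U = np.linalg.eigh(A)
        g = np.where(w > lam, 1.0 / np.maximum(w, lam), 0.0)   # g_lambda(t) = (1/t) 1{t > lambda}
        beta_hat = U @ (g * (U.T @ (Phi.T @ y / n)))
        risks.append(np.sum(t2 * (beta_hat - beta) ** 2))       # E_x (x'(beta_hat - beta))^2
    return float(np.mean(risks))

for n in (200, 400, 800, 1600):
    bound_scale = (np.sum(beta ** 2) + sigma ** 2) * J / n
    print(f"n = {n}: simulated risk = {pcr_risk(n):.4f}   (J/n scale: {bound_scale:.4f})")
```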

References

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404.

Bach, F. (2013). Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory.

Bauer, F., Pereverzev, S. and Rosasco, L. (2007). On regularization algorithms in learning theory. J. Complexity 23 52–72.

Belitser, E. and Levit, B. (1995). On minimax filtering over ellipsoids. Math. Methods Statist. 4 259–273.

Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7 331–368.

Caponnetto, A. and Yao, Y. (2010). Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl. 8 161–183.

Carmeli, C., De Vito, E. and Toigo, A. (2006). Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Anal. Appl. 4 377–408.

Conway, J. (1999). A Course in Operator Theory. American Mathematical Society.

Davis, C. and Kahan, W. (1970). The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7 1–46.

Dhillon, P., Foster, D., Kakade, S. and Ungar, L. (2013). A risk comparison of ordinary least squares vs ridge regression. J. Mach. Learn. Res. 14 1505–1511.

Dicker, L. (2015). Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli. To appear.

Drineas, P. and Mahoney, M. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res. 6 2153–2175.

Engl, H., Hanke, M. and Neubauer, A. (1996). Regularization of Inverse Problems. Mathematics and Its Applications, Vol. 375, Springer.

Fasshauer, G. and McCourt, M. (2012). Stable evaluation of Gaussian radial basis function interpolants. SIAM J. Sci. Comput. 34 A737–A762.

Gittens, A. and Mahoney, M. (2013). Revisiting the Nyström method for improved large-scale machine learning. In International Conference on Machine Learning.

Golubev, Y., Levit, B. and Tsybakov, A. (1996). Asymptotically efficient estimation of analytic functions in Gaussian noise. Bernoulli 2 167–181.

Guo, Y., Bartlett, P., Shawe-Taylor, J. and Williamson, R. (2002). Covering numbers for support vector machines. IEEE Trans. Inform. Theory 48 239–250.

Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.

Hsu, D., Kakade, S. and Zhang, T. (2014). Random design analysis of ridge regression. Found. Comput. Math. 14 569–600.

Ibragimov, I. and Khas’minskii, R. (1983). Estimation of distribution density belonging to a class of entire functions. Theor. Probab. Appl. 27 551–562.

Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34 2593–2656.

Lo Gerfo, L., Rosasco, L., Odone, F., De Vito, E. and Verri, A. (2008). Spectral algorithms for supervised learning. Neural Comput. 20 1873–1897.

Lopez-Paz, D., Sra, S., Smola, A., Ghahramani, Z. and Schölkopf, B. (2014). Randomized nonlinear component analysis. In International Conference on Machine Learning.

Mathé, P. (2005). Saturation of regularization methods for linear ill-posed problems in Hilbert spaces. SIAM J. Numer. Anal. 42 968–973.

Mathé, P. and Pereverzev, S. (2002). Moduli of continuity for operator valued functions. Numer. Funct. Anal. Optim. 23 623–631.

Minsker, S. (2011). On some extensions of Bernstein’s inequality for self-adjoint operators. ArXiv e-prints.

Nemirovskii, A. (1985). Nonparametric estimation of smooth regression functions. Sov. J. Comput. Syst. Sci. 23 1–11.

Nemirovskii, A., Polyak, B. and Tsybakov, A. (1983). Estimators of maximum likelihood type for nonparametric regression. Dokl. Math. 28 788–792.

Neubauer, A. (1997). On converse and saturation results for Tikhonov regularization of linear ill-posed problems. SIAM J. Numer. Anal. 34 517–527.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.

Rosasco, L., De Vito, E. and Verri, A. (2005). Spectral methods for regularization in learning theory. Tech. Rep. DISI-TR-05-18, Università degli Studi di Genova, Italy.

Steinwart, I., Hush, D. and Scovel, C. (2009). Optimal rates for regularized least squares regression. In Conference on Learning Theory.

Tropp, J. (2015). An Introduction to Matrix Concentration Inequalities. ArXiv e-prints.

Tsybakov, A. (2004). Introduction à l’estimation non-paramétrique. Springer.

Wasserman, L. (2006). All of Nonparametric Statistics. Springer.

Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems.

Yang, Y., Pilanci, M. and Wainwright, M. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. ArXiv e-prints.

Zhang, T. (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Comput. 17 2077–2098.

Zhang, Y., Duchi, J. and Wainwright, M. (2013). Divide and conquer kernel ridge regression. In Conference on Learning Theory.