Kernel methods and regularization techniques for nonparametric regression: Minimax optimality and adaptation
Lee H. Dicker
Department of Statistics and Biostatistics, Rutgers University, Piscataway, NJ 08854, USA
e-mail: [email protected]

Dean P. Foster
Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
e-mail: [email protected]

Daniel Hsu
Department of Computer Science, Columbia University, New York, NY 10027, USA
e-mail: [email protected]
Abstract: Regularization is an essential element of virtually all kernel methods for nonparametric regression problems. A critical factor in the effectiveness of a given kernel method is the type of regularization that is employed. This article compares and contrasts members from a general class of regularization techniques, which notably includes ridge regression and principal component regression. We first derive risk bounds for these techniques that match the minimax rates in several settings, using recent large deviations machinery and a natural bias-variance decomposition. We then show that certain regularization techniques are more adaptable than others to favorable regularity properties that the true regression function may possess. This, in particular, demonstrates a striking difference between kernel ridge regression and kernel principal component regression.
Keywords and phrases: Learning theory, Principal component regression, Reproducing kernel Hilbert space, Ridge regression.
1. Introduction
Suppose that the observed data consists of zi = (yi, xi), i = 1, . . . , n, where yi ∈ Y ⊆ R and xi ∈ X ⊆ R^d. Suppose further that z1, . . . , zn ∼ ρ are iid from some probability distribution ρ on Y × X. Let ρ(·|x) denote the conditional distribution of yi given xi = x ∈ X and let ρX
denote the marginal distribution of xi. Our goal is to use the available data to estimate the regression function of y on x,

    f†(x) = ∫_Y y dρ(y|x),

which minimizes the mean-squared prediction error

    ∫_{Y×X} {y − f(x)}² dρ(y, x)

over ρX-measurable functions f : X → R. More specifically, for an estimator f define the risk

    Rρ(f) = E[∫_X {f†(x) − f(x)}² dρX(x)] = E(||f† − f||²_{ρX}),    (1)
where the expectation is computed over z1, . . . , zn and || · ||ρX denotes the norm on L2(ρX );
we seek estimators f which minimize Rρ(f).

This is a version of the random design nonparametric regression problem. There is a vast
literature on nonparametric regression, along with a huge variety of corresponding methods (e.g., Gyorfi et al., 2002; Wasserman, 2006). In this paper, we focus on regularization and kernel methods for estimating f† (here, we mean "kernel" as in reproducing kernel Hilbert space, rather than kernel smoothing, which is another popular approach to nonparametric regression). Most of our results apply to general regularization operators. However, our motivating examples are two well-known regularization techniques: kernel ridge regression (which we refer to as "KRR"; KRR is also known as Tikhonov regularization) and kernel principal component regression ("KPCR"; also known as spectral cut-off regularization).
Our main contribution is two-fold. First, we derive new risk bounds for a general class of kernel regularization methods, which includes both KRR and KPCR. These bounds imply that the corresponding regularization methods achieve the minimax rate for estimating f† with respect to a variety of kernels and settings; this is illustrated by example in Corollaries 1–4 of Section 6.1. In each example, the minimax rate is determined by the eigenvalues of the kernel operator, computed with respect to ρX. Second, we show precisely how some types of regularization operators are able to adapt to the regularity of f†, while others are known to have more limitations in this regard and suffer from the saturation effect (Bauer et al., 2007; Mathe, 2005; Neubauer, 1997). More specifically, we show that if f† has an especially simple expansion in the eigenbasis of the kernel operator, then some regularization estimators are able to leverage this to obtain even faster (minimax) convergence rates. As a consequence of our adaptation results, we conclude that KPCR is fully minimax adaptive in a range of settings, while KRR is known to saturate. This illustrates a striking advantage that KPCR may have over KRR in these settings.
Related Work
Kernel ridge regression and other regularization methods have been widely studied. Indeed, it is well-known that KRR is minimax in many of the settings we consider in this paper, such as those described in Corollaries 1–4 below (Caponnetto and De Vito, 2007; Zhang, 2005; Zhang et al., 2013). However, our risk bounds and corresponding minimax results apply to more general regularization operators (not just KRR/Tikhonov regularization), which appears to be new. Existing risk bounds for other regularization operators tend to be looser (Bauer et al., 2007) (see the comments after Theorem 1 below) or require auxiliary information (e.g., the bounds in (Caponnetto and Yao, 2010) apply to semi-supervised settings where an additional pool of unlabeled data is available).
Others have observed that KPCR may have advantages over KRR, as discussed above. In particular, it has been noted that Tikhonov regularization (KRR) saturates, while spectral cut-off regularization (KPCR) does not (Bauer et al., 2007; Lo Gerfo et al., 2008; Mathe, 2005). Our results (Propositions 4–5) extend beyond this observation to precisely quantify the advantages of unsaturated regularization operators in terms of adaptation and minimaxity. In other related work, Dhillon et al. (2013) have illustrated the potential advantages of KPCR over KRR in finite-dimensional problems with linear kernels; though their work is not framed in terms of saturation and general regularization operators, it essentially relies on similar concepts.
The main engine behind the technical results in this paper is a collection of large-deviation results for Hilbert-Schmidt operators. The required machinery is developed in Appendix A (especially Lemmas 3–5 and Corollary 5); these results build on straightforward extensions of results from (Tropp, 2015). Additionally, our Proposition 5, on finite-rank adaptation, relies on well-known eigenvalue perturbation results that have been adapted to handle Hilbert-Schmidt operators (e.g. the Davis-Kahan sin Θ theorem (Davis and Kahan, 1970)).
2. Statistical Setting and Assumptions
Our basic assumption on the distribution of z = (y, x) ∼ ρ is that the residual variance is bounded; more specifically, we assume that there exists a constant σ² > 0 such that

    ∫_Y {y − f†(x)}² dρ(y|x) ≤ σ²    (2)

for almost all x ∈ X. Zhang et al. (2013) also assume (2); this assumption is slightly weaker than the analogous assumption in (Bauer et al., 2007) (equation (1) in their paper). Note that (2) holds if y is bounded almost surely.
Let K : X × X → R be a symmetric positive-definite kernel function. We assume that K is bounded, i.e., that there exists κ² > 0 such that

    sup_{x∈X} K(x, x) ≤ κ².

Additionally, we assume that there is a countable basis of eigenfunctions {ψj}_{j=1}^∞ ⊆ L2(ρX) and a sequence of corresponding eigenvalues t1² ≥ t2² ≥ ··· ≥ 0 such that

    K(x, x̃) = Σ_{j=1}^∞ tj² ψj(x) ψj(x̃),   x, x̃ ∈ X,    (3)

and the convergence is absolute. Mercer's theorem and various generalizations give conditions under which representations like (3) are known to hold (Carmeli et al., 2006); one of the simplest examples is when X is a compact Hausdorff space, ρX is a probability measure on the Borel sets of X, and K is continuous. Observe that

    Σ_{j=1}^∞ tj² = Σ_{j=1}^∞ tj² ∫_X ψj(x)² dρX(x) = ∫_X K(x, x) dρX(x) ≤ κ²;

in particular, {tj²} ∈ ℓ1(N).

Let H ⊆ L2(ρX) be the reproducing kernel Hilbert space (RKHS) corresponding to K (Aronszajn, 1950) and let

    φj = tj ψj,   j = 1, 2, . . . .    (4)

It follows from basic facts about RKHSs that {φj}_{j=1}^∞ is an orthonormal basis for H (if tJ² > t_{J+1}² = 0, then {φj}_{j=1}^J is an orthonormal basis for H). Our main assumption on the relationship between y, x, and the kernel K is that

    f† ∈ H.    (5)
This is a regularity or smoothness assumption on f†. We represent f† in the bases {ψj}_{j=1}^∞ and {φj}_{j=1}^∞ as

    f† = Σ_{j=1}^∞ γj ψj = Σ_{j=1}^∞ βj φj,    (6)

where βj = γj/tj. Then the assumption (5) is equivalent to

    Σ_{j=1}^∞ βj² = Σ_{j=1}^∞ γj²/tj² < ∞.
Many of the results in this paper can be modified so that they apply to settings where f† ∉ H, by replacing f† with an appropriate projection of f† onto H and including an approximation error term in the corresponding bounds. This approach leads to the study of oracle inequalities (Hsu et al., 2014; Koltchinskii, 2006; Steinwart et al., 2009; Zhang, 2005; Zhang et al., 2013), which we do not pursue in detail here. However, investigating oracle inequalities for general regularization operators may be of interest for future research, as most existing work focuses on ridge regularization.
3. Regularization
As discussed in Section 1, our goal is to find estimators f that minimize the risk (1). In this paper, we focus on regularization-based estimators for f†. In order to precisely describe these estimators, we require some additional notation for various operators that will be of interest, and some basic definitions from regularization theory.
3.1. Finite-Rank Operators of Interest
For x ∈ X , define Kx ∈ H by Kx(x) = K(x, x), x ∈ X , and let X = (x1, . . . ,xn)> ∈ Rn×p
and y = (y1, . . . , yn)> ∈ Rn. Additionally, define the finite-rank linear operators SX : H → Rn
and TX : H → H (both depending on X) by
SXφ = (〈φ,Kx1〉H, . . . , 〈φ,Kxn〉H)> = (φ(x1), . . . , φ(xn))>,
TXφ =1
n
n∑i=1
〈φ,Kxi〉HKxi =1
n
n∑i=1
φ(xi)Kxi ,
where 〈·, ·〉H is the inner-product on H and φ ∈ H. Let 〈·, ·〉Rn denote the normalized inner-product on Rn, defined by 〈v, v〉Rn = n−1v>v for v = (v1, . . . , vn)>, v = (v1, . . . , vn)> ∈ Rn.Then the adjoint of SX with respect to 〈·, ·〉H and 〈·, ·〉Rn , S∗X : Rn → H, is given by
S∗Xv =1
n
n∑i=1
viKxi .
Additionally, TX = S∗XSX . Finally, observe that SXS∗X : Rn → Rn is given by the n×n matrix
SXS∗X = n−1K, where K = (K(xi,xj))1≤i,j≤n; K is the kernel matrix, which is ubiquitous in
kernel methods and enables finite computation.
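Since all finite computations route through the kernel matrix, the objects above are easy to realize numerically. The following sketch (assuming a Gaussian kernel, one choice of bounded kernel with κ² = 1; the function name is ours, not from the paper) forms K and checks the properties used throughout: symmetry, positive semi-definiteness of SX S*X = n⁻¹K, and the bound K(x, x) ≤ κ².

```python
import numpy as np

def gaussian_kernel_matrix(X, alpha=1.0):
    """K[i, j] = exp(-alpha**2 * ||x_i - x_j||**2); bounded, with kappa**2 = 1."""
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared pairwise distances
    return np.exp(-alpha**2 * np.maximum(D, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = gaussian_kernel_matrix(X)

# S_X S_X^* is the n x n matrix n^{-1} K: symmetric, positive semi-definite,
# with diagonal entries K(x_i, x_i) <= kappa**2 = 1.
evals = np.linalg.eigvalsh(K / len(X))
assert np.allclose(K, K.T)
assert evals.min() > -1e-10
assert np.allclose(np.diag(K), 1.0)
```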
3.2. Basic Definitions
A family of functions gλ : [0, ∞) → [0, ∞) indexed by λ > 0 is called a regularization family if it satisfies the following three conditions:

(R1) sup_{0<t≤κ²} |t gλ(t)| ≤ 1.
(R2) sup_{0<t≤κ²} |1 − t gλ(t)| ≤ 1.
(R3) sup_{0<t≤κ²} |gλ(t)| ≤ λ⁻¹.

(This definition follows (Bauer et al., 2007; Engl et al., 1996), but is slightly more restrictive.) The main idea behind a regularization family is that it "looks" similar to g(t) = 1/t, but is better-behaved near t = 0, i.e., it is bounded by λ⁻¹. An important quantity that is related to the adaptability and saturation point of a regularization family (mentioned in Section 1) is the qualification of the regularization. The qualification of the regularization family {gλ}_{λ>0} is defined to be the maximal ξ ≥ 0 such that

    sup_{0<t≤κ²} |1 − t gλ(t)| t^ξ ≤ λ^ξ.    (7)
The two primary examples of regularization considered in this paper are ridge (Tikhonov) regularization, where

    gλ(t) = rλ(t) = 1/(t + λ),

and principal component (spectral cut-off) regularization, where

    gλ(t) = sλ(t) = (1/t) I{t ≥ λ}.    (8)

Observe that ridge regularization has qualification 1 and principal component regularization has qualification ∞. Another example of a regularization family is the Landweber iteration, which can be viewed as a special case of gradient descent (see, e.g., Bauer et al., 2007; Lo Gerfo et al., 2008; Rosasco et al., 2005). Throughout the paper, all regularization families are assumed to have qualification at least 1.
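These definitions are easy to probe numerically. The sketch below (an illustration under the assumed values κ = 1 and λ = 0.01, not part of the formal development) checks (R1)–(R3) on a grid for both families and tests the qualification inequality (7): ridge passes at ξ = 1 but fails at ξ = 2, while spectral cut-off passes for every ξ tried.

```python
import numpy as np

r = lambda lam, t: 1.0 / (t + lam)                   # ridge (Tikhonov)
s = lambda lam, t: np.where(t >= lam, 1.0, 0.0) / t  # spectral cut-off: (1/t) I{t >= lam}

kappa2, lam = 1.0, 1e-2
t = np.linspace(1e-6, kappa2, 100_000)

for g in (r, s):
    gt = g(lam, t)
    assert np.max(t * gt) <= 1.0 + 1e-12          # (R1)
    assert np.max(np.abs(1 - t * gt)) <= 1.0      # (R2)
    assert np.max(gt) <= 1.0 / lam + 1e-12        # (R3)

def qual_ok(g, xi):
    """Check the qualification inequality (7) for exponent xi on the grid."""
    gt = g(lam, t)
    return np.max(np.abs(1 - t * gt) * t**xi) <= lam**xi + 1e-12

assert qual_ok(r, 1) and not qual_ok(r, 2)          # ridge saturates at qualification 1
assert all(qual_ok(s, xi) for xi in (1, 2, 5, 10))  # cut-off: qualification is infinite
```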
3.3. Estimators
Given a regularization family {gλ}_{λ>0}, we define the gλ-regularized estimators for f†,

    fλ = gλ(TX) S*X y.    (9)
Here, gλ acts on the spectrum (eigenvalues) of the finite-rank operator TX. The dependence of fλ on the regularization family is implicit; our results hold for any regularization family except where explicitly stated (in particular, in Section 6.2). The estimators fλ are the main focus of this paper.
To provide some intuition behind fλ, define the positive self-adjoint operator T : H → H by

    Tφ = ∫_X ⟨φ, Kx⟩_H Kx dρX(x) = Σ_{j=1}^∞ tj² ⟨φ, φj⟩_H φj,   φ ∈ H.

Observe that T is a "population" version of the operator TX. Unlike TX, T often has infinite rank; however, we still might expect that

    T ≈ TX    (10)

for large n (where the approximation holds in some suitable sense).

We also have a large-n approximation for S*X y. For φ ∈ H,
    ⟨φ, S*X y⟩_H = (1/n) Σ_{i=1}^n yi ⟨φ, Kxi⟩_H ≈ ∫_{Y×X} y φ(x) dρ(y, x) = ∫_X f†(x) φ(x) dρX(x) = ⟨φ, f†⟩_{L2(ρX)} = ⟨φ, T f†⟩_H,

where ⟨·, ·⟩_{L2(ρX)} denotes the inner-product on L2(ρX) and we have used (4) to obtain the last equality. It follows that S*X y ≈ T f†. Hence, to recover f† from y, it would be natural to apply the inverse of T to S*X y. However, T is not invertible whenever it has infinite rank, and regularization becomes necessary. We thus arrive at the chain of approximations which helps motivate fλ:

    fλ = gλ(TX) S*X y ≈ gλ(T) T f† ≈ f†,
where gλ(T ) may be viewed as an approximate inverse for a suitably chosen regularizationparameter λ. The rest of the paper is devoted to deriving bounds on the risk
Rρ(fλ) = E(||f † − fλ||2ρX
)= E
{||f † − gλ(TX)S∗Xy||2ρX
}(11)
and investigating the effect of specific regularization families {gλ}λ>0 on the accuracy of fλ.
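In finite terms, the operator identity gλ(TX) S*X = S*X gλ(SX S*X) implies that the fitted values at the sample points are SX fλ = n⁻¹ K gλ(n⁻¹K) y, computable from the eigendecomposition of n⁻¹K. A minimal sketch (the data-generating choices below are illustrative assumptions, not from the paper); for ridge, the spectral formula is cross-checked against the familiar closed form K(K + nλI)⁻¹y.

```python
import numpy as np

def fit_values(K, y, g, lam):
    """In-sample fitted values S_X f_lambda = (1/n) K g_lambda((1/n)K) y,
    where g_lambda acts on the spectrum of n^{-1} K."""
    n = len(y)
    w, U = np.linalg.eigh(K / n)            # n^{-1} K = U diag(w) U^T
    return U @ (w * g(lam, w) * (U.T @ y))

def ridge(lam, t):
    return 1.0 / (t + lam)

def pcr(lam, t):
    out = np.zeros_like(t)
    keep = t >= lam
    out[keep] = 1.0 / t[keep]
    return out

rng = np.random.default_rng(1)
x = rng.uniform(size=(40, 1))
K = np.exp(-(x - x.T) ** 2)                 # Gaussian kernel matrix on the sample
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
# Ridge via the spectral formula agrees with the closed form K (K + n*lam*I)^{-1} y.
f_ridge = fit_values(K, y, ridge, lam)
closed = K @ np.linalg.solve(K + len(y) * lam * np.eye(len(y)), y)
assert np.allclose(f_ridge, closed)

# KPCR keeps only the eigendirections of n^{-1} K with eigenvalue >= lam.
f_pcr = fit_values(K, y, pcr, lam)
assert f_pcr.shape == y.shape
```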
4. Preliminary Simplifications
The main results in this paper involve bounding (11). This section is devoted to deriving some preliminary simplifications of (11).
By (4), the L2(ρX)-norm in (11) can be converted into a norm on H; indeed, we have Rρ(fλ) = E{||T^{1/2}(f† − fλ)||²_H}. Next, define the residuals εi = yi − f†(xi), i = 1, . . . , n, and let ε = (ε1, . . . , εn)ᵀ. Then fλ = gλ(TX) TX f† + gλ(TX) S*X ε, and it follows that

    Rρ(fλ) = E[||T^{1/2}{I − gλ(TX) TX} f† − T^{1/2} gλ(TX) S*X ε||²_H]
           = E[||T^{1/2}{I − gλ(TX) TX} f†||²_H] + E{||T^{1/2} gλ(TX) S*X ε||²_H}.    (12)
This is a bias/variance decomposition of the risk Rρ(fλ); the first term in (12) represents the bias of fλ and the second term represents the variance.

In order to further simplify (12), we first note that the Hilbert space H is isometric to ℓ2(N) via the isometry ι : H → ℓ2(N), given by

    ι : Σ_{j=1}^∞ αj φj ↦ (α1, α2, . . . )ᵀ    (13)

(if H is finite-dimensional and tJ² > t_{J+1}² = 0, then take 0 = α_{J+1} = α_{J+2} = ···; we take all elements of ℓ2(N) to be infinite-dimensional column vectors). Using this equivalence, we can convert elements of H and linear operators on H appearing in (12) into (infinite-dimensional) vectors and matrices, respectively, which we find simpler to analyze in the sequel.
Before proceeding, we introduce some more notation. Define the infinite-dimensional diagonal matrix T = diag(t1², t2², . . . ) and let β = (β1, β2, . . . )ᵀ ∈ ℓ2(N). Next define the n × ∞ matrices Ψ = (ψj(xi))_{1≤i≤n; 1≤j<∞} and Φ = Ψ T^{1/2} = (φj(xi))_{1≤i≤n; 1≤j<∞}. Observe that

    SX = Φ ∘ ι,    (14)
    S*X = ι⁻¹ ∘ ((1/n) Φᵀ),    (15)
    T = ι⁻¹ ∘ T ∘ ι    (16)

(in (16), the T on the left is the operator on H and the T on the right is the diagonal matrix just defined). Finally, let ||·|| = ||·||_{ℓ2(N)} denote the norm on ℓ2(N).

Combining (12)–(16) yields the next proposition, which is the starting point for obtaining risk bounds in the sections that follow.
Proposition 1. Suppose that fλ is the estimator defined in (9). Then

    Rρ(fλ) = Bρ(gλ) + Vρ(gλ),

where

    Bρ(gλ) = B_ρ^{(n)}(gλ) = E[ || T^{1/2} {I − gλ((1/n)ΦᵀΦ)(1/n)ΦᵀΦ} β ||² ],
    Vρ(gλ) = V_ρ^{(n)}(gλ) = (1/n²) E{ || T^{1/2} gλ((1/n)ΦᵀΦ) Φᵀε ||² }.

Proposition 1 is a further simplified version of the bias/variance decomposition (12); Bρ(gλ) and Vρ(gλ) are the bias and variance terms, respectively.
5. Main Result
In this section, we give separate bounds on the bias and variance terms from Proposition 1, which are then combined to yield our main result bounding Rρ(fλ). Three major sources contribute to our upper bound on the bias in Proposition 2 below (and similarly to our bound on the variance in Proposition 3): two types of approximation error, and what we refer to as the "intrinsic bias" (or "intrinsic variance").

The two types of approximation error correspond to finite- and infinite-dimensional components, which arise from the approximation (10). The intrinsic bias and variance are determined by the regularization family {gλ}_{λ>0} and the specific regularization parameter λ used to compute fλ. A schematic for our risk bound decomposition may be found in Figure 1. Note, however, that while the bias/variance decomposition of Rρ(fλ) is an unambiguous additive identity, attribution of the various sources of error in Propositions 2–3 is somewhat more subtle. Still, we view this as a useful heuristic.
[Fig 1. Schematic for decomposition of the risk bound: Rρ(fλ) splits into the bias term Bρ(gλ) and the variance term Vρ(gλ); each is attributed to approximation error (finite-dimensional and infinite-dimensional) and to an intrinsic bias or intrinsic variance component.]
In vector-matrix notation (i.e., working in ℓ2(N) under the isometry (13)), the approximation (10) can be rewritten as

    (1/n) ΦᵀΦ ≈ T.    (17)
The approximation error in (17) is essentially the source of the approximation error terms in the schematic, Figure 1. Controlling this error is a key piece of our strategy for proving Propositions 2–3; it relies on large deviations results for random matrices and Hilbert-Schmidt operators, which have been developed in (Minsker, 2011; Tropp, 2015). The necessary technical lemmas for our bounds are derived in Appendix A.

Before proceeding, we aim to provide more intuition behind the finite-dimensional and infinite-dimensional approximation error cited in Figure 1. Note that the most direct approach to bounding the overall approximation error in (17) is to bound

    δ = || (1/n)ΦᵀΦ − T ||,    (18)

where ||·|| denotes the operator norm (this is straightforward using the results of Minsker (2011) and Tropp (2015)). However, in order to obtain improved bounds on Bρ(gλ) and Vρ(gλ), we decompose n⁻¹ΦᵀΦ − T into a finite-dimensional block and an infinite-dimensional "tail," and consider these terms separately. More specifically, let J ∈ N be a fixed positive integer and define the block decompositions
    Φ = (Φ0  Φ1),   T = ( T0 , 0 ; 0 , T1 ),

where Φ0 = (φj(xi))_{1≤i≤n; 1≤j≤J}, Φ1 = (φj(xi))_{1≤i≤n; J+1≤j<∞}, T0 = diag(t1², . . . , tJ²), and T1 = diag(t_{J+1}², t_{J+2}², . . . ). (Here and below, ( A , B ; C , D ) denotes the 2 × 2 block matrix with rows (A  B) and (C  D).) Additionally, define

    δ0 = || (1/n)Φ0ᵀΦ0 − T0 ||,   δ1 = || ( 0 , (1/n)Φ0ᵀΦ1 ; (1/n)Φ1ᵀΦ0 , (1/n)Φ1ᵀΦ1 − T1 ) ||.    (19)
Observe that

    (1/n)ΦᵀΦ − T = ( (1/n)Φ0ᵀΦ0 − T0 , 0 ; 0 , 0 ) + ( 0 , (1/n)Φ0ᵀΦ1 ; (1/n)Φ1ᵀΦ0 , (1/n)Φ1ᵀΦ1 − T1 )    (20)

and δ ≤ δ0 + δ1. The matrix n⁻¹Φ0ᵀΦ0 − T0 has dimension J × J and contributes to the finite-dimensional approximation error in (17); δ1 reflects the infinite-dimensional approximation error.
Explicit bounds on the bias and variance terms Bρ(gλ) and Vρ(gλ) are given in Propositions 2–3 below. While the connection between these bounds and δ0, δ1 may not be immediately transparent from the statement of the results, it is developed more fully in the corresponding proofs (found in Appendix B).
Proposition 2. Let Bρ(gλ) be the bias term given in Proposition 1 and suppose that {τj²}_{j=1}^∞ is a sequence of non-negative real numbers satisfying

    sup_{J∈N, x∈X} τJ² Σ_{j=1}^J ψj(x)² ≤ 1.    (21)

(Observe that τj² = tj²/κ² always satisfies this condition.) Then

    Bρ(gλ) ≤ { (12/tJ²) λ² + (408κ²/(n tJ²)) tr(T1) + 180κ⁴/(n tJ²) + 13 t_{J+1}² + κ² J (2/e)^{τJ² n/2} } ||f†||²_H.

Proposition 3. Let Vρ(gλ) be the variance term given in Proposition 1 and suppose that {τj²}_{j=1}^∞ is a sequence of non-negative real numbers satisfying (21). Then
    Vρ(gλ) ≤ { 2J + (2J/λ) t_{J+1}² + (1/λ) tr(T1) + (32J/λ) (κ² tr(T1)/n)^{1/2} + 12κ² J/(λn) + (κ² J/λ)(2/e)^{τJ² n/2} } (σ²/n).
As mentioned above, it is a fairly subtle task to attribute each term in the upper bounds of Propositions 2–3 to a certain type of approximation error or the intrinsic bias or variance. However, a simple rule of thumb is that each appearance of λ in the upper bounds is related to the intrinsic bias or variance; tJ² and J are related to the finite-dimensional approximation error; and t_{J+1}² and tr(T1) = Σ_{j>J} tj² are related to the infinite-dimensional approximation error. Propositions 1–3 are easily combined to obtain our main theorem.
Theorem 1. Suppose that fλ is the estimator defined in (9) and suppose that {τj²}_{j=1}^∞ is a sequence of non-negative real numbers satisfying (21). Then for all positive integers J ∈ N,

    Rρ(fλ) ≤ { 2J + (2J/λ) t_{J+1}² + (1/λ) tr(T1) + (32J/λ) (κ² tr(T1)/n)^{1/2} + 12κ² J/(λn) + (κ² J/λ)(2/e)^{τJ² n/2} } (σ²/n)
             + { (12/tJ²) λ² + (408κ²/(n tJ²)) tr(T1) + 180κ⁴/(n tJ²) + 13 t_{J+1}² + κ² J (2/e)^{τJ² n/2} } ||f†||²_H.

The constants in the bounds from Propositions 2–3 (and hence Theorem 1) have not been optimized and can likely be improved. However, we will see in the next section that Theorem 1 implies the risk of fλ achieves the minimax rate in a variety of important examples. Bauer et al. (2007) have previously obtained risk bounds on general regularization estimators similar to fλ. However, their bounds (e.g. Theorem 10 in (Bauer et al., 2007)) are independent of the ambient RKHS H, i.e. they do not depend on the eigenvalues {tj²}. Our bounds are tighter than those in (Bauer et al., 2007) because we take advantage of the structure of H. In contrast with our Theorem 1, Bauer et al.'s (2007) results do not give minimax bounds (not easily, at least), because minimax rates must depend on the tj² (see Corollaries 1–4 below).
6. Consequences
In this section, we use Theorem 1 to derive minimax optimal rates for all regularization families, and to demonstrate the adaptation properties of regularization families with sufficient qualification.
6.1. Minimax Optimality
For 0 < r < ∞, define

    BH(r) = { f = Σ_{j=1}^∞ αj ψj ∈ L2(ρX); Σ_{j=1}^∞ αj²/tj² ≤ r² } = { f ∈ H; ||f||²_H ≤ r² } ⊆ H.

The set BH(r) can be interpreted as an ellipsoid in L2(ρX) or a ball in H. There is an extensive literature in statistics on nonparametric regression problems and minimax rates for estimating f† over ellipsoids of the form BH(r) (e.g., Nemirovskii, 1985; Nemirovskii et al., 1983; Tsybakov, 2004). These rates are typically determined by some condition on the rate of decay of t1² ≥ t2² ≥ ··· ≥ 0. In the classical statistics literature, the basis {ψj}_{j=1}^∞ and sequence {tj}_{j=1}^∞ are given directly, while in the present setting both are derived from the interaction between the kernel K and the distribution ρX. In this section, we show that for several commonly studied decay conditions on {tj²}_{j=1}^∞ and appropriate choice of the regularization parameter λ, the estimator fλ achieves the minimax rate for estimating f† ∈ BH(r), as n → ∞. Our results are summarized in the four corollaries to Theorem 1 given below; proofs may be found in Appendix B.
Corollary 1 (Polynomial-decay kernels). Suppose there are constants C > 0 and ν > 1/2 such that 0 < tj² ≤ C j^{−2ν} for all j = 1, 2, . . . . Let λ = n^{−2ν/(2ν+1)}. Then

    Rρ(fλ) = O{ (||f†||²_H + σ²) n^{−2ν/(2ν+1)} }.
Corollary 2 (Exponential-decay kernels). Suppose there are constants C, α > 0 such that 0 < tj² ≤ C e^{−αj} for all j = 1, 2, . . . . Let λ = n⁻¹ log(n). Then

    Rρ(fλ) = O{ (||f†||²_H + σ²) log(n)/n }.

Corollary 3 (Gaussian-decay kernels). Suppose there are constants C, α > 0 such that 0 < tj² ≤ C e^{−αj²} for all j = 1, 2, . . . . Additionally, assume that

    sup_{J∈N, x∈X} C⁻¹ e^{−αJ} Σ_{j=1}^J ψj(x)² ≤ 1.    (22)

Let λ = n⁻¹ log(n)^{1/2}. Then

    Rρ(fλ) = O{ (||f†||²_H + σ²) log(n)^{1/2}/n }.

Corollary 4 (Finite-rank kernels). Suppose that 0 = t_{J+1}² = t_{J+2}² = ···. Let λ = n⁻¹. Then

    Rρ(fλ) = O{ (||f†||²_H + σ²) J/n }.
In Corollaries 1–4, the implicit constants in the big-O bound may depend on K and ρX, but not on f†, σ², or n. In fact, Theorem 1 can be used to derive explicit bounds on the risk Rρ(fλ) in Corollaries 1–4 that elucidate the dependence on the kernel; however, such results are not reported here for the sake of simplicity. The upper bounds in Corollaries 1–4 immediately yield minimax optimality results over ellipsoids BH(r) ⊆ L2(ρX) for the corresponding kernels and eigenvalue-decay conditions (Belitser and Levit, 1995; Caponnetto and De Vito, 2007; Golubev et al., 1996; Ibragimov and Khas'minskii, 1983). It is noteworthy that the polynomial-decay kernels and corresponding RKHSs in Corollary 1 correspond to Sobolev spaces, which are of fundamental importance in nonparametric statistics and applied mathematics (Tsybakov, 2004). Additionally, Corollaries 2 and 3 are particularly relevant for the popular Gaussian kernels of the form K(x, x̃) = exp(−α²||x − x̃||²), α > 0 (Fasshauer and McCourt, 2012; Guo et al., 2002). Note that the additional condition (22) in Corollary 3 is required to ensure that the finite-dimensional approximation error (discussed in Section 5) decays fast enough. Condition (22) holds if, for instance, the eigenfunctions {ψj} are uniformly bounded. If (22) does not hold, one may obtain slightly weaker optimality results for Gaussian-decay kernels where ||f† − fλ||²_{ρX} converges at the minimax rate in probability, rather than in expectation (Caponnetto and De Vito, 2007).
6.2. Adaptation
Our results in the previous sections apply to all regularization families satisfying conditions (R1)–(R3). In this section, we derive results that depend on the qualification of the regularization (7) and the specific regularization family.

A family of estimators {fλ}_{λ>0} is adaptive if it can achieve the minimax rate for estimating f† over a wide range of regions or subsets of H. The qualification of the regularization is related to how much a given regularization family can adapt to the regularity of the signal f†; generally, higher qualification corresponds to better adaptation (recall that ridge regularization has qualification 1 and principal component regularization has qualification ∞). In Section 6.2.1, we show that regularization families with sufficiently high qualification are minimax over BH0(r) ⊆ H0 for Hilbert spaces H0 ⊆ H, which consist of functions that are "even more regular" (e.g., more smooth) than those in H. In Section 6.2.2, we restrict our attention to principal component regularization and show that KPCR can effectively adapt to finite-rank signals. On the other hand, regularization families with lower qualification (e.g., ridge regularization) are known to saturate, and the corresponding estimators are sub-optimal over subsets corresponding to highly-regular signals in many instances (Dhillon et al., 2013; Dicker, 2015; Mathe, 2005).
6.2.1. Adaptation in Sobolev Spaces (Polynomial-Decay Eigenvalues)
In this section, we assume that

    tj² = j^{−2ν},   j = 1, 2, . . . ,    (23)

for some ν > 1/2. Let f = Σ_{j=1}^∞ θj ψj ∈ L2(ρX). Then f ∈ H if and only if Σ_{j=1}^∞ j^{2ν} θj² < ∞. For µ ≥ 0, define

    Hµ = { f = Σ_{j=1}^∞ θj ψj ∈ L2(ρX); Σ_{j=1}^∞ j^{2ν(µ+1)} θj² < ∞ } ⊆ H.

The space Hµ is itself a Hilbert space with norm ||f||²_{Hµ} = Σ_{j=1}^∞ j^{2ν(µ+1)} θj². Observe that H = H0 and ||·||_H = ||·||_{H0}. Corollary 1 implies that fλ is minimax over BH(r) for any regularization family {gλ}_{λ>0}. The next result, which is proved in Appendix B, implies that fλ is minimax over BHµ(r) ⊆ Hµ ⊆ H, provided the qualification of {gλ}_{λ>0} is at least ⌈µ/2 + 1⌉.

Proposition 4. Let µ > 0. Assume that (23) holds and that the qualification of {gλ}_{λ>0} is at least ⌈µ/2 + 1⌉. If f† ∈ Hµ and λ = n^{−2ν/(2ν(µ+1)+1)}, then

    Rρ(fλ) = O{ (||f†||²_{Hµ} + σ²) n^{−2ν(µ+1)/(2ν(µ+1)+1)} }.
As in Corollaries 1–4, the constants in the big-O bounds in this proposition and the next may depend on the kernel K. Since Hµ2 ⊆ Hµ1 for µ1 ≤ µ2, we see that greater adaptation is possible via Proposition 4 for regularization families with higher qualification. Observe that Proposition 4 does not apply to KRR for any µ > 0, while it applies to KPCR for all µ ≥ 0.
6.2.2. Adaptation to Finite-Dimensional Signals
In this section, we focus on KPCR and signals (6) with 0 = β_{J+1} = β_{J+2} = ···.

Proposition 5. Let fλ be the KPCR estimator with principal component regularization (8) and assume that f† = Σ_{j=1}^J βj φj ∈ H. Fix 0 < r < 1. If tJ^{−2} = O(n^{1/2}), (J tJ⁴)^{−1} = O(1), and λ = (1 − r) tJ², then

    Rρ(fλ) = O{ (||f†||²_H + σ²) J/n }.
The upper bound in Proposition 5 matches the bound in Corollary 4. However, Corollary 4 relies on the structure of the kernel (i.e., the decay of the eigenvalues {tj²}_{j=1}^∞), while Proposition 5 takes advantage of the structure of the signal f†. It follows from Proposition 5 that the KPCR estimator is minimax rate-optimal over J-dimensional signals in H for a very broad class of kernels. On the other hand, it is known that KRR may perform dramatically worse than KPCR in problems where f† lies in low-dimensional subspaces of the ambient space H, due to the saturation effect (see, for example, (Dhillon et al., 2013)).
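The contrast between the two families is already visible at the population level in the intrinsic bias ||T^{1/2}(I − gλ(T)T)β||², which reduces to a coordinatewise computation in the eigenbasis. In the sketch below (with illustrative eigenvalues and a rank-5 signal; these choices are ours, not from the paper), cut-off with λ = (1 − r)tJ² incurs no bias on the J signal coordinates, while ridge shrinks every coordinate and retains a residual bias of order λ².

```python
import numpy as np

j = np.arange(1, 501)
t2 = j ** (-2.0)                    # illustrative eigenvalues t_j^2 = j^{-2}
beta = np.zeros_like(t2)
beta[:5] = 1.0                      # finite-rank signal: beta_j = 0 for j > J = 5

def intrinsic_bias(g, lam):
    """|| T^{1/2} (I - g_lam(T) T) beta ||^2, computed coordinatewise."""
    resid = 1.0 - g(lam, t2) * t2
    return float(np.sum(t2 * resid ** 2 * beta ** 2))

def ridge(lam, t):
    return 1.0 / (t + lam)

def pcr(lam, t):
    out = np.zeros_like(t)
    keep = t >= lam
    out[keep] = 1.0 / t[keep]
    return out

lam = 0.5 * t2[4]                   # lambda = (1 - r) t_J^2 with r = 1/2, J = 5
assert intrinsic_bias(pcr, lam) < 1e-15     # cut-off: no bias on the J signal coordinates
assert intrinsic_bias(ridge, lam) > 1e-3    # ridge: residual shrinkage bias remains
```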
7. Discussion
Our unified analysis for a general class of regularization families in nonparametric regression highlights two important statistical properties. First, the results show minimax optimality for this general class in several commonly studied settings, which was previously established only for specific regularization methods. Second, the results demonstrate the adaptivity of certain regularization families to sub-ellipsoids of the RKHS, showing that these techniques may take advantage of additional smoothness properties that the signal may possess. It is notable that the most well-studied family, KRR/Tikhonov regularization, does not possess this adaptation property.
The minimax optimality results were obtained using a priori settings of the regularization parameter λ that depend on the rate of decay of the eigenvalues corresponding to the kernel K and the distribution ρX. A data-driven estimator that chooses λ based on cross-validation was proposed for a different class of regularized estimators by Caponnetto and Yao (2010) in a semi-supervised setting. Adapting a similar technique to the estimator fλ studied in our work may be needed to fully take advantage of the adaptivity of sufficiently qualified regularization families.
Kernel methods were popularized as a computational "trick" for coping with high- and even infinite-dimensional feature spaces. However, they are now often regarded as computationally prohibitive for many very large data problems on account of their standard implementations' worst-case O(n³) time and O(n²) space complexity. Approximation methods for coping with this computational complexity include Nystrom approximation (Drineas and Mahoney, 2005; Williams and Seeger, 2001), matrix sketching (Gittens and Mahoney, 2013), and random features (Lopez-Paz et al., 2014; Rahimi and Recht, 2008). Recent work has considered the effect of these approximations on the statistical behavior of KRR (Bach, 2013; Yang et al., 2015). It is natural to also consider these approximations in the context of other regularization families, especially KPCR, given its connection to low-rank approximations.
Appendix A: Lemmas Required for Results in the Main Text
As in Propositions 2–3, let J ∈ N be a fixed positive integer. Define the n × J matrix Ψ0 = (ψj(xi))_{1≤i≤n; 1≤j≤J} and let λmin(n⁻¹Ψ0ᵀΨ0) be the smallest eigenvalue of n⁻¹Ψ0ᵀΨ0. Additionally, define the event

    A(δ) = { λmin((1/n) Ψ0ᵀΨ0) ≥ δ }.
The next two lemmas are key to proving Propositions 2 and 3, respectively.
Lemma 1. On the event A(1/2),

    || T^{1/2} {I − gλ((1/n)ΦᵀΦ)(1/n)ΦᵀΦ} β ||² ≤ { (12/tJ²) λ² + (12/tJ²) δ1² + 13 t_{J+1}² } ||f†||²_H,

where δ1 is defined in (19).
Proof. Let
\[
\mathrm{IB} = \left\| T^{1/2}\left\{I - g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi\right\}\beta \right\|^2.
\]
Then
\[
\begin{aligned}
\mathrm{IB} &= \left\| \begin{pmatrix} T_0^{1/2} & 0 \\ 0 & 0 \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 + \left\| \begin{pmatrix} 0 & 0 \\ 0 & T_1^{1/2} \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 \\
&\le \mathrm{IB}_0 + t_{J+1}^2\left\| g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I \right\|^2 \|f^{\dagger}\|_{\mathcal{H}}^2 \\
&\le \mathrm{IB}_0 + t_{J+1}^2\,\|f^{\dagger}\|_{\mathcal{H}}^2, \qquad (24)
\end{aligned}
\]
where the second-to-last inequality uses the fact that $\|\beta\|^2 = \|f^{\dagger}\|_{\mathcal{H}}^2$, the last bound follows from (R2), and
\[
\mathrm{IB}_0 = \left\| \begin{pmatrix} T_0^{1/2} & 0 \\ 0 & 0 \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2.
\]
Since we are on the event $A(1/2)$,
\[
\begin{aligned}
\mathrm{IB}_0 &= \left\| \begin{pmatrix} T_0^{1/2}\left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1} & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0 \\ 0 & 0 \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 \\
&\le \left\| T_0^{1/2}\left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1} \right\|^2 \left\| \begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0 \\ 0 & 0 \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 \\
&\le \frac{4}{t_J^2}\left\| \begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0 \\ 0 & 0 \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 \\
&\le \frac{12}{t_J^2}\left\| \left(\frac{1}{n}\Phi^{\top}\Phi\right)\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 \\
&\qquad + \frac{12}{t_J^2}\left\| \begin{pmatrix} 0 & \frac{1}{n}\Phi_0^{\top}\Phi_1 \\ \frac{1}{n}\Phi_1^{\top}\Phi_0 & \frac{1}{n}\Phi_1^{\top}\Phi_1 - T_1 \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 \\
&\qquad + \frac{12}{t_J^2}\left\| \begin{pmatrix} 0 & 0 \\ 0 & T_1 \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 \\
&\le \left\{\frac{12}{t_J^2}\left(\lambda^2 + \delta_1^2\right) + 12\,t_{J+1}^2\right\}\|f^{\dagger}\|_{\mathcal{H}}^2.
\end{aligned}
\]
The lemma follows by combining this bound on $\mathrm{IB}_0$ with (24).
Lemma 2. On the event $A(1/2)$,
\[
\frac{1}{n^2}\,\mathrm{E}\!\left\{\left.\left\| T^{1/2}g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\Phi^{\top}\epsilon \right\|^2 \,\right|\, X\right\} \le \left\{2J + \frac{2J}{\lambda}\left(\delta_1 + t_{J+1}^2\right) + \frac{1}{\lambda}\mathrm{tr}(T_1)\right\}\frac{\sigma^2}{n}.
\]
Proof. Suppose we are on the event $A(1/2)$ and write
\[
\frac{1}{n^2}\,\mathrm{E}\!\left\{\left.\left\| T^{1/2}g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\Phi^{\top}\epsilon \right\|^2 \,\right|\, X\right\} = \mathrm{IV}_0 + \mathrm{IV}_1, \qquad (25)
\]
where
\[
\mathrm{IV}_0 = \frac{1}{n^2}\,\mathrm{E}\!\left\{\left.\left\| \begin{pmatrix} T_0^{1/2} & 0 \\ 0 & 0 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\Phi^{\top}\epsilon \right\|^2 \,\right|\, X\right\}, \qquad
\mathrm{IV}_1 = \frac{1}{n^2}\,\mathrm{E}\!\left\{\left.\left\| \begin{pmatrix} 0 & 0 \\ 0 & T_1^{1/2} \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\Phi^{\top}\epsilon \right\|^2 \,\right|\, X\right\}.
\]
We bound $\mathrm{IV}_0$ and $\mathrm{IV}_1$ separately. To bound $\mathrm{IV}_0$, first we use (2) to obtain
\[
\mathrm{IV}_0 \le \frac{\sigma^2}{n^2}\,\mathrm{tr}\!\left\{\Phi\, g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} T_0 & 0 \\ 0 & 0 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\Phi^{\top}\right\}. \qquad (26)
\]
If $A, B$ are trace-class operators, then $\mathrm{tr}(AB) = \mathrm{tr}(BA)$; if, furthermore, they are positive and self-adjoint, then $\mathrm{tr}(AB) \le \|A\|\,\mathrm{tr}(B)$ (see, for example, Theorem 18.11 of Conway, 1999).
Thus,
\[
\begin{aligned}
\frac{\sigma^2}{n^2}\,&\mathrm{tr}\!\left\{\Phi\, g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} T_0 & 0 \\ 0 & 0 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\Phi^{\top}\right\}
= \frac{\sigma^2}{n}\,\mathrm{tr}\!\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)^{2}\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} T_0 & 0 \\ 0 & 0 \end{pmatrix}\right\} \\
&= \frac{\sigma^2}{n}\,\mathrm{tr}\!\left\{\begin{pmatrix} \left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{1/2} & 0 \\ 0 & 0 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)^{2}\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} \left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{1/2} & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} \left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1/2} T_0 \left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1/2} & 0 \\ 0 & 0 \end{pmatrix}\right\} \\
&\le \frac{\sigma^2}{n}\left\| \begin{pmatrix} \left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{1/2} & 0 \\ 0 & 0 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)^{2}\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} \left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{1/2} & 0 \\ 0 & 0 \end{pmatrix} \right\| \mathrm{tr}\!\left\{T_0\left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1}\right\} \\
&= \frac{\sigma^2}{n}\left\| \frac{1}{\sqrt{n}}\Phi\, g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0 \\ 0 & 0 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{\sqrt{n}}\Phi^{\top} \right\| \mathrm{tr}\!\left\{T_0\left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1}\right\} \\
&\le \frac{2J\sigma^2}{n}\left\| \frac{1}{\sqrt{n}}\Phi\, g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0 \\ 0 & 0 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{\sqrt{n}}\Phi^{\top} \right\|,
\end{aligned}
\]
where the last inequality follows from the fact that we are on the event $A(1/2)$. Combining this with (26) yields
\[
\mathrm{IV}_0 \le \frac{2J\sigma^2}{n}\left\| \frac{1}{\sqrt{n}}\Phi\, g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0 \\ 0 & 0 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{\sqrt{n}}\Phi^{\top} \right\|.
\]
Next, we use the decomposition (20) and the regularization conditions (R1) and (R3) to obtain
\[
\begin{aligned}
\mathrm{IV}_0 &\le \frac{2J\sigma^2}{n}\left\| g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)^{2}\left(\frac{1}{n}\Phi^{\top}\Phi\right)^{2} \right\| \\
&\qquad + \frac{2J\sigma^2}{n}\left\| \frac{1}{\sqrt{n}}\Phi\, g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} 0 & \frac{1}{n}\Phi_0^{\top}\Phi_1 \\ \frac{1}{n}\Phi_1^{\top}\Phi_0 & \frac{1}{n}\Phi_1^{\top}\Phi_1 - T_1 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{\sqrt{n}}\Phi^{\top} \right\| \\
&\qquad + \frac{2J\sigma^2}{n}\left\| \frac{1}{\sqrt{n}}\Phi\, g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\begin{pmatrix} 0 & 0 \\ 0 & T_1 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{\sqrt{n}}\Phi^{\top} \right\| \\
&\le \left\{2J + \frac{2J}{\lambda}\left(\delta_1 + t_{J+1}^2\right)\right\}\frac{\sigma^2}{n}. \qquad (27)
\end{aligned}
\]
Finally, to bound $\mathrm{IV}_1$,
\[
\begin{aligned}
\mathrm{IV}_1 &\le \frac{\sigma^2}{n}\,\mathrm{tr}\!\left\{\begin{pmatrix} 0 & 0 \\ 0 & T_1 \end{pmatrix} g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)^{2}\left(\frac{1}{n}\Phi^{\top}\Phi\right)\right\} \\
&\le \frac{\sigma^2}{n}\,\mathrm{tr}\!\left\{\begin{pmatrix} 0 & 0 \\ 0 & T_1 \end{pmatrix}\right\}\left\| g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)^{2}\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\| \\
&\le \frac{\sigma^2}{\lambda n}\,\mathrm{tr}(T_1). \qquad (28)
\end{aligned}
\]
The result follows by combining (25) and (27)–(28).
In order to prove Theorem 1 and other results in the main text, we require large deviation and moment bounds on quantities related to $n^{-1}\Phi^{\top}\Phi - T$. These bounds are collected in Lemmas 3–5 and Corollary 5 below. Lemma 3 is a general large deviation bound for sums of self-adjoint Hilbert–Schmidt operators; the other results are more tailored to the setting studied in the main text.
Lemma 3. Let $R > 0$ be a positive real constant and consider a finite sequence of self-adjoint Hilbert–Schmidt operators $\{X_i\}_{i=1}^n$ satisfying $\mathrm{E}(X_i) = 0$ and $\|X_i\| \le R$ almost surely. Define $Y = \sum_{i=1}^n X_i$ and suppose there are constants $C, d > 0$ satisfying $\|\mathrm{E}(Y^2)\| \le C$ and $\mathrm{tr}\{\mathrm{E}(Y^2)\} \le dC$. For all $t \ge C^{1/2} + R/3$,
\[
P\left(\|Y\| \ge t\right) \le 4d\,\exp\!\left(\frac{-t^2/2}{C + Rt/3}\right).
\]
Proof. This is a straightforward generalization of Theorem 7.7.1 of Tropp (2015), using the arguments from Section 4 of Minsker (2011) to extend from self-adjoint matrices to self-adjoint Hilbert–Schmidt operators.
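In finite dimensions, the tail bound of Lemma 3 can be sanity-checked by simulation. The sketch below is an illustrative setup, not one from the paper: it takes $X_i = \epsilon_i A/n$ for Rademacher signs $\epsilon_i$ and a fixed symmetric matrix $A$, so that $C$, $d$, and $R$ can be computed exactly, and compares Monte Carlo tail frequencies of $\|Y\|$ against the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 20000
A = np.diag([1.0, 0.5])                  # fixed symmetric matrix
opA = np.linalg.norm(A, 2)               # operator norm of A
R = opA / n                              # almost-sure bound on ||X_i||
EY2 = (A @ A) / n                        # E(Y^2) for X_i = eps_i * A / n
C = np.linalg.norm(EY2, 2)
d = np.trace(EY2) / C                    # so that tr{E(Y^2)} = d * C

# Y = (sum of n Rademacher signs) * A / n, hence ||Y|| = |S_n| * ||A|| / n.
S = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)
norms = np.abs(S) * opA / n

checks = []
for t in [0.15, 0.2, 0.3]:               # all t >= C^{1/2} + R/3 here
    empirical = float((norms >= t).mean())
    bound = float(4 * d * np.exp(-(t ** 2 / 2) / (C + R * t / 3)))
    checks.append(empirical <= bound)
```

For each threshold, the empirical tail frequency falls below the Bernstein-type bound, as the lemma predicts.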
Corollary 5. Suppose the conditions of Lemma 3 are satisfied and that $q > 0$ is a constant. Then
\[
\begin{aligned}
\mathrm{E}(\|Y\|) &\le (1 + 15d)\,C^{1/2} + \left(1 + \frac{16d}{3}\right)R, \\
\mathrm{E}(\|Y\|^2) &\le (2 + 32d)\,C + \left(2 + \frac{128d}{9}\right)R^2, \\
\mathrm{E}(\|Y\|^q) &= O\!\left\{(1 + d)\left(C^{q/2} + R^q\right)\right\},
\end{aligned}
\]
where the implicit constant in the big-O bound may depend on $q$, but not on $d$, $C$, or $R$.
Proof. We bound $\mathrm{E}(\|Y\|)$ first. Indeed,
\[
\begin{aligned}
\mathrm{E}(\|Y\|) &= \int_0^{\infty} P(\|Y\| \ge t)\,dt \\
&\le C^{1/2} + \frac{R}{3} + \int_{C^{1/2}+R/3}^{\infty} 4d\,\exp\!\left(\frac{-t^2/2}{C + Rt/3}\right) dt \\
&\le C^{1/2} + \frac{R}{3} + \int_0^{3C/R} 4d\,\exp\!\left(-\frac{t^2}{4C}\right) dt + \int_{3C/R}^{\infty} 4d\,\exp\!\left(-\frac{3t}{4R}\right) dt \\
&= C^{1/2} + \left(\frac{1}{3} + \frac{16}{3}d\,e^{-\frac{9C}{4R^2}}\right) R + \int_0^{3C/R} 4d\,\exp\!\left(-\frac{t^2}{4C}\right) dt \\
&\le \left(1 + 8d\pi^{1/2}\right) C^{1/2} + \left(1 + 16d\,e^{-\frac{9C}{4R^2}}\right)\frac{R}{3} \\
&\le (1 + 15d)\,C^{1/2} + \left(1 + \frac{16d}{3}\right) R.
\end{aligned}
\]
For the second moment bound,
\[
\begin{aligned}
\mathrm{E}(\|Y\|^2) &= \int_0^{\infty} P\!\left(\|Y\| \ge t^{1/2}\right) dt \\
&\le \left(C^{1/2} + \frac{R}{3}\right)^2 + \int_{(C^{1/2}+R/3)^2}^{\infty} 4d\,\exp\!\left(\frac{-t/2}{C + Rt^{1/2}/3}\right) dt \\
&\le \left(C^{1/2} + \frac{R}{3}\right)^2 + \int_0^{9C^2/R^2} 4d\,\exp\!\left(-\frac{t}{4C}\right) dt + \int_{9C^2/R^2}^{\infty} 4d\,\exp\!\left(-\frac{3t^{1/2}}{4R}\right) dt \\
&= \left(C^{1/2} + \frac{R}{3}\right)^2 + 16Cd\left\{1 - \exp\!\left(-\frac{9C}{4R^2}\right)\right\} + \frac{128R^2 d}{9}\left\{\frac{9C}{4R^2} + 1\right\}\exp\!\left(-\frac{9C}{4R^2}\right) \\
&= \left(C^{1/2} + \frac{R}{3}\right)^2 + 16Cd\left\{1 + \exp\!\left(-\frac{9C}{4R^2}\right)\right\} + \frac{128R^2 d}{9}\exp\!\left(-\frac{9C}{4R^2}\right) \\
&\le \left(C^{1/2} + \frac{R}{3}\right)^2 + 32Cd + \frac{128R^2 d}{9} \\
&\le (2 + 32d)\,C + \left(2 + \frac{128d}{9}\right) R^2.
\end{aligned}
\]
Finally, to bound $\mathrm{E}(\|Y\|^q)$,
\[
\begin{aligned}
\mathrm{E}(\|Y\|^q) &= \int_0^{\infty} P\!\left(\|Y\| \ge t^{1/q}\right) dt \\
&\le \left(C^{1/2} + \frac{R}{3}\right)^q + \int_{(C^{1/2}+R/3)^q}^{\infty} 4d\,\exp\!\left(\frac{-t^{2/q}/2}{C + Rt^{1/q}/3}\right) dt \\
&\le \left(C^{1/2} + \frac{R}{3}\right)^q + \int_0^{(3C/R)^q} 4d\,\exp\!\left(-\frac{t^{2/q}}{4C}\right) dt + \int_{(3C/R)^q}^{\infty} 4d\,\exp\!\left(-\frac{3t^{1/q}}{4R}\right) dt \\
&= O\!\left\{(1 + d)\left(C^{q/2} + R^q\right)\right\},
\end{aligned}
\]
as was to be shown.
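Each of the three moment bounds above rests on the layer-cake identity $\mathrm{E}(Z^q) = \int_0^\infty P(Z \ge t^{1/q})\,dt$ for a nonnegative random variable $Z$. A quick numerical check of the identity, using an exponential variable chosen only because its tail is available in closed form:

```python
import numpy as np

# Layer-cake identity: E(Z^q) = integral_0^inf P(Z >= t^{1/q}) dt for Z >= 0.
# Check with Z ~ Exponential(1): P(Z >= s) = exp(-s) and E(Z^2) = 2.
q = 2
t = np.linspace(0.0, 400.0, 400_001)       # tail beyond t = 400 is ~exp(-20)
tail = np.exp(-t ** (1.0 / q))             # P(Z >= t^{1/q})
# Trapezoidal rule written out by hand, to avoid depending on a particular
# numpy version's name for the quadrature helper.
moment = float(np.sum((tail[:-1] + tail[1:]) * np.diff(t)) / 2.0)
```

The quadrature returns a value close to $\mathrm{E}(Z^2) = 2$.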
Lemma 4. Let $\delta, \delta_1$ be as defined in (18)–(19) and let $q > 0$ be a constant. Then
\[
\begin{aligned}
\mathrm{E}(\delta) &\le \frac{16\kappa^2}{n^{1/2}} + \frac{6\kappa^2}{n}, &\qquad&(29) \\
\mathrm{E}(\delta^2) &\le \frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2}, &&(30) \\
\mathrm{E}(\delta^q) &= O\!\left\{\left(\frac{\kappa^4}{n}\right)^{q/2}\right\} &&(31)
\end{aligned}
\]
and
\[
\begin{aligned}
\mathrm{E}(\delta_1) &\le 16\left\{\frac{\kappa^2}{n}\mathrm{tr}(T_1)\right\}^{1/2} + \frac{6\kappa^2}{n}, &\qquad&(32) \\
\mathrm{E}(\delta_1^2) &\le \frac{34\kappa^2}{n}\mathrm{tr}(T_1) + \frac{15\kappa^4}{n^2}, &&(33) \\
\mathrm{E}(\delta_1^q) &= O\!\left[\left\{\frac{\kappa^2}{n}\mathrm{tr}(T_1)\right\}^{q/2} + \left(\frac{\kappa^2}{n}\right)^q\right]. &&(34)
\end{aligned}
\]
Proof. These bounds follow from Corollary 5. To prove the bounds on $\delta$, we take $Y = n^{-1}\Phi^{\top}\Phi - T = \sum_{i=1}^n X_i$ and $X_i = n^{-1}(\phi_i\phi_i^{\top} - T)$, where $\phi_i = (\phi_1(x_i), \phi_2(x_i), \ldots)^{\top}$. Clearly $\mathrm{E}(X_i) = 0$ and $\|X_i\| \le \kappa^2/n$. Additionally, since
\[
\mathrm{E}(Y^2) = \frac{1}{n}\mathrm{E}\!\left\{\left(\phi_i\phi_i^{\top} - T\right)^2\right\} = \frac{1}{n}\mathrm{E}\!\left(\|\phi_i\|^2\phi_i\phi_i^{\top}\right) - \frac{1}{n}T^2,
\]
it follows that $\|\mathrm{E}(Y^2)\| \le \kappa^4/n$ and $\mathrm{tr}\{\mathrm{E}(Y^2)\} \le \kappa^4/n$. The bounds (29)–(31) now follow from a direct application of Corollary 5 with $C = \kappa^4/n$, $d = 1$, and $R = \kappa^2/n$.
To prove (32)–(34), take
\[
Y = \begin{pmatrix} 0 & \frac{1}{n}\Phi_0^{\top}\Phi_1 \\ \frac{1}{n}\Phi_1^{\top}\Phi_0 & \frac{1}{n}\Phi_1^{\top}\Phi_1 - T_1 \end{pmatrix} = \sum_{i=1}^n X_i,
\]
where
\[
X_i = \frac{1}{n}\begin{pmatrix} 0 & \phi_{0,i}\phi_{1,i}^{\top} \\ \phi_{1,i}\phi_{0,i}^{\top} & \phi_{1,i}\phi_{1,i}^{\top} - T_1 \end{pmatrix}
\]
and
\[
\phi_{0,i} = (\phi_1(x_i), \ldots, \phi_J(x_i))^{\top}, \qquad \phi_{1,i} = (\phi_{J+1}(x_i), \phi_{J+2}(x_i), \ldots)^{\top}.
\]
Observe that $\mathrm{E}(X_i) = 0$ and $\|X_i\| \le \kappa^2/n$. Additionally, since
\[
\begin{aligned}
\mathrm{E}(Y^2) &= n\,\mathrm{E}(X_i^2) \\
&= \frac{1}{n}\mathrm{E}\!\left\{\begin{pmatrix} \|\phi_{1,i}\|^2\phi_{0,i}\phi_{0,i}^{\top} & \phi_{0,i}\phi_{1,i}^{\top}\left(\phi_{1,i}\phi_{1,i}^{\top} - T_1\right) \\ \left(\phi_{1,i}\phi_{1,i}^{\top} - T_1\right)\phi_{1,i}\phi_{0,i}^{\top} & \left(\phi_{1,i}\phi_{1,i}^{\top} - T_1\right)^2 \end{pmatrix}\right\} \\
&= \frac{1}{n}\mathrm{E}\!\left(\|\phi_{1,i}\|^2\phi_i\phi_i^{\top}\right) - \frac{1}{n}\begin{pmatrix} 0 & 0 \\ 0 & T_1^2 \end{pmatrix},
\end{aligned}
\]
it follows that $\|\mathrm{E}(Y^2)\| \le \kappa^2 t_{J+1}^2/n \le \kappa^2\,\mathrm{tr}(T_1)/n$ and $\mathrm{tr}\{\mathrm{E}(Y^2)\} \le \kappa^2\,\mathrm{tr}(T_1)/n$. Bounds (32)–(34) follow from Corollary 5, with $C = \kappa^2\,\mathrm{tr}(T_1)/n$, $d = 1$, and $R = \kappa^2/n$.
Lemma 5. Suppose that $\tau_J^2 \ge 0$ and
\[
\sup_{x \in \mathcal{X}} \sum_{j=1}^J \psi_j(x)^2 \le \frac{1}{\tau_J^2}.
\]
If $0 \le \delta < 1$, then
\[
\Pr\{A(\delta)\} \ge 1 - J\left\{\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right\}^{\tau_J^2 n}.
\]
Proof. If $\tau_J^2 = 0$, the result is trivial. For $\tau_J^2 > 0$, this is a direct application of Theorem 5.1.1 from Tropp (2015). Indeed, consider the decomposition
\[
\frac{1}{n}\Psi_0^{\top}\Psi_0 = \sum_{i=1}^n X_i, \qquad X_i = \frac{1}{n}\psi_{0,i}\psi_{0,i}^{\top},
\]
where $\psi_{0,i} = (\psi_1(x_i), \ldots, \psi_J(x_i))^{\top}$. The lemma follows by Theorem 5.1.1 of Tropp (2015), upon noticing that
\[
\|X_i\| = \left\|\frac{1}{n}\psi_{0,i}\psi_{0,i}^{\top}\right\| = \frac{1}{n}\sum_{j=1}^J \psi_j(x_i)^2 \le \frac{1}{\tau_J^2 n}
\qquad\text{and}\qquad
\mathrm{E}\!\left(\frac{1}{n}\Psi_0^{\top}\Psi_0\right) = I.
\]
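A small simulation illustrates the matrix-Chernoff mechanism behind Lemma 5. In the toy design below (an assumption made only for this illustration), each $\psi_{0,i}$ is $\sqrt{J}$ times a uniformly random standard basis vector, so that $\mathrm{E}(n^{-1}\Psi_0^{\top}\Psi_0) = I$ and $\sup_x \sum_j \psi_j(x)^2 = J = 1/\tau_J^2$; the empirical frequency of the event $A(1/2)$ is then compared with the lower bound of the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
J, n, trials, delta = 5, 500, 2000, 0.5
tau2 = 1.0 / J                       # sup_x sum_j psi_j(x)^2 = J = 1/tau2

hits = 0
for _ in range(trials):
    cols = rng.integers(0, J, size=n)
    Psi0 = np.zeros((n, J))
    Psi0[np.arange(n), cols] = np.sqrt(J)     # psi_{0,i} = sqrt(J) * e_{cols[i]}
    lam_min = float(np.linalg.eigvalsh(Psi0.T @ Psi0 / n)[0])
    hits += lam_min >= delta

empirical = hits / trials
chernoff = 1 - J * (np.exp(-delta) / (1 - delta) ** (1 - delta)) ** (tau2 * n)
```

With $n/J = 100$ observations per coordinate, the Chernoff lower bound is already extremely close to one, and the empirical frequency respects it.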
Appendix B: Proofs of Results from the Main Text
Proof of Proposition 2. We decompose $B_\rho(g_\lambda)$ according to whether we are on the event $A(1/2)$ or not. Let
\[
B_\rho(g_\lambda) = B_\rho\{g_\lambda; A(1/2)\} + B_\rho\{g_\lambda; A(1/2)^c\}, \qquad (35)
\]
where
\[
\begin{aligned}
B_\rho\{g_\lambda; A(1/2)\} &= \mathrm{E}\!\left[\left\| T^{1/2}\left\{I - g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi\right\}\beta \right\|^2 ;\, A(1/2)\right], \\
B_\rho\{g_\lambda; A(1/2)^c\} &= \mathrm{E}\!\left[\left\| T^{1/2}\left\{I - g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi\right\}\beta \right\|^2 ;\, A(1/2)^c\right].
\end{aligned}
\]
By Lemmas 1 and 4,
\[
\begin{aligned}
B_\rho\{g_\lambda; A(1/2)\} &\le \mathrm{E}\!\left\{\left(\frac{12}{t_J^2}\lambda^2 + \frac{12}{t_J^2}\delta_1^2 + 13\,t_{J+1}^2\right)\|f^{\dagger}\|_{\mathcal{H}}^2 ;\, A(1/2)\right\} \\
&\le \left\{\frac{12}{t_J^2}\lambda^2 + \frac{12}{t_J^2}\mathrm{E}(\delta_1^2) + 13\,t_{J+1}^2\right\}\|f^{\dagger}\|_{\mathcal{H}}^2 \\
&\le \left\{\frac{12}{t_J^2}\lambda^2 + \frac{408\kappa^2}{n t_J^2}\mathrm{tr}(T_1) + \frac{180\kappa^4}{n^2 t_J^2} + 13\,t_{J+1}^2\right\}\|f^{\dagger}\|_{\mathcal{H}}^2. \qquad (36)
\end{aligned}
\]
On the other hand, since
\[
\left\| T^{1/2}\left\{I - g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi\right\}\beta \right\|^2 \le \kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2,
\]
it follows from Lemma 5 that
\[
B_\rho\{g_\lambda; A(1/2)^c\} \le \kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2 \Pr\{A(1/2)^c\} \le \kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2\, J\left(\frac{2}{e}\right)^{\frac{\tau_J^2 n}{2}}. \qquad (37)
\]
The proposition follows by combining (35)–(37). $\square$

Proof of Proposition 3. The steps of the proof are similar to those in the proof of Proposition 2. Let
\[
V_\rho(g_\lambda) = V_\rho\{g_\lambda; A(1/2)\} + V_\rho\{g_\lambda; A(1/2)^c\}, \qquad (38)
\]
where
\[
\begin{aligned}
V_\rho\{g_\lambda; A(1/2)\} &= \frac{1}{n^2}\mathrm{E}\!\left\{\left\| T^{1/2}g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\Phi^{\top}\epsilon \right\|^2 ;\, A(1/2)\right\}, \\
V_\rho\{g_\lambda; A(1/2)^c\} &= \frac{1}{n^2}\mathrm{E}\!\left\{\left\| T^{1/2}g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\Phi^{\top}\epsilon \right\|^2 ;\, A(1/2)^c\right\}.
\end{aligned}
\]
By Lemmas 2 and 4,
\[
\begin{aligned}
V_\rho\{g_\lambda; A(1/2)\} &\le \mathrm{E}\!\left[\left\{2J + \frac{2J}{\lambda}\left(\delta_1 + t_{J+1}^2\right) + \frac{1}{\lambda}\mathrm{tr}(T_1)\right\}\frac{\sigma^2}{n} ;\, A(1/2)\right] \\
&\le \left[2J + \frac{2J}{\lambda}\left\{\mathrm{E}(\delta_1) + t_{J+1}^2\right\} + \frac{1}{\lambda}\mathrm{tr}(T_1)\right]\frac{\sigma^2}{n} \\
&\le \left[2J + \frac{32J}{\lambda}\left\{\frac{\kappa^2}{n}\mathrm{tr}(T_1)\right\}^{1/2} + \frac{12\kappa^2 J}{\lambda n} + \frac{2J t_{J+1}^2}{\lambda} + \frac{1}{\lambda}\mathrm{tr}(T_1)\right]\frac{\sigma^2}{n}. \qquad (39)
\end{aligned}
\]
On the other hand, since
\[
\frac{1}{n^2}\mathrm{E}\!\left\{\left.\left\| T^{1/2}g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\Phi^{\top}\epsilon \right\|^2 \,\right|\, X\right\} \le \frac{\kappa^2\sigma^2}{\lambda n},
\]
it follows from Lemma 5 that
\[
V_\rho\{g_\lambda; A(1/2)^c\} \le \frac{\kappa^2\sigma^2}{\lambda n}\Pr\{A(1/2)^c\} \le \frac{\kappa^2\sigma^2 J}{\lambda n}\left(\frac{2}{e}\right)^{\frac{\tau_J^2 n}{2}}. \qquad (40)
\]
The proposition follows by combining (38)–(40). $\square$

Proof of Corollary 1. Choose $J$ in Theorem 1 so that $t_{J+1}^2 < \lambda \le t_J^2$. Then $J \le C^{\frac{1}{2\nu}}\lambda^{-\frac{1}{2\nu}}$, $n = \lambda^{-(1+\frac{1}{2\nu})}$ and, applying Theorem 1 with $\tau_j^2 = t_j^2/\kappa^2$,
\[
\begin{aligned}
R_\rho(\hat{f}_\lambda) &\le \left\{20 C^{\frac{1}{2\nu}}\lambda + \left(1 + 16 C^{\frac{1}{2\nu}}\kappa^2\right)\lambda^{\frac{1}{2\nu}}\sum_{j=J+1}^{\infty} t_j^2\right\}\sigma^2 \\
&\qquad + \left\{25\lambda + 408\kappa^2\lambda^{\frac{1}{2\nu}}\sum_{j=J+1}^{\infty} t_j^2\right\}\|f^{\dagger}\|_{\mathcal{H}}^2 + o\!\left\{\left(\sigma^2 + \|f^{\dagger}\|_{\mathcal{H}}^2\right)\lambda\right\}.
\end{aligned}
\]
Since $\lambda = n^{-\frac{2\nu}{2\nu+1}}$, the corollary will follow if we can show that
\[
\sum_{j=J+1}^{\infty} t_j^2 = O\!\left(\lambda^{1-\frac{1}{2\nu}}\right). \qquad (41)
\]
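The tail-sum claim (41) is easy to check numerically for a concrete polynomial decay profile. The sketch below uses the illustrative constants $C = 2$ and $\nu = 3/2$ (chosen only for this demonstration), picks $J$ so that $t_{J+1}^2 < \lambda \le t_J^2$, and compares the tail sum against the explicit bound derived in (42):

```python
import numpy as np

C, nu = 2.0, 1.5                         # illustrative decay t_j^2 = C j^{-2 nu}
j = np.arange(1, 2_000_001, dtype=float)
t2 = C * j ** (-2 * nu)

tails, bounds = [], []
for lam in [1e-2, 1e-3, 1e-4]:
    J = int(np.sum(t2 >= lam))           # largest J with t_J^2 >= lam
    tails.append(float(t2[J:].sum()))    # sum_{j > J} t_j^2 (truncated series)
    bounds.append(float((1 + 2 * nu / (2 * nu - 1) * C ** (1 / (2 * nu)))
                        * lam ** (1 - 1 / (2 * nu))))
```

For each value of $\lambda$, the truncated tail sum falls below the bound, consistent with the $O(\lambda^{1-\frac{1}{2\nu}})$ rate.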
Let $J_0 = \left\lfloor C^{\frac{1}{2\nu}}\lambda^{-\frac{1}{2\nu}} \right\rfloor + 1$. Then
\[
\begin{aligned}
\sum_{j=J+1}^{\infty} t_j^2 &\le \sum_{j=J+1}^{J_0} t_j^2 + \sum_{j=J_0+1}^{\infty} t_j^2 \\
&\le J_0\lambda + C\int_{J_0}^{\infty} t^{-2\nu}\,dt \\
&\le J_0\lambda + \frac{C}{2\nu - 1} J_0^{1-2\nu} \\
&\le \left(1 + \frac{2\nu}{2\nu - 1} C^{\frac{1}{2\nu}}\right)\lambda^{1-\frac{1}{2\nu}}, \qquad (42)
\end{aligned}
\]
which yields (41). This completes the proof of Corollary 1. $\square$

Proof of Corollary 2. Choose $J$ in Theorem 1 so that
\[
t_{J+1}^2 < \frac{2\kappa^2}{\log(e/2)}\lambda \le t_J^2.
\]
Then $J = O\{\log(n)\}$ and it follows from Theorem 1 (with $\tau_j^2 = t_j^2/\kappa^2$) that
\[
R_\rho(\hat{f}_\lambda) \le O\!\left[\left(\sigma^2 + \|f^{\dagger}\|_{\mathcal{H}}^2\right)\left\{\lambda + \frac{1}{\log(n)}\sum_{j=J+1}^{\infty} t_j^2\right\}\right].
\]
The corollary will follow if we can show that
\[
\sum_{j=J+1}^{\infty} t_j^2 = O\!\left\{\frac{\log(n)^2}{n}\right\}.
\]
Let $J_0 = \lfloor \alpha^{-1}\log(n) \rfloor + 1$. Then
\[
\sum_{j=J+1}^{\infty} t_j^2 \le J_0\,t_{J+1}^2 + \sum_{j=J_0+1}^{\infty} t_j^2 \le O\!\left\{\frac{\log(n)^2}{n}\right\} + C\sum_{j=J_0+1}^{\infty} e^{-\alpha j} = O\!\left\{\frac{\log(n)^2}{n}\right\},
\]
as was to be shown. $\square$

Proof of Corollary 3. Choose $J$ in Theorem 1 so that $t_{J+1}^2 < \lambda \le t_J^2$. Then $J = O\{\log(n)^{1/2}\}$ and, by Theorem 1 with $\tau_j^2 = C^{-1}e^{-\alpha j^2}$,
\[
R_\rho(\hat{f}_\lambda) = O\!\left[\left(\sigma^2 + \|f^{\dagger}\|_{\mathcal{H}}^2\right)\left\{\lambda + \frac{1}{\log(n)^{1/2}}\sum_{j=J+1}^{\infty} t_j^2\right\}\right].
\]
The corollary will follow if we can show that
\[
\sum_{j=J+1}^{\infty} t_j^2 = O\!\left\{\frac{\log(n)}{n}\right\}.
\]
Let
\[
J_0 = \left\lceil \left\{\frac{1}{\alpha}\log\!\left(\frac{C}{2\alpha\lambda}\right)\right\}^{1/2} \right\rceil.
\]
Then
\[
\sum_{j=J+1}^{\infty} t_j^2 \le J_0\lambda + C\int_{J_0}^{\infty} e^{-\alpha t^2}\,dt \le J_0\lambda + \frac{C}{2\alpha J_0}e^{-\alpha J_0^2} = O\!\left\{\frac{\log(n)}{n}\right\},
\]
as was to be shown. $\square$

Proof of Corollary 4. To prove this corollary, the strategy is the same as in Corollaries 1–3. However, we must modify the bounds in Lemmas 1–2. First note that $T_1 = 0$. Then, following the notation in the proof of Lemmas 1–2, on the event $A(1/2)$,
\[
\begin{aligned}
\mathrm{IB} &= \mathrm{IB}_0 \\
&\le \left\| T_0^{1/2}\left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1/2} \right\|^2 \left\| \begin{pmatrix} \left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{1/2} & 0 \\ 0 & 0 \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 \\
&\le \mathrm{tr}\!\left\{T_0\left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1}\right\} \left\| \begin{pmatrix} \left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{1/2} & 0 \\ 0 & 0 \end{pmatrix}\left\{g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\frac{1}{n}\Phi^{\top}\Phi - I\right\}\beta \right\|^2 \\
&\le 2J\lambda\,\|f^{\dagger}\|_{\mathcal{H}}^2.
\end{aligned}
\]
It follows that
\[
B_\rho(g_\lambda) \le \left\{2J\lambda + \kappa^2 J\left(\frac{2}{e}\right)^{\frac{\tau_J^2 n}{2}}\right\}\|f^{\dagger}\|_{\mathcal{H}}^2. \qquad (43)
\]
Additionally, on the event $A(1/2)$,
\[
\mathrm{IV} = \mathrm{IV}_0 \le \frac{\sigma^2}{n}\,\mathrm{tr}\!\left\{T_0\left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1}\right\}\left\| g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\|^2 \le \frac{2J\sigma^2}{n}.
\]
Thus,
\[
V_\rho(g_\lambda) \le \left\{2J + \frac{\kappa^2 J}{\lambda}\left(\frac{2}{e}\right)^{\frac{\tau_J^2 n}{2}}\right\}\frac{\sigma^2}{n}. \qquad (44)
\]
Combining Proposition 1 and (43)–(44), we obtain
\[
R_\rho(\hat{f}_\lambda) \le \left\{2J\lambda + \kappa^2 J\left(\frac{2}{e}\right)^{\frac{\tau_J^2 n}{2}}\right\}\|f^{\dagger}\|_{\mathcal{H}}^2 + \left\{2J + \frac{\kappa^2 J}{\lambda}\left(\frac{2}{e}\right)^{\frac{\tau_J^2 n}{2}}\right\}\frac{\sigma^2}{n}.
\]
Corollary 4 follows immediately. $\square$
Proof of Proposition 4. Pick $J$ so that $t_{J+1}^2 < \lambda \le t_J^2$. For real numbers $\mu_1, \mu_2 \ge 0$ and functions $h : [0,\infty) \to [0,\infty)$, define
\[
K_{\mu_1,\mu_2}(h) = \left\| T^{\mu_1}\, h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) T^{\mu_2} \right\|.
\]
Also define the operator $S$ by $Sh(t) = th(t)$, and let $h_\lambda(t) = 1 - g_\lambda(t)t$. Finally, suppose that
\[
f^{\dagger} = \sum_{j=1}^{\infty} \beta_j\phi_j = \sum_{j=1}^{\infty} \alpha_j t_j^{\mu}\phi_j \in \mathcal{H}_\mu.
\]
Then $\|f^{\dagger}\|_{\mathcal{H}_\mu}^2 = \|\alpha\|^2$, where $\alpha = (\alpha_1, \alpha_2, \ldots)^{\top}$.
To prove the proposition, we follow the same strategy as in the proofs of Propositions 2–3; however, several modifications are required along the way. Following the notation of Lemmas 1–2, our first task is to bound $\mathrm{IB}$. We have
\[
\begin{aligned}
\mathrm{IB} &= \left\| T^{1/2}\left\{I - g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\left(\frac{1}{n}\Phi^{\top}\Phi\right)\right\}\beta \right\|^2 \\
&= \left\| T^{1/2}\left\{I - g_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\left(\frac{1}{n}\Phi^{\top}\Phi\right)\right\}T^{\mu/2}\alpha \right\|^2 \\
&\le K_{1/2,\mu/2}(h_\lambda)^2\,\|f^{\dagger}\|_{\mathcal{H}_\mu}^2. \qquad (45)
\end{aligned}
\]
Furthermore, on $A(1/2)$,
\[
\begin{aligned}
K_{1/2,\mu/2}(h_\lambda) &\le \left\| \begin{pmatrix} T_0^{1/2} & 0 \\ 0 & 0 \end{pmatrix} h_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) T^{\mu/2} \right\| + \left\| \begin{pmatrix} 0 & 0 \\ 0 & T_1^{1/2} \end{pmatrix} h_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) T^{\mu/2} \right\| \\
&\le \left\| T_0^{1/2}\left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1} \right\| \left\| \begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0 \\ 0 & 0 \end{pmatrix} h_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) T^{\mu/2} \right\| + t_{J+1} K_{0,\mu/2}(h_\lambda) \\
&\le \frac{2}{t_J}\left\{ K_{0,\mu/2}(Sh_\lambda) + \left(\delta_1 + t_{J+1}^2\right) K_{0,\mu/2}(h_\lambda) \right\} + t_{J+1} K_{0,\mu/2}(h_\lambda) \\
&\le \frac{2}{t_J} K_{0,\mu/2}(Sh_\lambda) + \left(\frac{2\delta_1}{t_J} + 3\,t_{J+1}\right) K_{0,\mu/2}(h_\lambda). \qquad (46)
\end{aligned}
\]
Next we bound $K_{0,\mu/2}(h_\lambda)$ and $K_{0,\mu/2}(Sh_\lambda)$. Suppose that $\mu/2 = m + \gamma$, where $m \ge 0$ is an integer and $0 \le \gamma < 1$. Recall the definition of $\delta$ from (18). By repeated application of the
triangle inequality,
\[
\begin{aligned}
K_{0,\mu/2}(h) = K_{m+\gamma,0}(h) &= \left\| T^{m+\gamma}\, h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\| \\
&\le \left\| T^{m-1+\gamma}\, Sh\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\| + \delta\,\kappa^{2(m-1+\gamma)}\left\| h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\| \\
&\;\;\vdots \\
&\le \left\| T^{\gamma}\, S^m h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\| + \delta\sum_{j=1}^{m} \kappa^{2(m-j+\gamma)}\left\| S^{j-1} h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\| \\
&= K_{0,\gamma}(S^m h) + \delta\sum_{j=1}^{m} \kappa^{2(m-j+\gamma)}\left\| S^{j-1} h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\|. \qquad (47)
\end{aligned}
\]
Bounding $K_{0,\gamma}(S^m h)$ will require some additional manipulations. However, before proceeding in this direction, we gather our bounds so far. By (46)–(47), and because $\{g_\lambda\}$ has qualification at least $\lceil \mu/2 \rceil$,
\[
K_{1/2,\mu/2}(h_\lambda) = O\!\left\{\frac{1}{\lambda^{1/2}} K_{0,\gamma}(S^{m+1}h_\lambda) + \left(\frac{\delta_1}{\lambda^{1/2}} + \lambda^{1/2}\right) K_{0,\gamma}(S^m h_\lambda) + \delta\lambda^{1/2} + \frac{\delta_1\delta}{\lambda^{1/2}}\right\}
\]
on $A(1/2)$. Combining this with (45) yields
\[
\begin{aligned}
B_\rho(g_\lambda) &= \mathrm{E}\{\mathrm{IB};\, A(1/2)\} + \mathrm{E}\{\mathrm{IB};\, A(1/2)^c\} \\
&\le O\!\left[\mathrm{E}\!\left\{\frac{1}{\lambda} K_{0,\gamma}(S^{m+1}h_\lambda)^2 + \left(\frac{\delta_1^2}{\lambda} + \lambda\right) K_{0,\gamma}(S^m h_\lambda)^2 + \delta^2\lambda + \frac{\delta_1^2\delta^2}{\lambda}\right\}\|f^{\dagger}\|_{\mathcal{H}_\mu}^2\right] \\
&\qquad + \kappa^{2(\mu+1)}\|f^{\dagger}\|_{\mathcal{H}_\mu}^2 \Pr\{A(1/2)^c\}.
\end{aligned}
\]
By Lemma 4,
\[
\mathrm{E}(\delta^2) = O\!\left(\frac{1}{n}\right), \qquad
\mathrm{E}(\delta_1^2\delta^2) = O\!\left\{\frac{\mathrm{tr}(T_1)}{n^2} + \frac{1}{n^3}\right\}.
\]
(Note that in the current proof, we allow the big-O constants to depend on the kernel $K$, while this is not allowed in Lemma 4; see the comment following the statement of Proposition 4 and the statement of Corollary 5.) Thus,
\[
\begin{aligned}
B_\rho(g_\lambda) &\le O\!\left[\mathrm{E}\!\left\{\frac{1}{\lambda} K_{0,\gamma}(S^{m+1}h_\lambda)^2 + \left(\frac{\delta_1^2}{\lambda} + \lambda\right) K_{0,\gamma}(S^m h_\lambda)^2\right\}\|f^{\dagger}\|_{\mathcal{H}_\mu}^2\right] \\
&\qquad + O\!\left[\frac{\lambda}{n}\|f^{\dagger}\|_{\mathcal{H}_\mu}^2 + \frac{1}{\lambda}\left\{\frac{\mathrm{tr}(T_1)}{n^2} + \frac{1}{n^3}\right\}\|f^{\dagger}\|_{\mathcal{H}_\mu}^2\right] + \kappa^{2(\mu+1)}\|f^{\dagger}\|_{\mathcal{H}_\mu}^2 \Pr\{A(1/2)^c\} \\
&= O\!\left[\mathrm{E}\!\left\{\frac{1}{\lambda} K_{0,\gamma}(S^{m+1}h_\lambda)^2 + \left(\frac{\delta_1^2}{\lambda} + \lambda\right) K_{0,\gamma}(S^m h_\lambda)^2\right\}\|f^{\dagger}\|_{\mathcal{H}_\mu}^2\right] + o\!\left(\|f^{\dagger}\|_{\mathcal{H}}^2\lambda^{\mu+1}\right), \qquad (48)
\end{aligned}
\]
where we have also made use of Lemma 5 and (42) to bound $\Pr\{A(1/2)^c\}$ and $\mathrm{tr}(T_1)$, respectively.
To complete our analysis of the bias term $B_\rho(g_\lambda)$, it remains to bound the expectations involving $K_{0,\gamma}$ above. We consider separately the cases where $0 \le \gamma \le 1/2$ and $1/2 < \gamma < 1$. If $1/2 < \gamma < 1$ and $h : [0,\infty) \to [0,\infty)$ is a nonnegative function, then, since $z \mapsto z^{\gamma}$ is operator monotone (Mathe and Pereverzev, 2002),
\[
\begin{aligned}
K_{0,\gamma}(h) = \left\| T^{\gamma}\, h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\|
&\le \left\| T^{\gamma} - \left(\frac{1}{n}\Phi^{\top}\Phi\right)^{\gamma} \right\| K_{0,0}(h) + \left\| \left(\frac{1}{n}\Phi^{\top}\Phi\right)^{\gamma} h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\| \\
&\le \delta^{\gamma} K_{0,0}(h) + \left\| \left(\frac{1}{n}\Phi^{\top}\Phi\right)^{\gamma} h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\|.
\end{aligned}
\]
Hence, taking $h = S^m h_\lambda$ and $h = S^{m+1}h_\lambda$, we conclude that if $1/2 < \gamma < 1$, then
\[
K_{0,\gamma}(S^m h_\lambda) \le \delta^{\gamma}\lambda^m + \lambda^{\mu/2}, \qquad
K_{0,\gamma}(S^{m+1}h_\lambda) \le \delta^{\gamma}\lambda^{m+1} + \lambda^{\mu/2+1}.
\]
Thus, by Lemmas 4 and 5, if $q > 0$ is constant and $1/2 < \gamma < 1$, then
\[
\begin{aligned}
\mathrm{E}\{K_{0,\gamma}(S^m h_\lambda)^q\} &= O\!\left\{n^{-q\gamma/2}\lambda^{qm} + \lambda^{q\mu/2}\right\} = O(\lambda^{q\mu/2}), \qquad (49) \\
\mathrm{E}\{K_{0,\gamma}(S^{m+1}h_\lambda)^q\} &= O\!\left\{n^{-q\gamma/2}\lambda^{q(m+1)} + \lambda^{q(\mu/2+1)}\right\} = O\{\lambda^{q(\mu/2+1)}\}. \qquad (50)
\end{aligned}
\]
On the other hand, if $0 \le \gamma \le 1/2$, then, on $A(1/2)$,
\[
\begin{aligned}
K_{0,\gamma}(h) = \left\| T^{\gamma}\, h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\|
&\le \left\| \begin{pmatrix} T_0^{\gamma} & 0 \\ 0 & 0 \end{pmatrix} h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\| + t_{J+1}^{2\gamma}\, K_{0,0}(h) \\
&\le \left\| T_0^{\gamma}\left(\frac{1}{n}\Phi_0^{\top}\Phi_0\right)^{-1} \right\| \left\| \begin{pmatrix} \frac{1}{n}\Phi_0^{\top}\Phi_0 & 0 \\ 0 & 0 \end{pmatrix} h\!\left(\frac{1}{n}\Phi^{\top}\Phi\right) \right\| + t_{J+1}^{2\gamma}\, K_{0,0}(h) \\
&\le 2\,t_J^{2(\gamma-1)}\left\{ K_{0,0}(Sh) + \left(\delta_1 + t_{J+1}^2\right) K_{0,0}(h) \right\} + t_{J+1}^{2\gamma}\, K_{0,0}(h) \\
&\le 2\lambda^{\gamma-1} K_{0,0}(Sh) + 2\lambda^{\gamma-1}\delta_1 K_{0,0}(h) + 3\lambda^{\gamma} K_{0,0}(h).
\end{aligned}
\]
Again taking $h = S^m h_\lambda$ and $h = S^{m+1}h_\lambda$, if $0 \le \gamma \le 1/2$ and we are on $A(1/2)$, then
\[
K_{0,\gamma}(S^m h_\lambda) \le 2\delta_1\lambda^{\mu/2-1} + 5\lambda^{\mu/2}, \qquad
K_{0,\gamma}(S^{m+1}h_\lambda) \le 2\delta_1\lambda^{\mu/2} + 5\lambda^{\mu/2+1}.
\]
Taking expectations, Lemmas 4–5 imply that if $0 \le \gamma \le 1/2$, then
\[
\begin{aligned}
\mathrm{E}\{K_{0,\gamma}(S^m h_\lambda)^q\} &= O\!\left[\left\{\frac{\mathrm{tr}(T_1)}{n}\right\}^{q/2}\lambda^{q(\mu/2-1)} + \frac{1}{n^q}\lambda^{q(\mu/2-1)} + \lambda^{q\mu/2}\right] = O(\lambda^{q\mu/2}), \qquad (51) \\
\mathrm{E}\{K_{0,\gamma}(S^{m+1}h_\lambda)^q\} &= O\!\left[\left\{\frac{\mathrm{tr}(T_1)}{n}\right\}^{q/2}\lambda^{q\mu/2} + \frac{1}{n^q}\lambda^{q\mu/2} + \lambda^{q(\mu/2+1)}\right] = O\{\lambda^{q(\mu/2+1)}\}. \qquad (52)
\end{aligned}
\]
Thus, combining (49)–(52), we see that
\[
\mathrm{E}\{K_{0,\gamma}(S^m h_\lambda)^q\} = O(\lambda^{q\mu/2}), \qquad (53)
\]
\[
\mathrm{E}\{K_{0,\gamma}(S^{m+1}h_\lambda)^q\} = O\{\lambda^{q(\mu/2+1)}\} \qquad (54)
\]
for all $0 \le \gamma < 1$.
for all 0 ≤ γ < 1. Finally, using (53)–(54) in (48), along with Lemma 4 yields
Bρ(gλ) = O
[E{
1
λK0,γ(S
m+1hλ)2 +
(δ21λ
+ λ
)K0,γ(S
mhλ)2
}||f †||2Hµ
]+ o
(||f †||2Hλµ+1
)= O
(λµ+1||f †||2H
). (55)
To complete the proof, we must bound Vρ(gλ), but this is a direct application of Proposition3: One easily checks that
Vρ(gλ) = O(λµ+1σ2). (56)
The proposition follows by combining (55)–(56) with Proposition 1. $\square$
Proof of Proposition 5. Let $\hat{t}_1^2 \ge \hat{t}_2^2 \ge \cdots \ge 0$ denote the eigenvalues of $n^{-1}\Phi^{\top}\Phi$ and define $\hat{T} = \mathrm{diag}(\hat{t}_1^2, \hat{t}_2^2, \ldots)$. Let $\hat{U}$ be an orthogonal transformation satisfying $n^{-1}\Phi^{\top}\Phi = \hat{U}\hat{T}\hat{U}^{\top}$. Additionally, let $h_\lambda(t) = I\{t \le \lambda\}$, define $\hat{J} = \hat{J}_\lambda = \sup\{j;\ \hat{t}_j^2 > \lambda\}$, and write $\hat{U} = (\hat{U}_{\hat{J}}\ \hat{U}_{\hat{J}^c})$, where $\hat{U}_{\hat{J}}$ is the $\infty \times \hat{J}$ matrix comprised of the first $\hat{J}$ columns of $\hat{U}$ and $\hat{U}_{\hat{J}^c}$ consists of the remaining columns of $\hat{U}$. Finally, define $\hat{T}_1 = \mathrm{diag}(\hat{t}_{\hat{J}+1}^2, \hat{t}_{\hat{J}+2}^2, \ldots)$ and define the $\infty \times J$ matrix $U_J = (I_J\ 0)^{\top}$.
Following the notation from the proof of Lemma 1, consider the following bound on $\mathrm{IB}$:
\[
\mathrm{IB} = \left\| T^{1/2}\, h_\lambda\!\left(\frac{1}{n}\Phi^{\top}\Phi\right)\beta \right\|^2 = \left\| T^{1/2}\hat{U}_{\hat{J}^c}\hat{U}_{\hat{J}^c}^{\top} U_J U_J^{\top}\beta \right\|^2 \le \kappa^2\left\| \hat{U}_{\hat{J}^c}^{\top} U_J \right\|^2 \|f^{\dagger}\|_{\mathcal{H}}^2.
\]
It follows that
\[
B_\rho(g_\lambda) = \mathrm{E}(\mathrm{IB}) \le \kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2\,\mathrm{E}\!\left(\left\| \hat{U}_{\hat{J}^c}^{\top} U_J \right\|^2 ;\ \hat{J} \ge J\right) + \kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2 \Pr(\hat{J} < J). \qquad (57)
\]
Now we bound $\|\hat{U}_{\hat{J}^c}^{\top} U_J\|$ on the event $\{\hat{J} \ge J\}$. We derive this bound from basic principles, but it is essentially the Davis–Kahan inequality (Davis and Kahan, 1970). Let $D = n^{-1}\Phi^{\top}\Phi - T$. Then
\[
D U_J = \frac{1}{n}\Phi^{\top}\Phi U_J - T U_J = \frac{1}{n}\Phi^{\top}\Phi U_J - U_J T_0
\]
and
\[
U_J^{\top} D \hat{U}_{\hat{J}^c} = U_J^{\top}\left(\frac{1}{n}\Phi^{\top}\Phi\right)\hat{U}_{\hat{J}^c} - T_0 U_J^{\top}\hat{U}_{\hat{J}^c} = U_J^{\top}\hat{U}_{\hat{J}^c}\hat{T}_1 - T_0 U_J^{\top}\hat{U}_{\hat{J}^c}.
\]
Now observe that
\[
\begin{aligned}
\left\| \frac{1}{n}\Phi^{\top}\Phi - T \right\| \ge \left\| U_J^{\top} D \hat{U}_{\hat{J}^c} \right\|
&\ge \left\| T_0 U_J^{\top}\hat{U}_{\hat{J}^c} \right\| - \left\| U_J^{\top}\hat{U}_{\hat{J}^c}\hat{T}_1 \right\| \\
&\ge t_J^2\left\| U_J^{\top}\hat{U}_{\hat{J}^c} \right\| - (1-r)\,t_J^2\left\| U_J^{\top}\hat{U}_{\hat{J}^c} \right\| = r\,t_J^2\left\| U_J^{\top}\hat{U}_{\hat{J}^c} \right\|.
\end{aligned}
\]
Thus,
\[
\left\| U_J^{\top}\hat{U}_{\hat{J}^c} \right\| \le \frac{1}{r t_J^2}\left\| \frac{1}{n}\Phi^{\top}\Phi - T \right\| \qquad (58)
\]
on $\{\hat{J} \ge J\}$.
Next we bound
\[
\Pr(\hat{J} < J) = \Pr(\hat{t}_J^2 \le \lambda) = \Pr\!\left\{\hat{t}_J^2 \le (1-r)\,t_J^2\right\}.
\]
By Weyl's inequality,
\[
\left|\hat{t}_J^2 - t_J^2\right| \le \delta = \left\| \frac{1}{n}\Phi^{\top}\Phi - T \right\|.
\]
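Weyl's inequality in the form used here (each eigenvalue moves by at most the operator norm of the perturbation) is easy to confirm numerically for symmetric matrices; the matrices below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 30
B = rng.standard_normal((m, m))
T = (B + B.T) / 2                          # symmetric "population" matrix
E = rng.standard_normal((m, m))
E = (E + E.T) / 2
E *= 0.05 / np.linalg.norm(E, 2)           # symmetric perturbation, ||E|| = 0.05

eig_T = np.linalg.eigvalsh(T)              # ascending eigenvalues
eig_TE = np.linalg.eigvalsh(T + E)
max_shift = float(np.max(np.abs(eig_TE - eig_T)))   # Weyl: at most ||E||
```

Matching eigenvalues in sorted order, no eigenvalue moves by more than the perturbation's operator norm.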
Additionally, by Lemma 3,
\[
\Pr(\delta \ge r t_J^2) \le 4\exp\!\left(-\frac{n r^2 t_J^4}{2\kappa^4 + 2\kappa^2 r t_J^2/3}\right).
\]
Thus,
\[
\Pr(\hat{J} < J) = \Pr\!\left\{\hat{t}_J^2 \le (1-r)\,t_J^2\right\} \le \Pr(\delta \ge r t_J^2) \le 4\exp\!\left(-\frac{n r^2 t_J^4}{2\kappa^4 + 2\kappa^2 r t_J^2/3}\right). \qquad (59)
\]
Combining (57)–(59) and using Lemma 4 gives
\[
\begin{aligned}
B_\rho(g_\lambda) &\le \frac{\kappa^2}{r^2 t_J^4}\|f^{\dagger}\|_{\mathcal{H}}^2\,\mathrm{E}(\delta^2) + 4\kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2\exp\!\left(-\frac{n r^2 t_J^4}{2\kappa^4 + 2\kappa^2 r t_J^2/3}\right) \\
&\le \frac{\kappa^2}{r^2 t_J^4}\|f^{\dagger}\|_{\mathcal{H}}^2\left(\frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2}\right) + 4\kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2\exp\!\left(-\frac{n r^2 t_J^4}{2\kappa^4 + 2\kappa^2 r t_J^2/3}\right).
\end{aligned}
\]
Next, we combine this bound on $B_\rho(g_\lambda)$ with Proposition 3 to obtain
\[
\begin{aligned}
R_\rho(\hat{f}_\lambda) &= B_\rho(g_\lambda) + V_\rho(g_\lambda) \\
&\le \frac{\kappa^2}{r^2 t_J^4}\|f^{\dagger}\|_{\mathcal{H}}^2\left(\frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2}\right) + 4\kappa^2\|f^{\dagger}\|_{\mathcal{H}}^2\exp\!\left(-\frac{n r^2 t_J^4}{2\kappa^4 + 2\kappa^2 r t_J^2/3}\right) \\
&\qquad + \left[2J + \frac{2J t_{J+1}^2}{\lambda} + \frac{1}{\lambda}\mathrm{tr}(T_1) + \frac{32J}{\lambda}\left\{\frac{\kappa^2}{n}\mathrm{tr}(T_1)\right\}^{1/2} + \frac{12\kappa^2 J}{\lambda n} + \frac{\kappa^2 J}{\lambda}\left(\frac{2}{e}\right)^{\frac{t_J^2 n}{2\kappa^2}}\right]\frac{\sigma^2}{n} \\
&= O\!\left\{\left(\|f^{\dagger}\|_{\mathcal{H}}^2 + \sigma^2\right)\frac{J}{n}\right\},
\end{aligned}
\]
as was to be shown. $\square$
References
Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404.
Bach, F. (2013). Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory.
Bauer, F., Pereverzev, S. and Rosasco, L. (2007). On regularization algorithms in learning theory. J. Complexity 23 52–72.
Belitser, E. and Levit, B. (1995). On minimax filtering over ellipsoids. Math. Meth. Statist. 4 259–273.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7 331–368.
Caponnetto, A. and Yao, Y. (2010). Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl. 8 161–183.
Carmeli, C., De Vito, E. and Toigo, A. (2006). Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Anal. Appl. 4 377–408.
Conway, J. (1999). A Course in Operator Theory. American Mathematical Society.
Davis, C. and Kahan, W. (1970). The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7 1–46.
Dhillon, P., Foster, D., Kakade, S. and Ungar, L. (2013). A risk comparison of ordinary least squares vs ridge regression. J. Mach. Learn. Res. 14 1505–1511.
Dicker, L. (2015). Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli. To appear.
Drineas, P. and Mahoney, M. (2005). On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res. 6 2153–2175.
Engl, H., Hanke, M. and Neubauer, A. (1996). Regularization of Inverse Problems. Mathematics and Its Applications, Vol. 375. Springer.
Fasshauer, G. and McCourt, M. (2012). Stable evaluation of Gaussian radial basis function interpolants. SIAM J. Sci. Comput. 34 A737–A762.
Gittens, A. and Mahoney, M. (2013). Revisiting the Nystrom method for improved large-scale machine learning. In International Conference on Machine Learning.
Golubev, Y., Levit, B. and Tsybakov, A. (1996). Asymptotically efficient estimation of analytic functions in Gaussian noise. Bernoulli 2 167–181.
Guo, Y., Bartlett, P., Shawe-Taylor, J. and Williamson, R. (2002). Covering numbers for support vector machines. IEEE T. Inform. Theory 48 239–250.
Gyorfi, L., Kohler, M., Krzyzak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.
Hsu, D., Kakade, S. and Zhang, T. (2014). Random design analysis of ridge regression. Found. Comput. Math. 14 569–600.
Ibragimov, I. and Khas'minskii, R. (1983). Estimation of distribution density belonging to a class of entire functions. Theor. Probab. Appl. 27 551–562.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34 2593–2656.
Lo Gerfo, L., Rosasco, L., Odone, F., De Vito, E. and Verri, A. (2008). Spectral algorithms for supervised learning. Neural Comput. 20 1873–1897.
Lopez-Paz, D., Sra, S., Smola, A., Ghahramani, Z. and Scholkopf, B. (2014). Randomized nonlinear component analysis. In International Conference on Machine Learning.
Mathe, P. (2005). Saturation of regularization methods for linear ill-posed problems in Hilbert spaces. SIAM J. Numer. Anal. 42 968–973.
Mathe, P. and Pereverzev, S. (2002). Moduli of continuity for operator valued functions. Numer. Func. Anal. Opt. 23 623–631.
Minsker, S. (2011). On some extensions of Bernstein's inequality for self-adjoint operators. ArXiv e-prints.
Nemirovskii, A. (1985). Nonparametric estimation of smooth regression functions. Sov. J. Comput. Syst. Sci. 23 1–11.
Nemirovskii, A., Polyak, B. and Tsybakov, A. (1983). Estimators of maximum likelihood type for nonparametric regression. Dokl. Math. 28 788–792.
Neubauer, A. (1997). On converse and saturation results for Tikhonov regularization of linear ill-posed problems. SIAM J. Numer. Anal. 34 517–527.
Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.
Rosasco, L., De Vito, E. and Verri, A. (2005). Spectral methods for regularization in learning theory. Tech. Rep. DISI-TR-05-18, Universita degli Studi di Genova, Italy.
Steinwart, I., Hush, D. and Scovel, C. (2009). Optimal rates for regularized least squares regression. In Conference on Learning Theory.
Tropp, J. (2015). An Introduction to Matrix Concentration Inequalities. ArXiv e-prints.
Tsybakov, A. (2004). Introduction a l'estimation non-parametrique. Springer.
Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
Williams, C. and Seeger, M. (2001). Using the Nystrom method to speed up kernel machines. In Advances in Neural Information Processing Systems.
Yang, Y., Pilanci, M. and Wainwright, M. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. ArXiv e-prints.
Zhang, T. (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Comput. 17 2077–2098.
Zhang, Y., Duchi, J. and Wainwright, M. (2013). Divide and conquer kernel ridge regression. In Conference on Learning Theory.