LETTER Communicated by Alessandro Verri

Regularization Techniques and Suboptimal Solutions to Optimization Problems in Learning from Data

Giorgio Gnecco
[email protected]
Departments of Communications, Computer, and System Sciences and of Computer and Information Science, University of Genova, Genova 16146, Italy

Marcello Sanguineti
[email protected]
Department of Communications, Computer, and System Sciences, University of Genova, Genova 16146, Italy

Various regularization techniques are investigated in supervised learning from data. Theoretical features of the associated optimization problems are studied, and sparse suboptimal solutions are searched for. Rates of approximate optimization are estimated for sequences of suboptimal solutions formed by linear combinations of n-tuples of computational units, and statistical learning bounds are derived. As hypothesis sets, reproducing kernel Hilbert spaces and their subsets are considered.

1 Introduction

For nonempty sets X and Y, Y ⊂ R, a joint probability measure ρ on X × Y, and a loss function V : R^2 → [0, +∞) such that V(u, u) = 0 for all u ∈ R, statistical learning theory (SLT) (Cucker & Smale, 2001; Vapnik, 1998) models the learning problem as the minimization of the expected error functional EV( f ) = ∫_{X×Y} V( f (x), y) dρ over a suitable linear space H of functions, called the hypothesis space. The quantity V( f (x), y) measures how much is lost when f (x) is computed instead of y; a typical choice is the square loss V( f (x), y) = ( f (x) − y)^2. The mapping fρ : X → R, defined as fρ(x) = ∫_R y dρ(y | x), is called the regression function; if fρ ∈ H and the square loss is used, then fρ is the unique minimizer over H of the expected error functional (Cucker & Smale, 2001).

In practice, since ρ usually is unknown or difficult to model, the minimization of the expected error functional is replaced by the minimization of a suitable empirical approximation, which depends on a data sample z = {(xi, yi) ∈ X × Y, i = 1, . . . , m}, also called the training set. Typically the data sample is made up of m independent and identically distributed (i.i.d.) pairs (xi, yi), generated according to the probability measure ρ. The empirical error functional is defined as Ez,V( f ) = (1/m) ∑_{i=1}^m V( f (xi), yi). To simplify

Neural Computation 22, 793–829 (2010) © 2009 Massachusetts Institute of Technology


the notation, we denote by E and Ez the expected and empirical error functionals, respectively, with the square loss function, that is, we let

E( f ) = ∫_{X×Y} ( f (x) − y)^2 dρ   and   Ez( f ) = (1/m) ∑_{i=1}^m ( f (xi) − yi)^2.
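
As a tiny numeric illustration of the square-loss empirical error just defined (the candidate hypothesis and the numbers below are arbitrary, not taken from the letter):

import numpy as np

# E_z(f) = (1/m) * sum_i (f(x_i) - y_i)^2 on a toy sample of size m = 3
x = np.array([0.0, 0.5, 1.0]); y = np.array([0.1, 0.6, 0.9])
f = lambda t: t                                  # a candidate hypothesis
print("E_z(f) =", np.mean((f(x) - y) ** 2))      # = (0.01 + 0.01 + 0.01)/3 = 0.01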

For suitable choices of subsets of the hypothesis space H, various regularization techniques (Tikhonov & Arsenin, 1977), interpreted according to SLT (Evgeniou, Pontil, & Poggio, 2000), allow one to derive upper bounds on the difference between the expected and empirical errors. The bounds hold uniformly over such subsets of the hypothesis space, with high a priori probability with respect to the random draw of the data sample z; in the following, we shall refer to them as SLT bounds.

As an example of a regularization technique, one can add to the empirical error functional one term, called a stabilizer, which penalizes undesired properties of a candidate solution. For example, when H is a linear space endowed with the norm ‖ · ‖H, one may search for a minimizer of the regularized empirical error functional Ez( f ) + γ ‖ f ‖H^2, where γ > 0 is called the regularization parameter, thus considering the problem¹

min_{f∈H} (Ez( f ) + γ ‖ f ‖H^2). (1.1)

The parameter γ controls the trade-off between the two requirements of (1) fitting to the data sample (via the value Ez( f ) of the empirical error in correspondence of f ) and (2) penalizing every candidate solution f with a large value of the squared norm ‖ f ‖H^2. For suitable choices of the hypothesis space H, the norm ‖ f ‖H is a measure of the smoothness of the function (see, e.g., Girosi, 1998). Alternatively, one can restrict the minimization of the empirical error to a subset M of H, which is called the hypothesis set and contains only functions with a desired behavior, so one considers the problem

min_{f∈M} Ez( f ). (1.2)

A third possibility is to combine the two methods (use of a stabilizer and restriction to a hypothesis set). These three approaches are applications to learning from data of regularization techniques for ill-posed inverse problems developed during the 1960s. The first is an application of Tikhonov regularization, the second and the third of Ivanov and Miller regularizations, respectively. A related form of regularization is Phillips regularization (Bertero, 1989).

¹Whenever we write "min," we implicitly suppose that such a minimum exists. If it does not, we mean that we are interested, for ε > 0, in an ε-near minimum point. For example, when the minimum in equation 1.1 is not achieved, we are interested in fε ∈ H : Ez( fε) + γ ‖ fε‖H^2 ≤ inf_{f∈H} (Ez( f ) + γ ‖ f ‖H^2) + ε.


When the hypothesis space is a Hilbert space of a special type, that is, a reproducing kernel Hilbert space (Aronszajn, 1950), the optimal solutions to the learning problems associated with Tikhonov, Ivanov, Phillips, and Miller regularizations are expressed in terms of linear combinations of the columns of an m × m matrix obtained on the basis of the data sample (see theorem 1). Unfortunately, in general, the number of nonzero coefficients in such linear combinations is not much smaller than the size m of the data sample; in other words, such optimal solutions are not sparse. A sparse suboptimal solution f s has several advantages over an optimal nonsparse one f o; for example, storing its coefficients requires less memory, and computing f s(x) for x not belonging to the data sample requires less time than computing f o(x). Motivated by these advantages, we investigate rates of approximate optimization achievable by sparse suboptimal solutions for the regularization techniques already identified.

The letter is both a collection of original contributions and a survey of related results. Propositions 1(iv), 3, 4, and 6 and the results in sections 4, 5, and 6 are original contributions. In proposition 1(iv), we prove the existence of a solution to Miller regularization in a slightly more general setting than the one considered in Miller (1970). Propositions 3 and 4 provide lower and upper bounds on the parameters in the Ivanov-regularized learning problem. Proposition 6 states a simple but useful property of sparse suboptimal solutions to the four regularized learning problems. Section 4 improves rates of approximate optimization by sparse suboptimal solutions derived in Kurkova and Sanguineti (2005b) and Zhang (2002a) for Tikhonov regularization. Sections 5 and 6 give new estimates for Ivanov, Miller, and Phillips regularization in learning from data. For all these techniques, we also investigate SLT bounds for sparse suboptimal solutions: we provide conditions under which one can control with high probability (uniformly over the family of functions in which one searches for suboptimal solutions) the difference between the expected and the empirical errors.

As to the survey contribution of the letter, part of section 2 summarizes known relationships among Tikhonov, Ivanov, Phillips, and Miller regularizations. The first part of section 3 discusses some drawbacks of their respective optimal solutions that motivate the search for sparse suboptimal solutions. Section 7 provides a short discussion of some available algorithms that can be used to find sparse suboptimal solutions to the regularized learning problems.

To make the letter self-contained, in the appendix we summarize some tools from SLT, exploited in sections 4 to 6.

2 Comparison of Regularization Techniques

Reproducing kernel Hilbert spaces (RKHSs), defined almost sixty years ago (Aronszajn, 1950), are now commonly used as hypothesis spaces in learning theory. Combined with regularization techniques, they often allow good generalization capabilities of the learned models and enforce desired smoothness properties of the solutions to the learning problems.

Given a nonempty set X, an RKHS is a Hilbert space HK(X) made up of functions on X, such that for every u ∈ X, the evaluation functional Fu, defined for every f ∈ HK(X) as Fu( f ) = f (u), is bounded (Aronszajn, 1950; Berg, Christensen, & Ressel, 1984; Cucker & Smale, 2001). We consider real RKHSs. By the Riesz representation theorem (Friedman, 1982), for every u ∈ X, there exists a unique element Ku ∈ HK(X) such that for every f ∈ HK(X), one has

Fu( f ) = 〈 f, Ku〉K, (2.1)

where 〈·, ·〉K denotes the inner product in HK(X); this is called the reproducing property. The symmetric function K : X × X → R, defined for every u, v ∈ X as K(u, v) = 〈Ku, Kv〉K, is called the kernel associated with HK(X). The kernel K is a positive-semidefinite (psd) function: for all positive integers m, all (w1, . . . , wm) ∈ R^m, and all (u1, . . . , um) ∈ X^m, one has ∑_{i,j=1}^m wi wj K(ui, uj) ≥ 0. The kernel is called positive definite (pd) if the equality in the previous definition holds only for (w1, . . . , wm) = (0, . . . , 0), when all the ui are different. By equation 2.1, K(u, ·) ∈ HK(X), and it coincides with Ku.
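
As an illustration (not taken from the letter), the following minimal sketch builds the Gram matrix of a Gaussian kernel on a few toy inputs and checks positive semidefiniteness numerically via the spectrum; the kernel width β and the data are arbitrary assumptions.

import numpy as np

def gaussian_kernel(u, v, beta=1.0):
    # K(u, v) = exp(-beta * ||u - v||_2^2), a psd kernel on R^d
    return np.exp(-beta * np.sum((u - v) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 2))                      # m = 6 input points in R^2
K = np.array([[gaussian_kernel(xi, xj) for xj in x] for xi in x])

eigvals = np.linalg.eigvalsh(K)                  # symmetric matrix -> real spectrum
print("lambda_min =", eigvals.min(), " lambda_max =", eigvals.max())
# All eigenvalues are >= 0 (up to rounding), as required for a psd kernel.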

Typical regularization techniques perturb the functional (Tikhonov regularization); restrict the minimization to a subset of the hypothesis space, called the hypothesis set (Ivanov regularization); replace the empirical error with another functional and use the former to define the hypothesis set (Phillips regularization); or combine the last two possibilities (Miller regularization). (See Table 1.) As Tikhonov (Tikhonov & Arsenin, 1977) developed a unifying formulation of regularization theory, sometimes all such techniques are improperly gathered as "Tikhonov regularization."

We use the following standard notations from optimization theory (see, e.g., Dontchev & Zolezzi, 1993). Given a function space X, a set M ⊆ X, and a functional Φ : M → R, we denote by (M, Φ) the problem inf_{f∈M} Φ( f ). Every f o ∈ M such that Φ( f o) = min_{f∈M} Φ( f ) is called a solution or a minimum point of the problem (M, Φ). We denote by

argmin(M, Φ) = { f ∈ M : Φ( f ) = min_{f∈M} Φ( f ) }

the set of solutions of (M, Φ). A sequence { fn} of elements of M is called Φ-minimizing over M if limn→+∞ Φ( fn) = inf_{f∈M} Φ( f ).

Tikhonov regularization in a strict sense (see row 1 of Table 1) involves the minimization of the functional Ez(·) + γ ‖ · ‖K^2 over the whole RKHS HK(X). Besides motivations based on SLT (Evgeniou et al., 2000), which we shall discuss later, this kind of regularization is motivated by the fact that the norms ‖ · ‖K on suitable RKHSs play the role of measures of various types of oscillations of input-output mappings (Girosi, 1998).


Table 1: Regularization Techniques for Learning from Data in RKHSs.

Row 1 (Tikhonov). Regularization parameter: γ > 0. Regularized functional: ΦT,γ(·) = Ez(·) + γ ‖ · ‖K^2. Hypothesis set: HK(X). Minimization problem Tγ: (HK(X), ΦT,γ). Solution: f oT,γ.

Row 2 (Ivanov). Regularization parameter: r > 0. Regularized functional: ΦI,r(·) = Ez(·). Hypothesis set: Br(‖ · ‖K). Minimization problem Ir: (Br(‖ · ‖K), ΦI,r). Solution: f oI,r.

Row 3 (Phillips). Regularization parameter: η ≥ 0. Regularized functional: ΦP,η(·) = ‖ · ‖K^2. Hypothesis set: Fz,η = { f ∈ HK(X) : Ez( f ) ≤ η^2 }. Minimization problem Pη: (Fz,η, ΦP,η). Solution: f oP,η.

Row 4 (Miller). Regularization parameters: r > 0, η ≥ 0. Regularized functional: ΦM,r,η(·) = Ez(·) + (η/r)^2 ‖ · ‖K^2. Hypothesis set: Gz,r,η = Br(‖ · ‖K) ∩ Fz,η. Minimization problem Mr,η: (Gz,r,η, ΦM,r,η). Solution: f oM,r,η.


Ivanov regularization (row 2 of Table 1) considers the minimization of the empirical error functional Ez on a closed ball Br(‖ · ‖K) of radius r in the RKHS HK(X). In Phillips regularization (row 3), the functionals Ez and ‖ · ‖K^2 exchange their roles: the minimization of Ez is replaced by the minimization of ‖ · ‖K^2, and, in turn, an upper bound on the values of Ez defines the hypothesis set. Finally, Miller regularization (row 4) combines the approaches of Ivanov and Phillips regularizations: both the empirical error functional and the squared-norm functional contribute to defining the regularized functional and the hypothesis set.

The four techniques consider regularization of the learning problem from different points of view; each has pros and cons from an optimization perspective. One advantage of Tikhonov regularization is that it merely requires unconstrained optimization algorithms, whereas the other three methods need algorithms for constrained optimization. However, the choice of the regularization parameter γ is not straightforward, even though various criteria have been proposed for such a choice (see, e.g., Bertero, 1989; Wahba, 1990; and Engl, Hanke, & Neubauer, 2000, for general inverse problems, and Cucker & Smale, 2002, and De Vito, Rosasco, Caponnetto, De Giovannini, & Odone, 2005, more specifically for learning problems). From a practical point of view, the regularization parameters r > 0 and η ≥ 0 used in the Ivanov, Phillips, and Miller regularization methods have a more direct interpretation: r and η play the role of upper bounds on the smoothness of the solution and the mean square error on the data, respectively. Their choices may also be guided by the equivalence of the four regularization techniques, stated in proposition 2. (We do not consider the case r = 0, as the corresponding hypothesis set consists of only the null function f = 0. The case η = 0 corresponds to a hypothesis set of functions that interpolate the data exactly.)

In the rest of this section, we summarize some results on existence, uniqueness, and properties of the solutions to the regularized problems. Then we shall give some original contributions (see propositions 1(iv), 3, and 4).

Proposition 1 investigates the existence and uniqueness of the solutions to problems Tγ, Ir, Pη, and Mr,η. Using standard notations (Engl et al., 2000; Kurkova, 2005), we denote by f †z the minimum-norm solution to the problem (HK(X), Ez) (also called the normal pseudosolution) and by Lx the operator Lx : HK(X) → R^m defined for every f ∈ HK(X) as

Lx( f ) = ( f (x1)/√m, . . . , f (xm)/√m)T, (2.2)

where the superscript T denotes transposition. For y = (y1, . . . , ym)T ∈ R^m, Py is the component of y in the range of the operator Lx and Qy, its orthogonal component. As in Table 1, we denote by Gz,r,η = Br(‖ · ‖K) ∩ { f ∈ HK(X) : Ez( f ) ≤ η^2} the hypothesis set of Miller regularization.

Proposition 1 (existence and uniqueness of the solutions). Let X be a nonempty set, K : X × X → R a psd kernel, m a positive integer, and z a data sample of size m. For every γ, r > 0 and η ≥ 0, the following hold:

i. Problem Tγ has a unique solution f oT,γ.

ii. Problem Ir has a solution f oI,r, which is unique if r ≤ ‖ f †z ‖K.

iii. Problem Pη has a solution f oP,η, which is unique if ‖Qy‖2^2/m ≤ η^2. For ‖y‖2^2/m ≤ η^2, one has f oP,η = 0.

iv. Problem Mr,η has a solution f oM,r,η if the set Gz,r,η is nonempty; the solution is unique if the set Gz,r/√2,η/√2 is nonempty too.

Proof. (i) A proof of proposition 1(i), together with an explicit expression for f oT,γ, is given in Engl et al. (2000).

(ii, iii) These follow by Bertero (1989).

(iv) As our formulation of Miller regularization slightly differs from the original one in Miller (1970), we provide the proof of the existence of f oM,r,η when the set Gz,r,η is nonempty. For the proof of existence and uniqueness in the case in which Gz,r/√2,η/√2 (⊂ Gz,r,η) is nonempty, one can apply the arguments of Miller (1970).

Let { fn} be a minimizing sequence for the problem (Gz,r,η, ΦM,r,η). By definition, for every positive integer n, one has Ez( fn) ≤ η^2 and fn ∈ Br(‖ · ‖K), so { fn} is a bounded sequence. Thus, there exist a subsequence { fnh} and an element f ∗ ∈ HK(X) such that { fnh} ⇀ f ∗ as h → +∞, where "⇀" denotes weak convergence. As, by definition, in an RKHS all evaluation functionals are bounded, this implies Lx( fnh) → Lx( f ∗) as h → +∞, so Ez( f ∗) ≤ η^2. By the weak lower semicontinuity of the norm in a Hilbert space (Rauch, 1991), it follows that ‖ f ∗‖K ≤ lim infh→+∞ ‖ fnh‖K ≤ r, so f ∗ ∈ Br(‖ · ‖K). Moreover, since Lx( fnh) → Lx( f ∗) and ‖ f ∗‖K ≤ lim infh→+∞ ‖ fnh‖K, one has ΦM,r,η( f ∗) ≤ lim infh→+∞ ΦM,r,η( fnh) = inf_{f∈Gz,r,η} ΦM,r,η( f ). Then f ∗ ∈ Gz,r,η, and it is a solution to problem Mr,η.

For ease of reference, we collect in the following assumption the conditions that guarantee the existence and uniqueness of the solutions to the four regularized problems. We further assume ‖ f †z ‖K > 0, as is usually the case (otherwise, yi = 0 for i = 1, . . . , m):

Assumption 1. Let the following hold:

Problem Tγ: γ > 0.
Problem Ir: 0 < r ≤ ‖ f †z ‖K.
Problem Pη: 0 ≤ ‖Qy‖2^2/m ≤ η^2.
Problem Mr,η: r > 0 and η ≥ 0 such that Gz,r/√2,η/√2 is nonempty.


Under assumption 1, Tikhonov, Ivanov, Phillips, and Miller regularizations are equivalent in the sense specified by proposition 2:

Proposition 2 (equivalent regularizations). Let X be a nonempty set, K : X × X → R a psd kernel, m a positive integer, and z a data sample of size m. Then, under assumption 1, the solutions to problems Tγ, Ir, Pη, and Mr,η are equivalent in the following sense:

i. f oT,γ ∈ argmin(HK(X), ΦT,γ) ⇒ f oT,γ ∈ argmin(Br(γ)(‖ · ‖K), ΦI,r(γ)), where r(γ) = ‖ f oT,γ ‖K.

ii. f oT,γ ∈ argmin(HK(X), ΦT,γ) ⇒ f oT,γ ∈ argmin(Fz,η(γ), ΦP,η(γ)), where η(γ) = √(Ez( f oT,γ )).

iii. f oT,γ ∈ argmin(HK(X), ΦT,γ) ⇒ f oT,γ ∈ argmin(Gz,r,η, ΦM,r,η), where γ = (η/r)^2, η ≥ √(2 Ez( f oT,γ )), and r ≥ √2 ‖ f oT,γ ‖K.

For γ ∈ (0, +∞), the functions r(γ) and η(γ) in items i and ii are one-to-one and monotonic, with ranges (0, ‖ f †z ‖K) and (‖Qy‖2/√m, ‖y‖2/√m), respectively.

Proof. See Bertero (1989), Vasin (1970), or Mukherjee, Rifkin, and Poggio (2002) (the last two references consider only cases i and ii). The last statement, about the properties of r(γ) and η(γ), follows from the expressions of ‖ f oT,γ ‖K and Ez( f oT,γ ) given in Bertero (1989).

For the case of Tikhonov regularization, the next theorem is known as the representer theorem. It was originally proven in Wahba (1990), later in Cucker and Smale (2002) using Fréchet derivatives, and then in Kurkova (2004, 2005) via properties of inverse problems and the operator Lx (see equation 2.2).

We denote by K[x] the Gram matrix of the data with respect to x, that is, the m × m matrix whose elements are K[x]i,j = K(xi, xj).

Theorem 1 (representer theorem). Let X be a nonempty set, K : X × X → R a psd kernel, m a positive integer, and z a data sample of size m. Then, under assumption 1, the optimal solutions to problems Tγ, Ir, Pη, and Mr,η are of the form

f αz (·) = ∑_{i=1}^m cα,i Kxi(·), (2.3)

where, for every m, every K[x], and every y, the values of α > 0 corresponding to f oT,γ, f oI,r, f oP,η, and f oM,r,η are functions of the respective parameters γ, r, η, and (r, η). The vector cα = (cα,1, . . . , cα,m)T is the unique solution to the well-posed linear system

(α m I + K[x]) cα = y. (2.4)


For problem Tγ, one has α(γ) = γ, and for problem Pη, the case ‖y‖2^2/m ≤ η^2 (see proposition 1(iii)) is covered by letting α → +∞.

Proof. By the properties of continuous linear operators (Bertero, 1989; Groetch, 1977), for every u ∈ R^m, the adjoint operator L∗x : R^m → HK(X) is given by L∗x(u) = (1/√m) ∑_{i=1}^m ui Kxi (see also Kurkova, 2005). So for the case of Tikhonov regularization, equations 2.3 and 2.4 follow from Bertero (1989), adapting to L∗x the expression for f αz given in equation 226 there. The other cases follow by the equivalence of the four regularizations stated in proposition 2.
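
A minimal computational sketch of theorem 1 for Tikhonov regularization (α = γ): solve the linear system 2.4 and evaluate the resulting f αz via equation 2.3. The Gaussian kernel, the value of γ, and the toy data below are illustrative assumptions, not choices made in the letter.

import numpy as np

def gaussian_kernel(u, v, beta=1.0):
    return np.exp(-beta * np.sum((u - v) ** 2))

rng = np.random.default_rng(1)
m = 20
x = rng.uniform(-1, 1, size=(m, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=m)   # noisy toy targets

K = np.array([[gaussian_kernel(xi, xj) for xj in x] for xi in x])  # Gram matrix K[x]
gamma = 0.1
c = np.linalg.solve(gamma * m * np.eye(m) + K, y)    # equation 2.4 with alpha = gamma

def f(t):
    # f_z^alpha(t) = sum_i c_i K(x_i, t), equation 2.3
    return sum(c[i] * gaussian_kernel(x[i], t) for i in range(m))

print("fitted value at 0.3:", f(np.array([0.3])))
print("empirical error E_z(f):",
      np.mean([(f(x[i]) - y[i]) ** 2 for i in range(m)]))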

Despite the equivalence, in the sense stated by proposition 2, among Tikhonov, Ivanov, Phillips, and Miller regularizations, the values of the parameters guaranteeing such equivalence are not known a priori. For example, the function r(γ) in proposition 2 depends on the solution f oT,γ. In the remainder of the letter, in general we shall not make the assumption that r(γ), η(γ), and their inverses, γ(r) and γ(η), are known, and we shall exploit lower and upper bounds on their values. The following proposition provides such bounds on γ(r):

Proposition 3. Let X be a nonempty set, K : X × X → R a psd kernel, m a positive integer, z a sample of data of size m, 0 < r ≤ ‖ f †z ‖K, and λmin, λmax the minimum and maximum eigenvalues of K[x], respectively. Then the following hold:

i. (√λmin ‖y‖2 − λmax r)/(m r) ≤ γ(r) ≤ (√λmax ‖y‖2 − λmin r)/(m r).

ii. min_{λ∈[λmin,λmax]} (√λ ‖y‖2 − λ r)/(m r) ≤ γ(r) ≤ max_{λ∈[λmin,λmax]} (√λ ‖y‖2 − λ r)/(m r).

iii. Let λ∗ = ‖y‖2^2/(4 r^2). If λ∗ ∈ [λmin, λmax], then min_{λ∈{λmin,λmax}} (√λ ‖y‖2 − λ r)/(m r) ≤ γ(r) ≤ ‖y‖2^2/(4 m r^2); otherwise, min_{λ∈{λmin,λmax}} (√λ ‖y‖2 − λ r)/(m r) ≤ γ(r) ≤ max_{λ∈{λmin,λmax}} (√λ ‖y‖2 − λ r)/(m r).

Proof. (i) For 0 < r ≤ ‖ f †z ‖K, by proposition 2(i), one has f oT,γ(r) = f oI,r and ‖ f oT,γ(r) ‖K = r. By theorem 1, we get f oT,γ(r) = ∑_{i=1}^m ci Kxi, where

c = (c1, . . . , cm)T = (γ(r) m I + K[x])^{−1} y,

and by the reproducing property 2.1, ‖ f oT,γ(r) ‖K = √(cT K[x] c). Then the statement follows by

r = ‖ f oT,γ(r) ‖K = √(cT K[x] c) ≤ √λmax ‖c‖2 ≤ (√λmax/(γ(r) m + λmin)) ‖y‖2, (2.5)

and, similarly,

r ≥ √λmin ‖c‖2 ≥ (√λmin/(γ(r) m + λmax)) ‖y‖2. (2.6)

(ii) Proceeding as in the proof of case i, since the matrices K[x] and (γ(r) m I + K[x])^{−1} have the same eigenvectors, a slightly refined analysis gives

r = √(cT K[x] c) = √(yT (γ(r) m I + K[x])^{−T} K[x] (γ(r) m I + K[x])^{−1} y) ≤ max_{λ∈[λmin,λmax]} (√λ/(γ(r) m + λ)) ‖y‖2 (2.7)

and

r ≥ min_{λ∈[λmin,λmax]} (√λ/(γ(r) m + λ)) ‖y‖2. (2.8)

We conclude by rewriting the last two inequalities in terms of γ(r).

(iii) The estimates follow from case ii by showing that the function f : [0, +∞) → R defined as f(λ) = √λ ‖y‖2 − λ r is strictly increasing in [0, λ∗], strictly decreasing in [λ∗, +∞), and such that f(0) = 0.

Note that the first bound in proposition 3(i) is meaningful only when 0 < r < √λmin ‖y‖2/λmax. The first bound in the proposition is slightly better, but still not meaningful if λmin = 0. However, there exist cases in which λmin can be bounded from below away from 0. For example, for the gaussian kernel (defined for every β > 0 and (u, v) ∈ R^d × R^d as K(u, v) = e^{−β ‖u−v‖2^2}), if the data samples have sufficiently separated input values x1, . . . , xm, then by a straightforward application of the Gershgorin circle theorem (Serre, 2002), all the eigenvalues of the Gram matrix K[x] are concentrated around 1. For more general kernels, other lower bounds on λmin (whose proofs are based on the Gershgorin circle theorem too) are given in Honeine, Richard, and Bermudez (2007).

When λmin = 0, the following proposition improves proposition 3. We denote by λ+min the minimum positive eigenvalue of the Gram matrix K[x]:

Proposition 4. Under the same assumptions of proposition 3, the same estimates given there hold if every instance of λmin and y is replaced by λ+min and Py, respectively.

Proof. When λmin > 0, proposition 3 applies with no changes, as in this case λ+min = λmin and Py = y. When λmin = 0, let us decompose c and y as c = cN + cR and y = yN + yR, respectively, where cN and yN are the components of c and y in the null space of K[x] and cR and yR are the respective orthogonal components in the range of K[x]. As K[x] and (γ(r) m I + K[x])^{−1} have the same eigenvectors, we get cN = (γ(r) m I + K[x])^{−1} yN and cR = (γ(r) m I + K[x])^{−1} yR. By basic results of spectral theory, equation 2.5 becomes

r = ‖ f oT,γ(r) ‖K = √(cT K[x] c) = √((cN)T K[x] cN + (cR)T K[x] cR) = √((cR)T K[x] cR) ≤ √λmax ‖cR‖2 ≤ (√λmax/(γ(r) m + λ+min)) ‖y‖2. (2.9)

One can proceed similarly for the other bounds. We conclude the proof noting that, by proposition 3.1(iv) in Kurkova (2004), the operator Lx L∗x : R^m → R^m is represented in the canonical basis of R^m by the matrix (1/m) K[x], from which it follows that Py = yR.

The next proposition illustrates the relationships among the ‖ · ‖K-norms of the solutions to problems Ir, Pη, and Mr,η, the corresponding minimum values of the empirical error functional, and the values of the parameters γ, r, and η.

Proposition 5. Let X be a nonempty set, K : X × X → R a psd kernel, m a positive integer, and z a data sample of size m. Then, under assumption 1, the following hold:

i. ‖ f oP,η ‖K ≤ ‖ f oM,r,η ‖K ≤ ‖ f oI,r ‖K.

ii. Ez( f oI,r ) ≤ Ez( f oM,r,η ) ≤ Ez( f oP,η ).

iii. γ(r) ≤ (η/r)^2 ≤ γ(η),

where γ(r) and γ(η) are the inverses of the functions r(γ) and η(γ), respectively, given in proposition 2.

Proof. Items i to iii, which were proven in Bertero (1989), follow by proposition 2.

Typically the norm ‖ · ‖K in an RKHS is a measure of the smoothness of its functions (Girosi, 1998). So proposition 5 has the following interpretation. According to item i, for fixed values of the parameters γ, r, and η, the solution to the learning problem obtained by Phillips regularization is smoother than the solution obtained by Miller regularization, which in turn is smoother than the solution to the learning problem derived by Ivanov regularization. However, items i and ii imply a trade-off between smoothness and the value of the empirical error functional (i.e., fitting to empirical data); roughly speaking, the smoother the solution, the larger the empirical error. So the solution to problem Mr,η has a degree of smoothness and a value of the empirical error that are intermediate between the corresponding values for the solutions to problems Pη and Ir.

3 Optimal Versus Suboptimal Solutions

For a linear space X and G ⊂ X, we denote by spann G the set of linear combinations of all n-tuples of elements of G:

spann G = { ∑_{i=1}^n wi gi : wi ∈ R, gi ∈ G }.

By theorem 1, the solution f oT,γ to problem Tγ over the whole RKHS HK(X) is a function in span GKx, where

GKx = {K(xi, ·) : i = 1, . . . , m}.

As the cardinality of GKx is equal to m, we have span GKx = spanm GKx, which can be interpreted as the set of all input-output functions of a computational model with one hidden layer of m computational units computing functions in GKx. By proposition 2, under assumption 1, the solutions to the other three regularization methods can be obtained from instances of problem Tγ with suitable choices of the regularization parameter γ. So the solutions to problems Ir, Pη, and Mr,η are functions in spanm GKx too. In particular, for the gaussian kernel, the solution given by theorem 1 can be computed by a gaussian radial basis function network (Girosi, 1994) with m hidden units. However, for large values of m, such a network might not be practically implementable, and even when it is, one may wish to use a suboptimal but simpler network structure.

Although finding the optimal linear combination in spanm GKx merely requires solving the linear system of equations 2.4, practical applications are limited by the rates of convergence of iterative methods solving such linear systems. These rates depend on the size of the condition number of the corresponding matrices (see the discussion in Kurkova & Sanguineti, 2005b). Recall that the condition number of a nonsingular m × m matrix A with respect to a norm ‖ · ‖ on R^m is defined as

cond(A) = ‖A‖ ‖A^{−1}‖,

where ‖A‖ denotes the norm of A as a linear operator on (R^m, ‖ · ‖). For every symmetric nonsingular m × m matrix A, cond2(A) = |λmax(A)|/|λmin(A)|, where cond2(A) denotes the condition number of A with respect to the ‖ · ‖2-norm on R^m (Ortega, 1990).


In the discussion of this paragraph, to fix ideas, we consider problem Tγ, and so in equation 2.4 we let α = γ. However, by proposition 2, similar remarks hold for the other three regularization techniques. By simple algebraic manipulations and spectral theory arguments, for the condition numbers of the matrices involved in solving the linear system 2.4, we get the following estimates (we assume for simplicity that K[x] is positive definite, so that λmin ≠ 0):

cond2(γ m I + K[x]) = (γ m + λmax)/(γ m + λmin) ≤ λmax/λmin = cond2(K[x]), (3.1)

cond2(γ m I + K[x]) = γ m/(γ m + λmin) + λmax/(γ m + λmin) ≤ 1 + λmax/(γ m). (3.2)

When cond2(K[x]) is sufficiently small, for every γ > 0, inequality 3.1 guarantees good conditioning of the matrix. However, for large values of the size m of the data sample, the matrix K[x] may be ill conditioned. On the other hand, since limγ→+∞ (1 + λmax/(γ m)) = 1, inequality 3.2 guarantees that the regularization parameter γ can be chosen such that cond2(γ m I + K[x]) is arbitrarily close to 1. Unfortunately, good conditioning of the matrices is not the only requirement for γ, as its value must also allow a good fit to the empirical data and thus cannot be too large (an optimal choice of the regularization parameter γ is a compromise between these two opposing requirements).

In the following, we consider suboptimal solutions to problems Tγ, Ir, Pη, and Mr,η, formed by linear combinations of n < m of the functions Kx1, . . . , Kxm centered at the data points x1, . . . , xm, that is, by elements of spann GKx. So these suboptimal solutions have the structure of widespread connectionistic models, such as one-hidden-layer feedforward neural networks, and are sparse when n ≪ m. More generally, one can consider suboptimal solutions in spann GK, where GK = {Kx : x ∈ X} (in this case, the kernel functions Kx are not necessarily centered at the input data points). The following are some motivations to search for a sparse suboptimal solution f s instead of an optimal (usually nonsparse) one f o:

- Storing its coefficients requires less memory.
- The computational effort required to find f s may be much smaller than for f o (this happens, e.g., for greedy algorithms; see the discussion in section 7 and Zhang, 2002a, 2002b, 2003, for Tikhonov regularization, and the sketch following this list).
- Once f s has been found, computing f s(x) for x not belonging to the data sample is easier than computing f o(x).
- A sparse model has better interpretability than a nonsparse one (it is a simpler model, which depends on fewer parameters; Zou & Hastie, 2005).
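
Purely as an illustration of what a sparse suboptimal solution in spann GKx may look like (and not one of the algorithms analyzed in the works cited above), the following sketch greedily selects n < m kernel centers for the Tikhonov functional ΦT,γ, refitting the coefficients after each selection; the kernel, data, γ, and n are arbitrary assumptions.

import numpy as np

def tikhonov_value(K, y, idx, gamma, m):
    # Restricted to the centers in idx, f = sum_{i in idx} c_i K_{x_i};
    # minimizing E_z(f) + gamma*||f||_K^2 over c solves a small linear system.
    Ks = K[np.ix_(idx, idx)]                      # ||f||_K^2 = c^T Ks c
    Km = K[:, idx]                                # f(x_j) = (Km c)_j
    A = Km.T @ Km / m + gamma * Ks
    b = Km.T @ y / m
    c = np.linalg.solve(A + 1e-12 * np.eye(len(idx)), b)   # tiny jitter for stability
    value = np.mean((Km @ c - y) ** 2) + gamma * c @ Ks @ c
    return value, c

rng = np.random.default_rng(3)
m, n, gamma = 40, 5, 0.05
x = rng.uniform(-1, 1, size=m)
y = np.sign(x) + 0.1 * rng.normal(size=m)
K = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)

selected = []
for _ in range(n):                                # n << m greedy steps
    best = min((tikhonov_value(K, y, selected + [j], gamma, m)[0], j)
               for j in range(m) if j not in selected)
    selected.append(best[1])

value, c = tikhonov_value(K, y, selected, gamma, m)
print("selected centers:", sorted(selected), " Phi_T,gamma value:", value)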


The next proposition states a simple but interesting property of approximate solutions from span GKx.

Proposition 6. Let K be a psd kernel and f ∈ span GKx. Then Lx( f ) = 0 if and only if f = 0.

Proof. As f ∈ span GKx, there exist c1, . . . , cm ∈ R such that f = ∑_{i=1}^m ci Kxi. By the reproducing property 2.1, f(xi) = (K[x]c)i and ‖ f ‖K^2 = cT K[x] c. So for f ∈ span GKx, f(xi) = 0 for all i ∈ {1, . . . , m} is equivalent to ‖ f ‖K^2 = 0.

Proposition 6 implies that suboptimal solutions in spann GKx have no components in the null space of the operator Lx; the same occurs (under the conditions of existence and uniqueness given by assumption 1) for the corresponding optimal solutions and the normal pseudosolution f †z. The null space of Lx represents a component of the data generation model that cannot be observed directly from the data (this motivates the terminology invisible component; Bertero, 1989), so it cannot be recovered from the data without additional a priori information. Note also that by the reproducing property 2.1, the equality (cN)T K[x] cN = 0 in the proof of proposition 4 implies ∑_{i=1}^m cNi Kxi = 0. This is in accordance with proposition 6.

In sections 4 and 5, we estimate for Tikhonov and Ivanov regularizations the rates of approximate optimization achievable by sparse suboptimal solutions in spann GKx with n < m (without loss of generality, we let n < m, as for n ≥ m the optimal solutions belong to spann GKx). Phillips and Miller regularizations will be considered in section 6, where we also briefly address possible improvements allowed by the use of free internal parameters, that is, when suboptimal solutions are from spann GK, n < m. For gaussian kernel units, the first situation corresponds to gaussians centered at n < m data points, chosen among the sample input data {x1, . . . , xm}. The second case corresponds to gaussians centered at n < m points, to be chosen in the entire input set X.

The estimates will be expressed in terms of a norm called G-variation, denoted by ‖ · ‖G and defined for a subset G of a normed linear space (X, ‖ · ‖X) as the Minkowski functional of the set cl conv(G ∪ −G) (Kurkova, 1997) (the closure is with respect to the norm ‖ · ‖X of X), that is, ‖ f ‖G = inf{c > 0 : f/c ∈ cl conv(G ∪ −G)}. G-variation is a generalization of the concept of total variation from integration theory (Kolmogorov & Fomin, 1975). (For properties of G-variation, see Kurkova, 1997; Kurkova & Sanguineti, 2001, 2002.)

We also use the concept of modulus of continuity. Let S ⊆ X and Φ : S → R a continuous functional. For f ∈ S, the function αf : [0, +∞) → [0, +∞), defined as

αf(t) = sup{|Φ( f ) − Φ(g)| : g ∈ S, ‖ f − g‖X ≤ t},


is called the modulus of continuity of Φ at f. By definition, αf is a nondecreasing function. By αT,γ and αI,r we denote the moduli of continuity of ΦT,γ and ΦI,r, respectively, computed at the corresponding minimizers.

For ε > 0, fε ∈ M is an ε-near minimum point of the problem (M, Φ) if Φ( fε) ≤ inf_{f∈M} Φ( f ) + ε.

4 Estimates for Tikhonov-Regularized Learning

For Tikhonov regularization, the following theorem improves the estimates of the accuracy of suboptimal solutions derived in Kurkova and Sanguineti (2005b). Using the same notation as in Makovoz (1996), we let

ξn(GKx) = inf{ξ > 0 : GKx can be covered by at most n sets of diameter ≤ ξ}.

Theorem 2. Let X be a nonempty set, K : X × X → R a psd kernel with sK = supx∈X √K(x, x) < +∞, m a positive integer, z a data sample of size m, and ΔT,γ,n,z = 2 (ξ⌈n/2⌉(GKx) ‖ f oT,γ ‖GKx)^2. Then for every positive integer n < m, the following hold.

i. Let αT,γ(t) be the modulus of continuity of ΦT,γ at f oT,γ and

dT,γ,n,z = inf_{g∈spann GKx} ‖ f oT,γ − g ‖K.

Then

inf_{f∈spann GKx} ΦT,γ( f ) − ΦT,γ( f oT,γ ) ≤ αT,γ(dT,γ,n,z), (4.1)

where αT,γ(t) ≤ (sK^2 + γ) t^2 and dT,γ,n,z ≤ √(ΔT,γ,n,z/n).

ii. Let εn > 0 and fn an εn-near minimum point of (spann GKx, ΦT,γ). Then

‖ fn − f oT,γ ‖K^2 ≤ ((sK^2 + γ) ΔT,γ,n,z/n + εn)/γ. (4.2)

Proof. (i) The inequality 4.1 follows by the definition of the modulus of continuity. To prove the upper bound on αT,γ, we proceed as follows. The first-order term in the Taylor expansion, centered at f oT,γ, of the quadratic functional ΦT,γ vanishes by the first-order optimality condition for unconstrained optimization. So ΦT,γ( f ) = ΦT,γ( f oT,γ ) + (1/m) ∑_{i=1}^m 〈Kxi, f − f oT,γ〉K^2 + γ ‖ f − f oT,γ ‖K^2. Then for ‖ f − f oT,γ ‖K ≤ t, we get

|ΦT,γ( f ) − ΦT,γ( f oT,γ )| ≤ (1/m) ∑_{i=1}^m ‖Kxi‖K^2 ‖ f − f oT,γ ‖K^2 + γ ‖ f − f oT,γ ‖K^2 ≤ (sK^2 + γ) ‖ f − f oT,γ ‖K^2 ≤ (sK^2 + γ) t^2.

The upper bound on dT,γ,n,z and the expression of ΔT,γ,n,z follow by inspection of the proof of Makovoz's improvement over Maurey-Jones-Barron's lemma, given in theorem 1 in Makovoz (1996), restated in terms of G-variation.

(ii) By the definition of fn and Kurkova and Sanguineti (2005b, proposition 4.1), we have

‖ fn − f oT,γ ‖K^2 ≤ (αT,γ(dT,γ,n,z) + εn)/γ,

which, together with case i, implies the estimate, equation 4.2.

Theorem 2 gives the upper bound

inf_{f∈spann GKx} ΦT,γ( f ) − ΦT,γ( f oT,γ ) ≤ 2 (sK^2 + γ) (ξ⌈n/2⌉(GKx) ‖ f oT,γ ‖GKx)^2/n, (4.3)

which improves the estimate

inf_{f∈spann GKx} ΦT,γ( f ) − ΦT,γ( f oT,γ ) ≤ A/√n + B/n, (4.4)

from Kurkova and Sanguineti (2005b), where A and B are positive constants defined there. The improvement is twofold. The first-order optimality condition exploited in the proof of theorem 2(i) allows improving the estimate of the modulus of continuity of ΦT,γ at f oT,γ (the linear term in t, which determines the term A/√n in equation 4.4, is removed). Moreover, the term ξ⌈n/2⌉(GKx), which by definition is a nonincreasing function of n, may provide an overall rate better than 1/n in equation 4.3. For instance, if for some l < m/2 the elements of GKx are well approximated by the centroids of l clusters, then the term ξl(GKx) is small, and so the upper bound on inf_{f∈span2l GKx} ΦT,γ( f ) − ΦT,γ( f oT,γ ) provided by theorem 2(i) may be much smaller than (1/n) (2 (sK^2 + γ) ‖ f oT,γ ‖GKx^2). This case is of particular interest when there are many similar samples and a small number of significant clusters; of course, this depends on the properties of the probability measure ρ.

Remark 1. Let ΔT,γ = sK^2 ‖ f oT,γ ‖GKx^2 − ‖ f oT,γ ‖K^2. By proposition 5.3(iii) of Kurkova and Sanguineti (2005b), ΔT,γ ≤ (sK^2 m − λmin) ‖y‖2^2/(γ m + λmin)^2, where λmin is the minimum eigenvalue of K[x]. Exploiting Maurey-Jones-Barron's lemma (see Kurkova & Sanguineti, 2005a) instead of its improvement by theorem 1 of Makovoz (1996), used in the proof of theorem 2(i), we get the worse but simpler upper bound dT,γ,n,z ≤ √(ΔT,γ/n).

Remark 2. Bounds with a similar structure as equation 4.3 were derived in Zhang (2002a, 2002b, 2003) for Tikhonov regularization and, more generally, for convex optimization problems whose domains are convex hulls of certain sets of functions. The following are some differences between the estimate, equation 4.3, and such results.

- The estimates derived in Zhang (2002a, 2002b, 2003) are of order 1/n and have a constructive nature, in the sense that algorithms are described to achieve them (for some particular cases; in Zhang, 2002a, geometric upper bounds are also derived; however, as remarked in Zhang, 2002a, they depend on quantities that may be ill behaved).
- Our bounds are nonconstructive, but in various cases they are better than those given in Zhang (2002a, 2002b, 2003). This happens thanks to the term ξ⌈n/2⌉(GKx) in equation 4.3, which takes into account properties of the set GKx. To fix ideas, let us consider the particular case of a Gram matrix K[x] with only l < m/2 distinct columns (i.e., there are m − l repeated samples). Then equation 4.3 with n = 2l gives²

inf_{f∈span2l GKx} ΦT,γ( f ) − ΦT,γ( f oT,γ ) ≤ 0. (4.5)

So inf_{f∈span2l GKx} ΦT,γ( f ) = ΦT,γ( f oT,γ ), whereas corollary 1 in Zhang (2002a) provides

inf_{f∈span2l GKx} ΦT,γ( f ) − ΦT,γ( f oT,γ ) ≤ ΦT,γ(0) (M^2 + 2γ)^2/((2γ)^2 · 2l + (M^2 + 2γ)^2), (4.6)

where M = maxi ‖Kxi‖K.
- Differently from Zhang (2002a, 2002b, 2003), theorem 2(ii) also gives an upper bound on the quantity ‖ fn − f oT,γ ‖K^2, where fn is any εn-near minimum point of (spann GKx, ΦT,γ).
- The upper bounds of order 1/n provided in Zhang (2002a, 2002b, 2003) can be applied to the case of Tikhonov regularization but not to the cases of Ivanov, Miller, and Phillips regularizations. For instance, lemma 3.1 in Zhang (2002b) requires that the domain of the functional Φ to be minimized be the convex hull conv(S) of a set S of functions. In order to apply this lemma to Tikhonov regularization, one considers the functional ΦT,γ on conv(S), where S = {‖ f oT,γ ‖GKx GKx ∪ −‖ f oT,γ ‖GKx GKx}. For Ivanov regularization, one should consider the functional ΦI,r on conv(S) ∩ Br(‖ · ‖K), where S = {‖ f oI,r ‖GKx GKx ∪ −‖ f oI,r ‖GKx GKx}. However, in general conv(S) ⊈ Br(‖ · ‖K), whereas the proof of Zhang's lemma 3.1 works when the whole conv(S) is contained in the domain of the functional to be minimized.

²A refinement in the proof technique of Makovoz (1996, theorem 1) is required in order to replace 2l by l in equation 4.5.

We conclude this section by showing that under mild conditions, our approach provides SLT bounds that control, with high probability and uniformly over the family of functions within which we search for suboptimal solutions, the difference between the expected and the empirical errors (no estimates of this kind were given in Kurkova and Sanguineti, 2005b, or Zhang, 2002a, 2002b, 2003).

Suppose that an a priori upper bound θ > 0 is available on the absolute value of the output y. Then ΦT,γ(0) = Ez(0) = (1/m) ∑_{i=1}^m yi^2 ≤ θ^2. So, as ΦT,γ is the sum of two nonnegative functionals, both f oT,γ and any reasonable suboptimal solution fn ∈ spann GKx are such that ‖ f oT,γ ‖K ≤ θ/√γ and ‖ fn ‖K ≤ θ/√γ (it is not reasonable to consider a suboptimal solution fn ∈ spann GKx with ‖ fn ‖K > θ/√γ, because in such a case one would have ΦT,γ( fn ) ≥ ΦT,γ(0), so 0 ∈ spann GKx would be a better suboptimal solution). Thus, if X is compact and the kernel K is continuous, theorem 7 in the appendix (with R1 = θ/√γ and R2 = θ max{1/sK, 1/√γ} = νT(γ)) gives, for every γ > 0, every m ∈ N, and every τ ∈ (0, 1),

P{ sup_{f∈Bθ/√γ(‖·‖K)} (E( f ) − Ez( f )) ≤ 16 sK^2 νT(γ) (θ/√γ + νT(γ))/√m + 4 (sK νT(γ))^2 √(ln(2/τ)/(2m)) } ≥ 1 − τ, (4.7)

where P denotes the a priori probability with respect to the random draw of the i.i.d. data sample z. Let Pz,n f oT,γ be any best approximation of f oT,γ on spann GKx (such a best approximation exists, since spann GKx is a finite union of convex and closed nonempty sets in a Hilbert space, and for each of them there is a best approximation of f oT,γ; Limaye, 1996, theorem 3.5). By definition, ‖ f oT,γ − Pz,n f oT,γ ‖K = dT,γ,n,z, and by geometric properties of the set spann GKx, one has ‖Pz,n f oT,γ ‖K ≤ ‖ f oT,γ ‖K ≤ θ/√γ. This shows that the bound 4.3 and the estimates from theorem 2 still hold, together with equation 4.7, if one replaces the constraint f ∈ spann GKx with f ∈ spann GKx ∩ Bθ/√γ(‖ · ‖K) (the only change required in the proof of theorem 2(i) consists in exploiting the equality ‖ f oT,γ − Pz,n f oT,γ ‖K = dT,γ,n,z).

Remark 3. The discussion above (which, for simplicity, has been limited to a fixed value of γ) can be improved in two ways. First, one can apply the structural risk minimization principle (Vapnik, 1998) in a form similar to the one in Shawe-Taylor, Bartlett, Williamson, and Anthony (1998, theorem 2.1), restated for regression problems and in terms of Rademacher complexity instead of the VC dimension. In such a way, by fixing two sequences {γj > 0} and {aj > 0} such that ∑_{j=1}^{+∞} aj = 1 and applying the union-bound technique, instead of equation 4.7 one gets³

P{ ∀ j ∈ N ∀ f ∈ Bθ/√γj(‖ · ‖K) : (E( f ) − Ez( f )) ≤ 16 sK^2 νT(γj) (θ/√γj + νT(γj))/√m + 4 (sK νT(γj))^2 √(ln(2/(aj τ))/(2m)) } ≥ 1 − τ. (4.8)

This suggests choosing (a posteriori) a value of γj in the sequence which minimizes an upper bound on

inf_{f∈spann GKx ∩ Bθ/√γj(‖·‖K)} Ez( f ) + 16 sK^2 νT(γj) (θ/√γj + νT(γj))/√m + 4 (sK νT(γj))^2 √(ln(2/(aj τ))/(2m)),

or (taking into account that Ez( f ) ≤ ΦT,γj( f )) on

inf_{f∈spann GKx ∩ Bθ/√γj(‖·‖K)} ΦT,γj( f ) + 16 sK^2 νT(γj) (θ/√γj + νT(γj))/√m + 4 (sK νT(γj))^2 √(ln(2/(aj τ))/(2m)).

Second, one can improve the analysis starting from a data-dependent SLT bound like theorem 6 instead of the data-independent one provided by theorem 7, which does not take into account any information available from the data (e.g., in the proof of theorem 7, we have used the data-independent bound ∫X K(t, t) dρX(t) ≤ sK^2, but a better estimate may be available from the data).

³Suppose that the sequence {γj} is decreasing and limj→+∞ γj = 0. Then for every f ∈ HK(X), one can apply the bound inside the braces of equation 4.8, taking γ equal to the largest γj in the sequence such that f ∈ Bθ/√γj(‖ · ‖K). Extensions of equation 4.8, holding uniformly for all values of γ in a given interval, may be obtained by following the technique used in Bousquet and Elisseeff (2002, proof of theorem 18).


5 Estimates for Ivanov-Regularized Learning

For Ivanov regularization, estimates of the accuracy of suboptimal solutions are given in the following theorem (recall that Br(‖ · ‖K) is the hypothesis set in Ivanov regularization).

Theorem 3. Let X be a nonempty set, K : X × X → R a psd kernel with sK = supx∈X √K(x, x) < +∞, m a positive integer, z a data sample of size m, |y|max = max{|yi| : i = 1, . . . , m}, 0 < r ≤ ‖ f †z ‖K, and ΔI,r,n,z = 2 (ξ⌈n/2⌉(GKx) ‖ f oI,r ‖GKx)^2. Then for every positive integer n < m, the following hold:

i. Let αI,r(t) be the modulus of continuity of ΦI,r at f oI,r and

dI,r,n,z = inf_{g∈spann GKx} ‖ f oI,r − g ‖K.

Then

inf_{f∈Br(‖·‖K)∩spann GKx} ΦI,r( f ) − ΦI,r( f oI,r ) ≤ αI,r(dI,r,n,z), (5.1)

where αI,r(t) ≤ a1 t + a2 t^2, a1 = 2 (r sK^2 + |y|max sK), a2 = sK^2, and dI,r,n,z ≤ √(ΔI,r,n,z/n).

ii. If one also has r ≤ √λmin ‖y‖2/(λmax − λmin), where λmin and λmax are the minimum and maximum eigenvalues of K[x] (resp.), then ‖ f oI,r ‖GKx^2 ≤ (m ‖y‖2^2/[√λmin ‖y‖2 − r (λmax − λmin)]^2) r^2.

iii. Let εn > 0 and fn be any εn-near minimum point of (Br(‖ · ‖K) ∩ spann GKx, ΦI,r). Then

‖ fn − f oI,r ‖K^2 ≤ (1/γ(r)) (αI,r(√(ΔI,r,n,z/n)) + εn) + (‖ fn ‖K^2 − r^2).

iv. For every n < m, let {εn,k} be a sequence of positive reals indexed by k, such that limk→+∞ εn,k = 0, and { fn,k} a sequence of εn,k-near minimum points of (Br(‖ · ‖K) ∩ spann GKx, ΦI,r). Then

lim supk→+∞ ‖ fn,k − f oI,r ‖K^2 ≤ αI,r(√(ΔI,r,n,z/n)) · m r/(√λmin ‖y‖2 − r λmax).

Proof. (i) Let Pz,n f oI,r be any best approximation of f oI,r on spann GKx. Then ‖ f oI,r − Pz,n f oI,r ‖K = dI,r,n,z and ‖Pz,n f oI,r ‖K ≤ ‖ f oI,r ‖K, so Pz,n f oI,r ∈ Br(‖ · ‖K) ∩ spann GKx. Hence, equation 5.1 follows by the definition of the modulus of continuity.

Let us derive the upper bound on the modulus of continuity αI,r. For every f, g ∈ HK(X),

|ΦI,r( f ) − ΦI,r(g)| = |(1/m) ∑_{i=1}^m [( f (xi) − yi)^2 − (g(xi) − yi)^2]| = |(1/m) ∑_{i=1}^m ( f (xi) − g(xi)) ( f (xi) + g(xi) − 2 yi)| ≤ supx∈X | f (x) − g(x)| (supx∈X | f (x) + g(x)| + 2 |y|max).

By the reproducing property 2.1, for every x ∈ X we get | f (x) − g(x)| = |〈 f − g, Kx〉K| ≤ ‖ f − g‖K √K(x, x) ≤ sK ‖ f − g‖K, so

|ΦI,r( f ) − ΦI,r(g)| ≤ sK ‖ f − g‖K (sK ‖ f + g‖K + 2 |y|max).

Let ‖ f − g‖K ≤ t. Then ‖g‖K ≤ ‖ f ‖K + t, which, together with ‖ f + g‖K ≤ ‖ f ‖K + ‖g‖K, implies

|ΦI,r( f ) − ΦI,r(g)| ≤ 2 (‖ f ‖K sK^2 + |y|max sK) t + sK^2 t^2. (5.2)

For 0 < r ≤ ‖ f †z ‖K, by proposition 2(i), we have f oT,γ(r) = f oI,r, and so ‖ f oT,γ(r) ‖K = ‖ f oI,r ‖K = r. Hence, αI,r(t) ≤ a1 t + a2 t^2, where a1 = 2 (r sK^2 + |y|max sK) and a2 = sK^2.

The upper bound on dI,r,n,z and the expression of ΔI,r,n,z follow as in the proof of theorem 2(i).

(ii) As 0 < r ≤ ‖ f †z ‖K, by proposition 5.3(i) in Kurkova and Sanguineti (2005b) and the equivalence between Tikhonov and Ivanov regularizations (see proposition 2), we get

‖ f oI,r ‖GKx = ‖ f oT,γ(r) ‖GKx ≤ √m ‖y‖2/(γ(r) m + λmin).

Then by proposition 3(i), we get √m ‖y‖2/(γ(r) m + λmin) ≤ √m ‖y‖2/(((√λmin ‖y‖2 − r λmax)/(m r)) m + λmin), where the denominator is positive as by assumption r < √λmin ‖y‖2/(λmax − λmin). This implies the upper bound on ‖ f oI,r ‖GKx^2 given in the statement.

(iii) By proposition 5.1(iii) of Kurkova and Sanguineti (2005b) and the equivalence between Tikhonov and Ivanov regularizations (see proposition 2), we get

‖ fn − f oI,r ‖K^2 = ‖ fn − f oT,γ(r) ‖K^2 ≤ (ΦT,γ(r)( fn) − ΦT,γ(r)( f oI,r ) + εn)/γ(r) = (ΦI,r( fn) − ΦI,r( f oI,r ) + εn)/γ(r) + (‖ fn ‖K^2 − r^2),

which, combined with case i, gives case iii.

(iv) Combining case iii with ‖ fn,k ‖K ≤ r and proposition 3, we get

lim supk→+∞ ‖ fn,k − f oI,r ‖K^2 ≤ (1/γ(r)) αI,r(√(ΔI,r,n,z/n)) + lim supk→+∞ (‖ fn,k ‖K^2 − r^2) ≤ (1/γ(r)) αI,r(√(ΔI,r,n,z/n)) ≤ αI,r(√(ΔI,r,n,z/n)) · m r/(√λmin ‖y‖2 − r λmax).

Theorem 3(i) provides an estimate of the form

inf_{f∈Br(‖·‖K) ∩ spann GKx} ΦI,r( f ) − ΦI,r( f oI,r ) ≤ A ξ⌈n/2⌉(GKx)/√n + B (ξ⌈n/2⌉(GKx))^2/n, (5.3)

for suitable constants A, B > 0 (a worse but simpler upper bound on dI,r,n,z can be obtained proceeding as described in remark 1).

Now suppose that for a fixed r, the data generation model is such that

P{0 < r ≤ ‖ f †z ‖K} ≥ 1 − μI(r, m), (5.4)

for some function μI : (0, +∞) × N → [0, 1). As done in section 4 for Tikhonov regularization, if one has an a priori upper bound θ > 0 on the absolute value of the output y, X is compact, and the kernel K is continuous, then theorem 7 in the appendix (with R1 = r and R2 = max{θ/sK, r} = νI(r)), combined with equation 5.4 and the union-bound technique, gives, for every r > 0, every m ∈ N, and every τ ∈ (0, 1),

P{ 0 < r ≤ ‖ f †z ‖K & sup_{f∈Br(‖·‖K)} (E( f ) − Ez( f )) ≤ 16 sK^2 νI(r) (r + νI(r))/√m + 4 (sK νI(r))^2 √(ln(2/τ)/(2m)) } ≥ 1 − τ − μI(r, m). (5.5)

6 Estimates for Phillips- and Miller-Regularized Learning

For Phillips and Miller regularizations, the proof technique of theorem 3(i) cannot be applied for the following reason (a similar remark holds for the proof technique of theorem 2(i)). In the proof of theorem 3(i), we have shown that Pz,n f oI,r belongs to the hypothesis set Br(‖ · ‖K) of Ivanov regularization. However, in general, Pz,n f oP,η and Pz,n f oM,r,η do not belong to the hypothesis sets Fz,η and Gz,r,η of Phillips and Miller regularizations, respectively.

A different way to obtain rates of approximate optimization in the Phillips and Miller regularizations is suggested by theorem 4.2(i) in Kurkova and Sanguineti (2005a), which, however, requires that 0 is in the topological interior of Fz,η and Gz,r,η, respectively, whereas in general even 0 ∉ Fz,η and 0 ∉ Gz,r,η. Note that, instead, all the other hypotheses of theorem 4.2(i) of Kurkova and Sanguineti (2005a) are satisfied. A value of η sufficiently large would allow guaranteeing 0 ∈ int(Fz,η) and 0 ∈ int(Gz,r,η); however, such a large value might determine an undesired tolerance on the data misfit of the corresponding regularized solutions. For these reasons, to estimate the accuracy of suboptimal solutions to Phillips and Miller regularizations, we use a different approach, which also has the advantage of being applicable to all four regularization methods. This approach can be summarized in the following steps (a small numerical sketch of these steps, for the case of Phillips regularization, follows the list):

1. If assumption 1 holds, then by theorem 1 the solutions to Tikhonov, Ivanov, Phillips, and Miller regularizations belong to the family Fz of functions, defined as

Fz = { f αz = ∑_{i=1}^m cα,i Kxi : α > 0 and cα = (cα,1, . . . , cα,m)T is the unique solution of the well-posed linear system (α m I + K[x]) cα = y }.

2. For every f αz ∈ Fz, we shall derive upper bounds on ‖ f αz ‖K and Ez( f αz ).

3. We shall combine the bounds from step 2 with an upper bound on the approximation error ‖ f αz − Pz,n f αz ‖K, where Pz,n f αz is any best approximation of f αz on spann GKx.

4. For each regularization method, we shall exploit the bounds obtained in step 3 to find a set of values of α for which the function Pz,n f αz belongs to the hypothesis set of the corresponding regularization problem, and we choose the value of α that gives the smallest upper bound on the corresponding functional.
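
The following is a minimal numerical sketch of steps 1 to 4 for the case of Phillips regularization, under simplifying assumptions: the n centers are fixed in advance (rather than optimized over all n-element subsets), the projection onto their span is the RKHS-orthogonal one, and the kernel, data, η, and the grid of α values are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(5)
m, n, eta = 30, 6, 0.4
x = rng.uniform(-1, 1, size=m)
y = np.cos(2 * x) + 0.1 * rng.normal(size=m)
K = np.exp(-4.0 * (x[:, None] - x[None, :]) ** 2)
S = np.argsort(x)[::m // n][:n]                    # n spread-out centers (fixed here)

best = None
for alpha in np.logspace(-4, 1, 30):               # step 1: f_z^alpha in F_z
    c = np.linalg.solve(alpha * m * np.eye(m) + K, y)
    # step 3 (step 2 is implicit, since on this toy problem the quantities are exact):
    b = np.linalg.solve(K[np.ix_(S, S)], K[S, :] @ c)   # RKHS projection onto span of centers
    emp_err = np.mean((K[:, S] @ b - y) ** 2)      # E_z of the projected function
    norm_sq = b @ K[np.ix_(S, S)] @ b              # its squared K-norm
    if emp_err <= eta ** 2:                        # step 4: inside F_{z,eta}?
        if best is None or norm_sq < best[0]:
            best = (norm_sq, alpha)

# best is None if no alpha on the grid yields an admissible projection
print("admissible alpha with smallest ||.||_K^2:", best)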

The next proposition provides the bounds on ‖ f αz ‖K and Ez( f αz ) required in step 2.

Proposition 7. Let X be a nonempty set, K : X × X → R a psd kernel, m a positive integer, z a data sample of size m, and λmin and λmax the minimum and maximum eigenvalues of K[x], respectively. Then for every f αz ∈ Fz, the following hold:

i. (√λmin/(α m + λmax)) ‖y‖2 ≤ ‖ f αz ‖K ≤ (√λmax/(α m + λmin)) ‖y‖2.

ii. (α^2 m/(α m + λmax)^2) ‖y‖2^2 ≤ Ez( f αz ) ≤ (α^2 m/(α m + λmin)^2) ‖y‖2^2.

Proof. (i) We proceed as in the proof of proposition 3(i) (better estimates, with a more complex form, can be obtained by proceeding as in the proofs of proposition 3(ii) or proposition 4). As f αz = ∑_{i=1}^m cα,i Kxi, by the reproducing property 2.1, we get ‖ f αz ‖K = √(cαT K[x] cα). Since cα = (α m I + K[x])^{−1} y, we have ‖ f αz ‖K ≤ √λmax ‖cα‖2 ≤ (√λmax/(α m + λmin)) ‖y‖2 and ‖ f αz ‖K ≥ √λmin ‖cα‖2 ≥ (√λmin/(α m + λmax)) ‖y‖2.

(ii) By the theory of Tikhonov regularization (see Bertero, 1989, and Engl et al., 2000), we get

Ez( f αz ) = (1/m) ‖Lx L∗x (Lx L∗x + α I)^{−1} y − y‖2^2 = (α^2/m) ‖(Lx L∗x + α I)^{−1} y‖2^2,

where Lx : HK(X) → R^m is the operator defined for every f ∈ HK(X) as Lx( f ) = ( f (x1)/√m, . . . , f (xm)/√m)T. By proposition 3.1(iv) of Kurkova (2004), the operator Lx L∗x : R^m → R^m can be represented in the canonical basis of R^m by the matrix (1/m) K[x], so we get

Ez( f αz ) = α^2 m ‖(K[x] + α m I)^{−1} y‖2^2,

from which the statement easily follows, proceeding as in the proof of (i).
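
A quick numeric sanity check of proposition 7 on toy data (the kernel, α, and the sample below are arbitrary assumptions): the K-norm and the empirical error of f αz fall between the stated eigenvalue-based bounds.

import numpy as np

rng = np.random.default_rng(6)
m, alpha = 25, 0.05
x = rng.uniform(-1, 1, size=m)
y = np.sin(3 * x) + 0.1 * rng.normal(size=m)
K = np.exp(-3.0 * (x[:, None] - x[None, :]) ** 2)
lam = np.linalg.eigvalsh(K)
lmin, lmax = lam.min(), lam.max()
ny = np.linalg.norm(y)                             # ||y||_2

c = np.linalg.solve(alpha * m * np.eye(m) + K, y)
norm_K = np.sqrt(c @ K @ c)                        # ||f_z^alpha||_K
emp_err = np.mean((K @ c - y) ** 2)                # E_z(f_z^alpha)

print(np.sqrt(lmin) / (alpha * m + lmax) * ny, "<=", norm_K,
      "<=", np.sqrt(lmax) / (alpha * m + lmin) * ny)
print(alpha**2 * m * ny**2 / (alpha * m + lmax)**2, "<=", emp_err,
      "<=", alpha**2 * m * ny**2 / (alpha * m + lmin)**2)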

The next proposition provides the bounds on ‖Pz,n f αz ‖K and Ez(Pz,n f αz ) required by step 3, by combining the estimates on ‖ f αz ‖K and Ez( f αz ) from proposition 7 with an upper bound on ‖ f αz − Pz,n f αz ‖K:


Proposition 8. Let X be a nonempty set, K : X × X → R a psd kernel with sK = supx∈X √K(x, x) < +∞, m a positive integer, z a data sample of size m, and λmin, λmax the minimum and maximum eigenvalues of K[x], respectively. For f αz ∈ Fz, let Pz,n f αz be a best approximation of f αz on spann GKx, uα = √(sK^2 ‖ f αz ‖GKx^2 − ‖ f αz ‖K^2), and u = √(sK^2 m − λmin). Then:

i. ‖Pz,n f αz ‖K ≤ ‖ f αz ‖K + uα/√n.

ii. ‖Pz,n f αz ‖K ≤ √λmax ‖y‖2/(α m + λmin) + (‖y‖2/(α m + λmin)) u/√n.

iii. Ez(Pz,n f αz ) ≤ Ez( f αz ) + (2 ‖ f αz ‖K sK^2 + |y|max sK) uα/√n + sK^2 uα^2/n.

iv. Ez(Pz,n f αz ) ≤ α^2 m ‖y‖2^2/(α m + λmin)^2 + (sK ‖y‖2/(α m + λmin)) (2 sK ‖y‖2 √λmax/(α m + λmin) + |y|max) u/√n + sK^2 ‖y‖2^2 u^2/((α m + λmin)^2 n).

Proof. Let $d_{\alpha,n,z} = \inf_{g \in \mathrm{span}_n G_{Kx}} \|f_z^\alpha - g\|_K = \|P_{z,n} f_z^\alpha - f_z^\alpha\|_K$.

Case i follows by combining $\|P_{z,n} f_z^\alpha\|_K \le \|f_z^\alpha\|_K + d_{\alpha,n,z}$ with the upper bound on $d_{\alpha,n,z}$ from theorem 3.1(i) of Kurkova and Sanguineti (2005a) (better estimates, with a more complex expression, can be obtained exploiting the upper bound on $d_{\alpha,n,z}$ from theorem 1 of Makovoz, 1996).

Case ii follows by combining case i with the upper bound on $\|f_z^\alpha\|_K$ from proposition 7(i) and the upper bound $u_\alpha^2 \le \frac{(s_K^2 m - \lambda_{\min})\,\|y\|_2^2}{(\alpha m + \lambda_{\min})^2}$ from proposition 5.3(iii) of Kurkova and Sanguineti (2005b).

For case iii, by equation 5.2, $|\mathcal{E}_z(P_{z,n} f_z^\alpha) - \mathcal{E}_z(f_z^\alpha)| \le 2\big(\|f_z^\alpha\|_K s_K^2 + |y|_{\max} s_K\big)\, t + s_K^2\, t^2$ for every $t \ge \|P_{z,n} f_z^\alpha - f_z^\alpha\|_K = d_{\alpha,n,z}$. We conclude by combining this with the upper bound on $d_{\alpha,n,z}$ from theorem 3.1(iii) of Kurkova and Sanguineti (2005a).

Case iv follows by combining case iii with the upper bounds on $\mathcal{E}_z(f_z^\alpha)$, $\|f_z^\alpha\|_K$, and $u_\alpha$ given by proposition 7(ii) and by propositions 5.3(ii) and 5.3(iii) of Kurkova and Sanguineti (2005b), respectively.
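For small m, the best approximation $P_{z,n} f_z^\alpha$ and the deviation $d_{\alpha,n,z}$ appearing above can be computed exactly by orthogonally projecting, in the RKHS norm, onto the span of every n-element subset of the dictionary $G_{Kx}$ and keeping the best subset. The brute-force sketch below is only meant to make these quantities concrete (it is not an algorithm advocated in the letter) and assumes the Gram matrix K and the coefficient vector c of the preceding sketch.

from itertools import combinations
import numpy as np

def best_n_term(K, c, n):
    """Return (d^2, S, b): squared RKHS distance, best support, projection coefficients."""
    Kc = K @ c                           # <f_z^alpha, K_{x_j}>_K for all j
    norm2 = c @ Kc                       # ||f_z^alpha||_K^2
    best = (np.inf, None, None)
    for S in combinations(range(K.shape[0]), n):
        S = list(S)
        b = np.linalg.solve(K[np.ix_(S, S)], Kc[S])   # orthogonal projection onto span{K_{x_j}: j in S}
        d2 = norm2 - Kc[S] @ b           # ||f_z^alpha - P_S f_z^alpha||_K^2
        if d2 < best[0]:
            best = (d2, S, b)
    return best

# d2, S, b = best_n_term(K, c, n=3)      # K, c as in the previous sketch
# print(np.sqrt(max(d2, 0.0)), S)        # d_{alpha,n,z} and the selected support

The exhaustive search is exponential in m; section 7 discusses algorithms that trade this exactness for tractability.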

The choice between the bounds given in proposition 8(i) and 8(ii) and, similarly, the choice between those given in proposition 8(iii) and 8(iv) depend on the availability of sufficiently accurate expressions for $\|f_z^\alpha\|_K$, $\mathcal{E}_z(f_z^\alpha)$, and $\|f_z^\alpha\|_{G_{Kx}}$ as functions of α.

Remark 4. It is worth remarking on the behavior for α → 0 and α → +∞ of the bounds provided by proposition 8. For α → 0, one has $f_z^\alpha \to f_z^\dagger$ in the norm of the RKHS by the theory of Tikhonov regularization (Engl et al., 2000), $\mathcal{E}_z(f_z^\alpha) \to \mathcal{E}_z(f_z^\dagger)$ (as a consequence of the reproducing property 2.1), and $u_\alpha^2 = s_K^2\,\|f_z^\alpha\|_{G_{Kx}}^2 - \|f_z^\alpha\|_K^2 \to s_K^2\,\|f_z^\dagger\|_{G_{Kx}}^2 - \|f_z^\dagger\|_K^2$ (since G-variation is a norm (Kurkova, 1997), $\mathrm{span}\, G_{Kx}$ is finite-dimensional, and all norms are equivalent in a finite-dimensional space; see Bachmann & Narici, 2000, theorem 8.7). Thus, setting $u_z^\dagger = \sqrt{s_K^2\,\|f_z^\dagger\|_{G_{Kx}}^2 - \|f_z^\dagger\|_K^2}$, for α → 0 proposition 8(i) and 8(iii) imply $\lim_{\alpha \to 0} \|P_{z,n} f_z^\alpha\|_K \le \|f_z^\dagger\|_K + \frac{u_z^\dagger}{\sqrt{n}}$ and $\lim_{\alpha \to 0} \mathcal{E}_z(P_{z,n} f_z^\alpha) \le \mathcal{E}_z(f_z^\dagger) + \big(2\|f_z^\dagger\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_z^\dagger}{\sqrt{n}} + s_K^2\,\frac{(u_z^\dagger)^2}{n}$. Similarly, for α → +∞, one has $f_z^\alpha \to 0$, $\mathcal{E}_z(f_z^\alpha) \to \mathcal{E}_z(0)$, and $u_\alpha^2 = s_K^2\,\|f_z^\alpha\|_{G_{Kx}}^2 - \|f_z^\alpha\|_K^2 \to 0$. So for α → +∞, proposition 8(ii) and the reproducing property 2.1 imply $\lim_{\alpha \to +\infty} \|P_{z,n} f_z^\alpha\|_K = 0$ and $\lim_{\alpha \to +\infty} \mathcal{E}_z(P_{z,n} f_z^\alpha) = \frac{\|y\|_2^2}{m} = \mathcal{E}_z(0)$. This analysis of the limit behavior of $f_z^\alpha$ allows one to parameterize $f_z^\alpha$ in terms of α ∈ [0, +∞].

Finally, step 4 is performed as follows for each regularization method. As done in sections 4 and 5, we assume that an a priori upper bound θ > 0 on the absolute value of the output y is available, that X is compact, and that the kernel K is continuous. For simplicity of exposition, we suppose that the expressions of $\|f_z^\alpha\|_K$ and $\mathcal{E}_z(f_z^\alpha)$ are available; when this is not the case, one can exploit, instead of the upper bounds given by proposition 8(i) and 8(iii), those from proposition 8(ii) and 8(iv), respectively. Finally, without loss of generality we suppose that the minima of equations 6.1, 6.3, 6.6, and 6.10 with respect to α in the respective sets are achieved (otherwise, one considers ε-near minimum points).

6.1 Tikhonov Regularization. For γ > 0 and a positive integer n < m, we search for α ∈ [0, +∞] that minimizes

$$\mathcal{E}_z(f_z^\alpha) + \big(2\|f_z^\alpha\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_\alpha}{\sqrt{n}} + s_K^2\,\frac{u_\alpha^2}{n} + \gamma\Big(\|f_z^\alpha\|_K + \frac{u_\alpha}{\sqrt{n}}\Big)^2. \qquad (6.1)$$

By remark 4, for α = +∞ we have

$$\mathcal{E}_z(f_z^{+\infty}) + \big(2\|f_z^{+\infty}\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_{+\infty}}{\sqrt{n}} + s_K^2\,\frac{u_{+\infty}^2}{n} + \gamma\Big(\|f_z^{+\infty}\|_K + \frac{u_{+\infty}}{\sqrt{n}}\Big)^2 = \mathcal{E}_z(0) \le \theta^2.$$

So $\|P_{z,n} f_z^\alpha\|_K \le \|f_z^\alpha\|_K \le \theta/\sqrt{\gamma}$, and by proposition 8(i) and 8(iii), we get

$$\inf_{f \in \mathrm{span}_n G_{Kx} \cap B_{\theta/\sqrt{\gamma}}(\|\cdot\|_K)} \Phi_{T,\gamma}(f) \le \Phi_{T,\gamma}(P_{z,n} f_z^\alpha) \le \mathcal{E}_z(f_z^\alpha) + \big(2\|f_z^\alpha\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_\alpha}{\sqrt{n}} + s_K^2\,\frac{u_\alpha^2}{n} + \gamma\Big(\|f_z^\alpha\|_K + \frac{u_\alpha}{\sqrt{n}}\Big)^2. \qquad (6.2)$$


Recall that for every γ > 0, every positive integer m, and every τ ∈ (0, 1), the SLT bound of equation 4.7 holds.
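To make the search over α in step 4 concrete for the Tikhonov case, the following sketch (illustrative only) evaluates the upper bound 6.1 on a grid of α values and keeps the minimizer. It assumes that the $G_{Kx}$-variation of $f_z^\alpha$ equals the ℓ1 norm of its coefficient vector (which holds for a pd kernel and distinct data points, since the representation is then unique), reuses K and y from the earlier sketches, and takes $s_K = 1$, as for the gaussian kernel.

import numpy as np

def tikhonov_upper_bound(alpha, K, y, n, gamma, s_K=1.0):
    m = len(y)
    c = np.linalg.solve(alpha * m * np.eye(m) + K, y)
    norm_K = np.sqrt(c @ K @ c)              # ||f_z^alpha||_K
    emp_err = np.mean((K @ c - y) ** 2)      # E_z(f_z^alpha)
    var_G = np.abs(c).sum()                  # assumed value of ||f_z^alpha||_{G_Kx}
    u_alpha = np.sqrt(max(s_K**2 * var_G**2 - norm_K**2, 0.0))
    y_max = np.abs(y).max()
    return (emp_err
            + (2 * norm_K * s_K**2 + y_max * s_K) * u_alpha / np.sqrt(n)
            + s_K**2 * u_alpha**2 / n
            + gamma * (norm_K + u_alpha / np.sqrt(n)) ** 2)

# alphas = np.logspace(-4, 2, 200)
# best_alpha = min(alphas, key=lambda a: tikhonov_upper_bound(a, K, y, n=3, gamma=0.1))

A logarithmic grid is a pragmatic stand-in for the minimization over α ∈ [0, +∞]; the endpoints α → 0 and α → +∞ can be handled separately via the limits discussed in remark 4.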

6.2 Ivanov Regularization. For r > 0 and a positive integer n < m, consider the set $I_r = \{\alpha \in [0, +\infty] : \|f_z^\alpha\|_K + \frac{u_\alpha}{\sqrt{n}} \le r\}$. By proposition 8(i), α ∈ $I_r$ implies $\|P_{z,n} f_z^\alpha\|_K \le r$, and by remark 4, $I_r \ne \emptyset$ (indeed, $\alpha = +\infty \in I_r$). Then we search for α ∈ $I_r$ that minimizes

$$\mathcal{E}_z(f_z^\alpha) + \big(2\|f_z^\alpha\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_\alpha}{\sqrt{n}} + s_K^2\,\frac{u_\alpha^2}{n}. \qquad (6.3)$$

By proposition 8(iii), we get

$$\inf_{f \in B_r(\|\cdot\|_K) \cap \mathrm{span}_n G_{Kx}} \Phi_{I,r}(f) \le \Phi_{I,r}(P_{z,n} f_z^\alpha) \le \mathcal{E}_z(f_z^\alpha) + \big(2\|f_z^\alpha\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_\alpha}{\sqrt{n}} + s_K^2\,\frac{u_\alpha^2}{n}.$$

Following the same arguments as at the end of section 5, for every r > 0, every m ∈ ℕ, and every τ ∈ (0, 1), we have the SLT bound

$$P\Big\{\sup_{f \in B_r(\|\cdot\|_K)} \big(\mathcal{E}(f) - \mathcal{E}_z(f)\big) \le \frac{16 s_K^2 \nu_I(r)}{\sqrt{m}}\big(r + \nu_I(r)\big) + 4\big(s_K \nu_I(r)\big)^2 \sqrt{\frac{\ln(2/\tau)}{2m}}\Big\} \ge 1 - \tau. \qquad (6.4)$$

6.3 Miller Regularization. Suppose that the data generation model is such that for r > 0, η ≥ 0, and positive integers m and n < m, the set

$$A_{z,r,\eta,n} = \Big\{\alpha \in [0, +\infty] : \|f_z^\alpha\|_K + \frac{u_\alpha}{\sqrt{n}} \le r \ \&\ \mathcal{E}_z(f_z^\alpha) + \big(2\|f_z^\alpha\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_\alpha}{\sqrt{n}} + s_K^2\,\frac{u_\alpha^2}{n} \le \eta^2\Big\} \qquad (6.5)$$

is nonempty with a priori probability at least $1 - \mu_M(r, \eta, m, n)$, for a suitable [0, 1)-valued function $\mu_M(r, \eta, m, n)$ (the condition $A_{z,r,\eta,n} \ne \emptyset$ implies $G_{z,\eta,r} \ne \emptyset$). Then we search for α ∈ $A_{z,r,\eta,n}$ that minimizes

$$\mathcal{E}_z(f_z^\alpha) + \big(2\|f_z^\alpha\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_\alpha}{\sqrt{n}} + s_K^2\,\frac{u_\alpha^2}{n} + \Big(\frac{\eta}{r}\Big)^2\Big(\|f_z^\alpha\|_K + \frac{u_\alpha}{\sqrt{n}}\Big)^2. \qquad (6.6)$$


By proposition 8(i) and 8(iii), we get

$$\inf_{f \in G_{z,\eta,r} \cap\, \mathrm{span}_n G_{Kx}} \Phi_{M,r,\eta}(f) \le \Phi_{M,r,\eta}(P_{z,n} f_z^\alpha) \le \mathcal{E}_z(f_z^\alpha) + \big(2\|f_z^\alpha\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_\alpha}{\sqrt{n}} + s_K^2\,\frac{u_\alpha^2}{n} + \Big(\frac{\eta}{r}\Big)^2\Big(\|f_z^\alpha\|_K + \frac{u_\alpha}{\sqrt{n}}\Big)^2. \qquad (6.7)$$

Following the same arguments used to derive equation 5.5, for every r > 0, every η ≥ 0, every pair of positive integers n and m with n < m, and every τ ∈ (0, 1), we have the SLT bound

$$P\Big\{A_{z,r,\eta,n} \ne \emptyset \ \&\ \sup_{f \in B_r(\|\cdot\|_K)} \big(\mathcal{E}(f) - \mathcal{E}_z(f)\big) \le \frac{16 s_K^2 \nu_I(r)}{\sqrt{m}}\big(r + \nu_I(r)\big) + 4\big(s_K \nu_I(r)\big)^2 \sqrt{\frac{\ln(2/\tau)}{2m}}\Big\} \ge 1 - \tau - \mu_M(r, \eta, m, n). \qquad (6.8)$$

The estimates derived above for Tikhonov, Ivanov, and Miller regularizations can be refined by applying the techniques mentioned in remark 3 and making suitable assumptions on the data generation model. The estimates that we derive for Phillips regularization employ a refinement based on remark 3.

6.4 Phillips Regularization. This case requires a different approach from the previous ones. Indeed, in Phillips regularization, one has a given upper bound on the square of the empirical error $\mathcal{E}_z$, and one aims to minimize $\|\cdot\|_K^2$ under such a constraint. This can be interpreted as follows. Fix two sequences $\{r_j > 0\}$ and $\{a_j > 0\}$ such that $\sum_{j=1}^{+\infty} a_j = 1$. Then an SLT bound of the form

$$P\Big\{\forall j \in \mathbb{N},\ \forall f \in B_{r_j}(\|\cdot\|_K) :\ \mathcal{E}(f) - \mathcal{E}_z(f) \le \frac{16 s_K^2 \nu_I(r_j)}{\sqrt{m}}\big(r_j + \nu_I(r_j)\big) + 4\big(s_K \nu_I(r_j)\big)^2 \sqrt{\frac{\ln(2/(a_j \tau))}{2m}}\Big\} \ge 1 - \tau \qquad (6.9)$$

can be easily derived (compare it with equations 4.8 and 5.5). Suppose that $\{r_j\}$ is an increasing sequence with $\lim_{j \to \infty} r_j = +\infty$ and that $\{a_j\}$ is a decreasing sequence. Note that the sequence $\{\nu_I(r_j)\} = \{\max\{\frac{\theta}{s_K}, r_j\}\}$ is nondecreasing. Then, when looking for a suboptimal solution to problem $P_\eta$ in $F_{z,\eta} \cap \mathrm{span}_n G_{Kx}$, the smallest upper bound on $\mathcal{E}(f) - \mathcal{E}_z(f)$ in equation 6.9 is obtained for the minimum $r_j$ in the sequence such that the set $F_{z,\eta} \cap \mathrm{span}_n G_{Kx} \cap B_{r_j}(\|\cdot\|_K)$ is nonempty (one might search for the smallest upper bound on $\mathcal{E}(f)$ on the whole $\mathrm{span}_n G_{Kx}$, but it does not correspond to Phillips regularization in the strict sense).

For η ≥ 0 and a positive integer n < m, consider the set $A_\eta = \big\{\alpha \in [0, +\infty] : \mathcal{E}_z(f_z^\alpha) + \big(2\|f_z^\alpha\|_K s_K^2 + |y|_{\max} s_K\big)\frac{u_\alpha}{\sqrt{n}} + s_K^2\,\frac{u_\alpha^2}{n} \le \eta^2\big\}$. By proposition 8(iii), α ∈ $A_\eta$ implies $\mathcal{E}_z(P_{z,n} f_z^\alpha) \le \eta^2$, and by remark 4, if the kernel is pd and x has no repeated entries $x_i$, then $A_\eta \ne \emptyset$ (indeed, $\alpha = 0 \in A_\eta$). Then we search for α ∈ $A_\eta$ that minimizes

$$\|f_z^\alpha\|_K + \frac{u_\alpha}{\sqrt{n}}. \qquad (6.10)$$

So by proposition 8(i), we get

$$\inf_{f \in F_{z,\eta} \cap\, \mathrm{span}_n G_{Kx}} \Phi_{P,\eta}(f) \le \Phi_{P,\eta}(P_{z,n} f_z^\alpha) \le \|f_z^\alpha\|_K + \frac{u_\alpha}{\sqrt{n}}. \qquad (6.11)$$
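A corresponding sketch for the Phillips case (same assumptions as the Tikhonov sketch above, including the ℓ1 computation of the $G_{Kx}$-variation and $s_K = 1$) first restricts a grid of α values to the set $A_\eta$, using the bound of proposition 8(iii), and then minimizes the bound on the RKHS norm from proposition 8(i).

import numpy as np

def phillips_choice(alphas, K, y, n, eta, s_K=1.0):
    m, feasible = len(y), []
    for a in alphas:
        c = np.linalg.solve(a * m * np.eye(m) + K, y)
        norm_K = np.sqrt(c @ K @ c)
        u_a = np.sqrt(max(s_K**2 * np.abs(c).sum()**2 - norm_K**2, 0.0))
        err_bound = (np.mean((K @ c - y) ** 2)
                     + (2 * norm_K * s_K**2 + np.abs(y).max() * s_K) * u_a / np.sqrt(n)
                     + s_K**2 * u_a**2 / n)
        if err_bound <= eta**2:                  # alpha belongs to A_eta
            feasible.append((norm_K + u_a / np.sqrt(n), a))
    return min(feasible) if feasible else None   # smallest bound 6.10 and its alpha

The Ivanov and Miller cases can be handled analogously by swapping which of the two bounds plays the role of the constraint and which is minimized.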

We conclude this section by considering suboptimal solutions in $\mathrm{span}_n G_K$, for n < m.

For the gaussian kernel, the next proposition shows that in general it is not possible to express a function of the form $f = \sum_{i=1}^m c_i K_{x_i}$ as a finite linear combination of gaussians with centers not all equal to the m data points $x_i$, $i = 1, \ldots, m$, unless an approximation error is introduced. More generally, the proposition holds for every pd kernel. Linear independence of gaussians with varying widths and centroids was proven in Kurkova and Neruda (1994).

Proposition 9. Let K be the gaussian kernel, $x_i \in \mathbb{R}^d$, $i = 1, \ldots, m$, and $f = \sum_{i=1}^m c_i K_{x_i}$ such that at least one $c_i$ is different from 0. Then for every positive integer k, every $\{\bar{c}_j \in \mathbb{R} : j = 1, \ldots, k\}$, and every set $\{\bar{x}_j \in \mathbb{R}^d : j = 1, \ldots, k\}$, if $\{x_i : i = 1, \ldots, m \ \text{and}\ c_i \ne 0\} \not\subseteq \{\bar{x}_j : j = 1, \ldots, k\}$, there exists $\bar{x} \in \{x_i\} \cup \{\bar{x}_j\}$ such that

$$\sum_{i=1}^m c_i K_{x_i}(\bar{x}) \ne \sum_{j=1}^k \bar{c}_j K_{\bar{x}_j}(\bar{x}).$$

Proof. The statement is a consequence of the fact that for every positive integer h, h gaussians with equal widths but different centers form a set of linearly independent functions. Indeed, for any h-tuple of input data, the corresponding Gram matrix has full rank (Scholkopf & Smola, 2002, theorem 2.18).
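The fact used in the proof can also be checked numerically: the Gram matrix of gaussians with equal width and distinct centers is nonsingular, hence the functions $K_{x_1}, \ldots, K_{x_h}$ are linearly independent. A minimal sketch (the centers, dimension, and width are arbitrary assumptions) is the following.

import numpy as np

rng = np.random.default_rng(1)
centers = rng.uniform(-1.0, 1.0, size=(8, 3))                # distinct points in R^d
sq = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
G = np.exp(-sq / 2.0)                                         # gaussian Gram matrix
print(np.linalg.matrix_rank(G), np.linalg.eigvalsh(G).min())  # full rank, smallest eigenvalue > 0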

However, studying approximation error bounds achievable by families of functions of the form $\sum_{j=1}^k \bar{c}_j K_{\bar{x}_j}$ is still of interest, especially when $k = n \ll m$. Inspection of the proofs of our results shows that, with the exception of proposition 6, they still hold if one considers $\mathrm{span}_n G_K$ instead of $\mathrm{span}_n G_{Kx}$, as $\mathrm{span}_n G_{Kx} \subseteq \mathrm{span}_n G_K$. Estimating the improvement obtained by using the larger hypothesis set $M \cap \mathrm{span}_n G_K$ instead of $M \cap \mathrm{span}_n G_{Kx}$ may be a subject of future research.

7 On Algorithms for Sparse Suboptimal Solutions

Various algorithms have been proposed to find sparse suboptimal solutions to approximation and optimization problems. In the following, we briefly focus on those useful for the regularization techniques considered in this letter.

The context common to all such algorithms is the following. Given a (typically redundant) set D of functions, called dictionary, which are elements of a finite- or infinite-dimensional Hilbert space H, for a "small" positive integer n, one aims to find an accurate suboptimal solution $f_n^s$ from $\mathrm{span}_n D$ to a function approximation problem or, more generally, to a functional optimization problem.

When no structure is imposed on the dictionary D, the problem of finding the best approximation of a function f ∈ H from $\mathrm{span}_n D$ is NP-hard (Davis, Mallat, & Avellaneda, 1997). However, the problem may drastically simplify when the elements of the dictionary have a suitable structure. The simplest situation arises when they are orthogonal. The case of a dictionary with nearly orthogonal elements, that is, a dictionary with small coherence (Gribonval & Vandergheynst, 2006), stays halfway between these two extremes and provides a computationally tractable problem, for which constructive approximation results are available (Das & Kempe, 2008; Gilbert, Muthukrishnan, & Strauss, 2003; Gribonval & Vandergheynst, 2006; Tropp, 2004). As in our context $D = G_{Kx}$, we may want to choose a kernel K such that the dictionary $G_{Kx}$ has small coherence. If this is not possible, then, as in Honeine et al. (2007), one may consider as dictionary a suitable subset of $G_{Kx}$ with a small coherence.
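The coherence of the dictionary $G_{Kx}$ can be computed directly from the Gram matrix, since $\langle K_{x_i}, K_{x_j}\rangle_K = K(x_i, x_j)$ and $\|K_{x_i}\|_K = \sqrt{K(x_i, x_i)}$. The sketch below uses the standard definition (largest normalized inner product between two distinct elements); it is an illustration, not a construction from this letter.

import numpy as np

def dictionary_coherence(K):
    d = np.sqrt(np.diag(K))
    C = np.abs(K) / np.outer(d, d)     # |<K_xi, K_xj>_K| / (||K_xi||_K ||K_xj||_K)
    np.fill_diagonal(C, 0.0)           # ignore the trivial i = j terms
    return C.max()

For a gaussian kernel, for example, the coherence of $G_{Kx}$ decreases as the kernel width shrinks relative to the spacing of the data points.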

In the remainder of this section, we discuss three families of algorithms to derive sparse suboptimal solutions.

7.1 Greedy Algorithms. Starting from an initial sparse suboptimal solution $f_n^s$ with a small n (usually n = 0 and $f_0^s = 0$), typically greedy algorithms obtain inductively an (n + 1)-term suboptimal solution $f_{n+1}^s$ as a linear combination of the n-term one $f_n^s$ and a new element from the dictionary. So a sequence of low-dimensional optimization problems has to be solved. Depending on how such problems are defined, different kinds of greedy algorithms are obtained (see, e.g., Zhang, 2002a, 2002b, 2003). These algorithms are particularly suitable to derive sparse suboptimal solutions to Tikhonov regularization; however, in some cases, their rates of approximate optimization are worse than our estimate 4.3, as discussed in section 4. Moreover, the rates derived in Zhang (2002a, 2002b, 2003) do not apply to Ivanov, Miller, and Phillips regularizations (the possibility of extending Zhang, 2002b, lemma 3.1, to such regularizations may be a subject of future research; see also remark 2).

Remarkable properties were proven for matching pursuit and orthogonal matching pursuit; SLT bounds for their kernel versions, known as kernel matching pursuit (Vincent & Bengio, 2002), were derived in Hussain and Shawe-Taylor (2009, theorem 3). Given f ∈ H and provided that the positive integer n and the dictionary D are suitably chosen, the n-term approximations found by these two algorithms are only a well-defined factor C(n) > 0 worse than the best approximation of f in terms of any n elements of D (see Gilbert et al., 2003, theorem 2.1; Tropp, 2004, theorem 2.6; Gribonval & Vandergheynst, 2006, featured theorem 3; and Das & Kempe, 2008, theorem 3.5, for some values of C(n) and estimates on the number of the iterations). In the framework of section 6 and under the assumption that for a given α > 0, the function $f_z^\alpha$ is known,4 kernel versions of matching pursuit and orthogonal matching pursuit may be applied to find its sparse approximation $f_{z,n}^s$. If $\|f_z^\alpha - f_{z,n}^s\|_K \le C(n)\,\|f_z^\alpha - P_{z,n} f_z^\alpha\|_K$, then $\|P_{z,n} f_z^\alpha - f_{z,n}^s\|_K \le (C(n) + 1)\,\|f_z^\alpha - P_{z,n} f_z^\alpha\|_K$. Note that the results of section 6 provide upper bounds on $\|f_z^\alpha - P_{z,n} f_z^\alpha\|_K$. The differences $|\mathcal{E}_z(f_z^\alpha) - \mathcal{E}_z(f_{z,n}^s)|$ and $|\mathcal{E}_z(P_{z,n} f_z^\alpha) - \mathcal{E}_z(f_{z,n}^s)|$ can be bounded from above in terms of the modulus of continuity of the functional $\mathcal{E}_z$.

4 The algorithms in Das and Kempe (2008), Gilbert et al. (2003), Gribonval and Vandergheynst (2006), and Tropp (2004) require the knowledge of the scalar products between the function to be approximated (in our case $f_z^\alpha$) and the elements of the dictionary (in our case $G_{Kx}$).
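As an illustration of how such algorithms operate on the dictionary $G_{Kx}$, the sketch below implements a basic kernel orthogonal matching pursuit in the RKHS norm (a simplified variant, not the exact procedures analyzed in the cited works). It only needs the Gram matrix K and the coefficient vector c of $f_z^\alpha$ from the earlier sketches, since all the required inner products are kernel evaluations.

import numpy as np

def kernel_omp(K, c, n):
    """Greedy n-term approximation of f = sum_i c_i K_{x_i} in the RKHS norm."""
    target = K @ c                       # <f, K_{x_j}>_K for all j
    norms = np.sqrt(np.diag(K))          # ||K_{x_j}||_K
    S, b = [], np.zeros(0)
    for _ in range(n):
        resid_ip = target - (K[:, S] @ b if S else 0.0)   # <f - f_S, K_{x_j}>_K
        scores = np.abs(resid_ip) / norms
        scores[S] = -np.inf              # never select the same atom twice
        S.append(int(np.argmax(scores)))
        b = np.linalg.solve(K[np.ix_(S, S)], target[S])   # re-project onto span of selected atoms
    return S, b

# S, b = kernel_omp(K, c, n=3)           # sparse n-term approximation of f_z^alpha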

7.2 Algorithms Based on Low-Rank Approximation of the Gram Matrix. Another possibility is to replace the Gram matrix by a low-rank approximation obtained by using greedy algorithms or randomization techniques (Smola & Scholkopf, 2000; Drineas & Mahoney, 2005). Low-rank approximations can be used, for example, to find a sparse suboptimal solution to Tikhonov regularization (Smola & Scholkopf, 2000).
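A minimal example of this idea is the Nystrom-type approximation sketched below, which builds a rank-at-most-n approximation of the Gram matrix from a uniformly sampled subset of its columns; the cited works analyze more refined greedy and randomized sampling schemes.

import numpy as np

def nystrom_approximation(K, n, rng=np.random.default_rng(0)):
    S = rng.choice(K.shape[0], size=n, replace=False)
    C = K[:, S]                          # sampled columns of K[x]
    W = K[np.ix_(S, S)]                  # corresponding principal submatrix
    return C @ np.linalg.pinv(W) @ C.T   # rank-<=n approximation of K[x]

# K_n = nystrom_approximation(K, n=3)
# print(np.linalg.norm(K - K_n, 2))      # spectral-norm approximation error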

7.3 Algorithms Based on Convex Formulations of the Problem. Even when the functional to be minimized is convex and defined on a convex set, once suboptimal solutions in $\mathrm{span}_n D$ are searched for, the corresponding optimization problem may not be convex anymore. Then one may consider a related convex optimization problem with sparse optimal solutions, for which efficient convex optimization algorithms can be exploited. In linear regression, for example, adding an upper bound on the ℓ1-norm of the coefficients instead of their ℓ2-norm (or an ℓ1 penalization term instead of an ℓ2 one) is known to enforce the sparseness of the solution. This is the least absolute shrinkage and selection operator (LASSO) problem (Tibshirani, 1996), for which kernel versions have been proposed (Roth, 2004). In Efron, Hastie, Johnstone, and Tibshirani (2004), an algorithm well suited to LASSO was proposed, which shows how the degree of sparseness of its solution is controlled by varying the regularization parameter. Another computationally promising technique to solve LASSO is based on the operator-splitting approach studied in Hale, Yin, and Zhang (2008). Some limitations of LASSO have been overcome by its extension called the elastic net (Zou & Hastie, 2005; De Mol, De Vito, & Rosasco, 2009), where the ℓ1 and ℓ2 penalization terms are simultaneously present. Zou and Hastie (2005) showed that the elastic net can be considered as a LASSO on an extended artificial data set, so that the algorithms from Efron et al. (2004) can still be applied.
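As a concrete instance of the operator-splitting idea mentioned above, the following sketch applies ISTA (iterative soft-thresholding) to a kernel LASSO of the form $\min_c \frac{1}{2m}\|K c - y\|_2^2 + \mu\|c\|_1$; this particular objective, the value of μ, and the iteration count are illustrative assumptions, not choices made in this letter.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def kernel_lasso_ista(K, y, mu, n_iter=500):
    m = len(y)
    step = m / (np.linalg.eigvalsh(K)[-1] ** 2)     # 1 / Lipschitz constant of the smooth part
    c = np.zeros(m)
    for _ in range(n_iter):
        grad = K.T @ (K @ c - y) / m                # gradient of (1/2m)||Kc - y||^2
        c = soft_threshold(c - step * grad, step * mu)
    return c                                        # sparse coefficient vector

# c_sparse = kernel_lasso_ista(K, y, mu=0.05)
# print(np.count_nonzero(np.abs(c_sparse) > 1e-8))  # number of selected dictionary elements

Increasing μ shrinks more coefficients to zero, which is one simple way to trade empirical error against the sparseness n of the resulting suboptimal solution.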

Appendix: SLT Bounds in Terms of Rademacher’s Complexity

For a loss function V and a family $\mathcal{F}$ of functions f : X → ℝ, typically SLT bounds have either the two-sided form

$$P\Big\{\sup_{f \in \mathcal{F}} |\mathcal{E}_V(f) - \mathcal{E}_{z,V}(f)| \le \delta(m, q, \tau)\Big\} \ge 1 - \tau$$

or the one-sided form

$$P\Big\{\sup_{f \in \mathcal{F}} \big(\mathcal{E}_V(f) - \mathcal{E}_{z,V}(f)\big) \le \delta(m, q, \tau)\Big\} \ge 1 - \tau,$$

where P denotes the a priori probability with respect to the random draw of the i.i.d. data sample z of size m, τ ∈ (0, 1), δ(m, q, τ) is a suitable nonnegative function, and the parameter q measures the complexity of the family $\mathcal{F}$ (e.g., its Rademacher's complexity, its covering numbers, its VC dimension). We are interested in SLT bounds formulated in terms of the Rademacher's complexity of $\mathcal{F}$, defined, for a probability measure $\rho_X$ generating the i.i.d. input data sample $x = (x_1, \ldots, x_m) \in X^m$ and m independent, uniformly distributed, and {±1}-valued random variables $\sigma = \{\sigma_1, \ldots, \sigma_m\}$, as

$$R_m(\mathcal{F}) = E_x E_\sigma\Big\{\sup_{f \in \mathcal{F}} \Big|\frac{2}{m}\sum_{i=1}^m \sigma_i f(x_i)\Big|\ \Big|\ x_1, \ldots, x_m\Big\} \qquad (A.1)$$

(see Bartlett & Mendelson, 2002; Shawe-Taylor & Cristianini, 2004; a slightly different definition is given in Mendelson, 2003). For a given $x \in X^m$, the empirical Rademacher's complexity of $\mathcal{F}$ is defined as

$$\hat{R}_m(\mathcal{F}) = E_\sigma\Big\{\sup_{f \in \mathcal{F}} \Big|\frac{2}{m}\sum_{i=1}^m \sigma_i f(x_i)\Big|\ \Big|\ x_1, \ldots, x_m\Big\}. \qquad (A.2)$$

Thus,

$$R_m(\mathcal{F}) = E_x\{\hat{R}_m(\mathcal{F})\}. \qquad (A.3)$$
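For the ball $B_R(\|\cdot\|_K)$ in an RKHS, the supremum in equation A.2 has the standard closed form $(2R/m)\sqrt{\sigma^T K[x]\,\sigma}$, by the reproducing property, so the empirical Rademacher's complexity can be estimated by Monte Carlo over σ. The sketch below does this; the number of draws is an arbitrary assumption, and K is the Gram matrix from the earlier sketches.

import numpy as np

def empirical_rademacher_ball(K, R, n_draws=2000, rng=np.random.default_rng(0)):
    m = K.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    quad = np.einsum('ij,jk,ik->i', sigma, K, sigma)        # sigma^T K[x] sigma for each draw
    return (2.0 * R / m * np.sqrt(np.maximum(quad, 0.0))).mean()

# Jensen's inequality gives the trace bound that enters theorem 6 below:
# empirical_rademacher_ball(K, R=1.0) <= 2.0 * 1.0 / len(K) * np.sqrt(np.trace(K))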

The following result is a restatement of Bartlett and Mendelson (2002, theorem 8) and Shawe-Taylor and Cristianini (2004, theorem 4.9).

Theorem 5. Let L > 0, $V : \mathbb{R}^2 \to [0, L]$ a loss function, $\mathcal{F}$ a family of functions f : X → ℝ, z an i.i.d. data sample of size m, and $V \circ \mathcal{F} = \{(x, y) \mapsto V(f(x), y) - V(0, y) : f \in \mathcal{F}\}$. Then for every positive integer m and every τ ∈ (0, 1),

$$P\Big\{\sup_{f \in \mathcal{F}} \frac{\mathcal{E}_V(f) - \mathcal{E}_{z,V}(f)}{L} \le \frac{1}{L}\, R_m(V \circ \mathcal{F}) + \sqrt{\frac{\ln(2/\tau)}{2m}} \le \frac{1}{L}\, \hat{R}_m(V \circ \mathcal{F}) + 3\sqrt{\frac{\ln(2/\tau)}{2m}}\Big\} \ge 1 - \tau, \qquad (A.4)$$

where P denotes the a priori probability with respect to the random draw of the data.

When V is the square-loss function and $\mathcal{F}$ is a bounded subset of a RKHS $H_K(X)$, theorem 5 gives the following corollary, which is a restatement of Shawe-Taylor and Cristianini (2004, theorem 7.39).5 Its proof exploits several structural results on Rademacher's complexity and its empirical version (see Bartlett & Mendelson, 2002, theorem 12, or theorem 4.15) and a probabilistic upper bound on the empirical Rademacher's complexity of the ball $B_R(\|\cdot\|_K) = \{f \in H_K(X) : \|f\|_K \le R\}$ of radius R in the RKHS $H_K(X)$ (see Bartlett & Mendelson, 2002, lemma 22, or Shawe-Taylor & Cristianini, 2004, theorem 4.12):

Theorem 6. Let X be a nonempty set, $K : X \times X \to \mathbb{R}$ a psd kernel, $s_K = \sup_{x \in X}\sqrt{K(x,x)} < +\infty$, $0 < R_1 \le R_2$, z an i.i.d. data sample of size m, and assume that the support of the marginal probability measure $\rho_Y$ associated with the random variable y is a subset of $[-s_K R_2, s_K R_2]$. Then for every positive integer m and every τ ∈ (0, 1),

$$P\Big\{\sup_{f \in B_{R_1}(\|\cdot\|_K)} \big(\mathcal{E}(f) - \mathcal{E}_z(f)\big) \le \frac{16 s_K R_2}{m}\Big(R_1\sqrt{\mathrm{tr}(K[x])} + \|y\|_2\Big) + 12 (s_K R_2)^2 \sqrt{\frac{\ln(2/\tau)}{2m}}\Big\} \ge 1 - \tau,$$

where P denotes the a priori probability with respect to the random draw of the data.

5 In the statement from Shawe-Taylor and Cristianini (2004, theorem 7.39), there is only one constant R > 0 instead of the two positive constants $R_1$, $R_2$; that is, one takes $R_1 = R_2 = R > 0$ in theorem 6. Inspection of the proof (see Shawe-Taylor & Cristianini, 2004, pp. 231–232) shows that one can restate the theorem in terms of two different constants $R_1$ and $R_2$, provided that $0 < R_1 \le R_2$.

The bound in theorem 6 contains a data-dependent term $\big(R_1\sqrt{\mathrm{tr}(K[x])} + \|y\|_2\big)$, which disappears in the following slight variation of the estimate:

Theorem 7. Under the same assumptions of theorem 6, if X is compact and K is continuous, then

$$P\Big\{\sup_{f \in B_{R_1}(\|\cdot\|_K)} \big(\mathcal{E}(f) - \mathcal{E}_z(f)\big) \le \frac{16 s_K^2 R_2}{\sqrt{m}}\big(R_1 + R_2\big) + 4 (s_K R_2)^2 \sqrt{\frac{\ln(2/\tau)}{2m}}\Big\} \ge 1 - \tau.$$

Proof. In the proof of Shawe-Taylor and Cristianini (2004, theorem 7.39), the probabilistic bound $\sup_{f \in \mathcal{F}} \frac{\mathcal{E}_V(f) - \mathcal{E}_{z,V}(f)}{L} \le \frac{1}{L}\hat{R}_m(V \circ \mathcal{F}) + 3\sqrt{\frac{\ln(2/\tau)}{2m}}$ from equation A.4 is exploited. Using instead the bound $\sup_{f \in \mathcal{F}} \frac{\mathcal{E}_V(f) - \mathcal{E}_{z,V}(f)}{L} \le \frac{1}{L} R_m(V \circ \mathcal{F}) + \sqrt{\frac{\ln(2/\tau)}{2m}}$, again from equation A.4, together with equation A.2, and then proceeding as in the proof of Shawe-Taylor and Cristianini (2004, theorem 7.39), we replace the term $\frac{2 R_1\sqrt{\mathrm{tr}(K[x])}}{m}$ by $\frac{2 R_1\sqrt{\mathrm{tr}(T_K)}}{\sqrt{m}}$ (as in Bartlett & Mendelson, 2002, section 4.3), where $T_K$ is the integral operator defined as $T_K(f) = \int_X K(x, t) f(t)\, d\rho_X(t)$. By the corollary to the Mercer theorem stated in Cucker and Smale (2001), we get $\mathrm{tr}(T_K) = \int_X K(t, t)\, d\rho_X(t) \le s_K^2$. Similarly, by equation A.2 and the assumption on the support of $\rho_Y$, the term $\|y\|_2/m$ is replaced by $s_K R_2/\sqrt{m}$.

Acknowledgments

We are grateful to an anonymous reviewer for comments and suggestions that allowed us to improve the original manuscript. We were partially supported by the grant Progetti di Ricerca di Ateneo 2008, from the University of Genova, for the project Solution of Functional Optimization Problems by Nonlinear Approximators and Learning from Data.

References

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. of AMS, 68, 337–404.
Bachmann, G., & Narici, L. (2000). Functional analysis. New York: Dover.
Bartlett, P. L., & Mendelson, S. (2002). Rademacher and gaussian complexities: Risk bounds and structural results. J. Machine Learning Research, 3, 463–482.
Berg, C., Christensen, J. P. R., & Ressel, P. (1984). Harmonic analysis on semigroups. New York: Springer.


Bertero, M. (1989). Linear inverse and ill-posed problems. Advances in Electronics and Electron Physics, 75, 1–120.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. J. Machine Learning Research, 2, 499–526.
Cucker, F., & Smale, S. (2001). On the mathematical foundations of learning. Bulletin of AMS, 39, 1–49.
Cucker, F., & Smale, S. (2002). Best choices for regularization parameters in learning theory: On the bias-variance problem. Foundations of Computational Mathematics, 2, 413–428.
Das, A., & Kempe, D. (2008). Algorithms for subset selection in linear regression. In Proc. 40th Annual ACM Symposium on Theory of Computing (pp. 45–54). New York: ACM.
Davis, G., Mallat, S., & Avellaneda, M. (1997). Adaptive greedy approximations. Constructive Approximation, 13, 57–98.
De Mol, C., De Vito, E., & Rosasco, L. (2009). Elastic-net regularization in learning theory. J. Complexity, 25, 201–230.
De Vito, E., Rosasco, L., Caponnetto, A., De Giovannini, U., & Odone, F. (2005). Learning from examples as an inverse problem. J. Machine Learning Research, 6, 883–904.
Dontchev, A. L., & Zolezzi, T. (1993). Well-posed optimization problems. Berlin: Springer.
Drineas, P., & Mahoney, M. W. (2005). On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. J. Machine Learning Research, 6, 2153–2175.
Efron, B., Hastie, T., Johnstone, L., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407–499.
Engl, H. W., Hanke, M., & Neubauer, A. (2000). Regularization of inverse problems. Norwood, MA: Kluwer.
Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13, 1–50.
Friedman, A. (1982). Foundations of modern analysis. New York: Dover.
Gilbert, A. C., Muthukrishnan, S., & Strauss, M. J. (2003). Approximation of functions over redundant dictionaries using coherence. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 243–252). Philadelphia: SIAM.
Girosi, F. (1994). Regularization theory, radial basis functions and networks. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From statistics to neural networks: Theory and pattern recognition applications (pp. 166–187). Berlin: Springer.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10, 1455–1480.
Gribonval, R., & Vandergheynst, P. (2006). On the exponential convergence of matching pursuits in quasi-incoherent dictionaries. IEEE Trans. on Information Theory, 52, 255–261.
Groetch, C. W. (1977). Generalized inverses of linear operators. New York: Dekker.
Hale, E. T., Yin, W., & Zhang, Y. (2008). Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM J. Optimization, 19, 1107–1130.
Honeine, P., Richard, C., & Bermudez, J. C. M. (2007). On-line nonlinear sparse approximation of functions. In Proc. IEEE Int. Symposium on Information Theory (pp. 956–960). Piscataway, NJ: IEEE Press.


Hussain, Z., & Shawe-Taylor, J. (2009). Theory of matching pursuit. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21. Cambridge, MA: MIT Press.
Kolmogorov, A. N., & Fomin, S. V. (1975). Introductory real analysis. New York: Dover.
Kurkova, V. (1997). Dimension-independent rates of approximation by neural networks. In K. Warwick & M. Karny (Eds.), Computer-intensive methods in control and signal processing: The curse of dimensionality (pp. 261–270). Boston: Birkhauser.
Kurkova, V. (2004). Learning from data as an inverse problem. In J. Antoch (Ed.), Proc. Int. Conf. on Computational Statistics (pp. 1377–1384). Heidelberg: Physica/Springer.
Kurkova, V. (2005). Neural network learning as an inverse problem. Logic Journal of IGPL, 13, 551–559.
Kurkova, V., & Neruda, R. (1994). Uniqueness of functional representations by gaussian basis function networks. In Proc. ICANN'94 (pp. 471–474). New York: Springer.
Kurkova, V., & Sanguineti, M. (2001). Bounds on rates of variable-basis and neural-network approximation. IEEE Trans. on Information Theory, 47, 2659–2665.
Kurkova, V., & Sanguineti, M. (2002). Comparison of worst case errors in linear and neural network approximation. IEEE Trans. on Information Theory, 48, 264–275.
Kurkova, V., & Sanguineti, M. (2005a). Error estimates for approximate optimization by the extended Ritz method. SIAM J. Optimization, 15, 461–487.
Kurkova, V., & Sanguineti, M. (2005b). Learning with generalization capability by kernel methods of bounded complexity. J. Complexity, 21, 350–367.
Limaye, B. V. (1996). Functional analysis. New Delhi: New Age Publishers.
Makovoz, Y. (1996). Random approximants and neural networks. J. Approximation Theory, 85, 98–109.
Mendelson, S. (2003). A few notes on statistical learning theory. In S. Mendelson & A. Smola (Eds.), Lecture notes in computer science (pp. 1–40). New York: Springer.
Miller, K. (1970). Least squares methods for ill-posed problems with a prescribed bound. SIAM J. Mathematical Analysis, 1, 52–74.
Mukherjee, S., Rifkin, R., & Poggio, T. (2002). Regression and classification with regularization. In D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, & B. Yu (Eds.), Lecture notes in statistics: Nonlinear estimation and classification (Proc. MSRI Workshop) (pp. 107–124). New York: Springer.
Ortega, J. M. (1990). Numerical analysis: A second course. Philadelphia: SIAM.
Rauch, J. (1991). Partial differential equations. New York: Springer.
Roth, V. (2004). The generalized Lasso. IEEE Trans. on Neural Networks, 15, 16–28.
Scholkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Serre, D. (2002). Matrices: Theory and applications. New York: Springer.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Trans. on Information Theory, 44, 1926–1940.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.


Smola, A. J., & Scholkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th Int. Conf. on Machine Learning (pp. 911–918). San Francisco: Morgan Kaufmann.
Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. J. Royal Statistical Society, Series B, 58, 267–288.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington, DC: Winston.
Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Trans. on Information Theory, 50, 2231–2242.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Vasin, V. V. (1970). Relationship of several variational methods for the approximate solution of ill-posed problems. Mathematical Notes, 7, 161–165.
Vincent, P., & Bengio, Y. (2002). Kernel matching pursuit. Machine Learning, 48, 165–187.
Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.
Zhang, T. (2002a). Approximation bounds for some sparse kernel regression algorithms. Neural Computation, 14, 3013–3042.
Zhang, T. (2002b). A general greedy approximation algorithm with applications. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Zhang, T. (2003). Sequential greedy approximation for certain convex optimization problems. IEEE Trans. on Information Theory, 49, 682–691.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Royal Statistical Society B, 67, 301–320.

Received May 16, 2008; accepted June 23, 2009.