
Hindawi Publishing Corporation, Abstract and Applied Analysis, Volume 2013, Article ID 540725, 11 pages. http://dx.doi.org/10.1155/2013/540725

Research Article
Kernel Sliced Inverse Regression: Regularization and Consistency

Qiang Wu,1 Feng Liang,2 and Sayan Mukherjee3

1 Department of Mathematical Sciences, Middle Tennessee State University, Murfreesboro, TN 37130, USA
2 Department of Statistics, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
3 Departments of Statistical Science, Mathematics, and Computer Science, Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA

Correspondence should be addressed to Qiang Wu; qwu@mtsu.edu

Received 3 May 2013; Accepted 14 June 2013

Academic Editor: Yiming Ying

Copyright © 2013 Qiang Wu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Kernel sliced inverse regression (KSIR) is a natural framework for nonlinear dimension reduction using the mapping induced by kernels. However, there are numeric, algorithmic, and conceptual subtleties in making the method robust and consistent. We apply two types of regularization in this framework to address computational stability and generalization performance. We also provide an interpretation of the algorithm and prove consistency. The utility of this approach is illustrated on simulated and real data.

1. Introduction

The goal of dimension reduction in the standard regression/classification setting is to summarize the information in the $p$-dimensional predictor variable $X$ relevant to predicting the univariate response variable $Y$. The summary $S(X)$ should have $d \ll p$ variates and ideally should satisfy the following conditional independence property:

$$Y \perp X \mid S(X). \qquad (1)$$
Thus, any inference of $Y$ involves only the summary statistic $S(X)$, which is of much lower dimension than the original data $X$.

Linear methods for dimension reduction focus on linear summaries of the data, $S(X) = (\beta_1^T X, \ldots, \beta_d^T X)$. The $d$-dimensional subspace $\mathcal{S} = \mathrm{span}(\beta_1, \ldots, \beta_d)$ is defined as the effective dimension reduction (e.d.r.) space in [1], since $\mathcal{S}$ summarizes all the predictive information on $Y$. A key result in [1] is that, under some mild conditions, the e.d.r. directions $\{\beta_j\}_{j=1}^{d}$ correspond to the eigenvectors of the matrix
$$T = [\mathrm{cov}(X)]^{-1}\,\mathrm{cov}[\mathbb{E}(X \mid Y)]. \qquad (2)$$
Thus, the e.d.r. directions or subspace can be estimated via an eigenanalysis of the matrix $T$, which is the foundation of the sliced inverse regression (SIR) algorithm proposed in [1, 2]. Further developments include sliced average variance estimation (SAVE) [3] and principal Hessian directions (PHDs) [4]. The aforementioned algorithms cannot be applied in high-dimensional settings, where the number of covariates $p$ is greater than the number of observations $n$, since the sample covariance matrix is singular. Recently, an extension of SIR has been proposed in [5], which can handle the case $p > n$ based on the idea of partial least squares.
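In sample form, the eigenanalysis in (2) becomes: estimate the covariance of $X$, slice the response, average the predictors within each slice, form the between-slice covariance of the slice means, and solve a generalized eigenproblem. The following is a minimal Python/NumPy sketch of this SIR estimator; the function name `sir_directions` and the equal-frequency slicing are our own choices, not prescribed by the paper.

```python
import numpy as np
from scipy.linalg import eigh

def sir_directions(X, y, n_slices=10, n_directions=2):
    """Minimal SIR sketch: eigenvectors of cov(X)^{-1} cov(E[X|Y]), cf. (2)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # center predictors
    Sigma = Xc.T @ Xc / n                        # sample covariance
    # slice the response into roughly equal-frequency bins
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    labels = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_slices - 1)
    Gamma = np.zeros((p, p))                     # between-slice covariance of slice means
    for h in range(n_slices):
        idx = labels == h
        if idx.sum() == 0:
            continue
        m = Xc[idx].mean(axis=0)
        Gamma += (idx.sum() / n) * np.outer(m, m)
    # generalized eigenproblem Gamma b = lambda Sigma b, largest eigenvalues first
    evals, evecs = eigh(Gamma, Sigma)
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:n_directions]]
```

As the text notes, this sketch presumes $n > p$ so that the sample covariance is nonsingular; the kernelized and regularized versions below remove that restriction.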

A common premise held in high-dimensional data analysis is that the intrinsic structure of the data is in fact low dimensional; for example, the data is concentrated on a manifold. Linear methods such as SIR often fail to capture this nonlinear low-dimensional structure. However, there may exist a nonlinear embedding of the data into a Hilbert space where a linear method can capture the low-dimensional structure. The basic idea in applying kernel methods is the application of a linear algorithm to the data mapped into a feature space induced by a kernel function. If projections onto this low-dimensional structure can be computed by inner products in this Hilbert space, the so-called kernel trick [6, 7] can be used to obtain simple and efficient algorithms. Since the embedding is nonlinear, linear directions in the feature space correspond to nonlinear directions in the original data


space. Nonlinear extensions of some classical linear dimension reduction methods using this approach include kernel principal component analysis [6], kernel Fisher discriminant analysis [8], and kernel independent component analysis [9]. This idea was applied to SIR in [10, 11], resulting in the kernel sliced inverse regression (KSIR) method, which allows for the estimation of nonlinear e.d.r. directions.

There are numeric, algorithmic, and conceptual subtleties to a direct application of this kernel idea to SIR, although it looks quite natural at first glance. In KSIR the $p$-dimensional data are projected into a Hilbert space $\mathcal{H}$ through a feature map $\phi: \mathcal{X} \to \mathcal{H}$, and the nonlinear features are supposed to be recovered by the eigenfunctions of the following operator:

$$T = [\mathrm{cov}(\phi(X))]^{-1}\,\mathrm{cov}[\mathbb{E}(\phi(X) \mid Y)]. \qquad (3)$$

However, this operator $T$ is actually not well defined in general, especially when $\mathcal{H}$ is infinite dimensional and the covariance operator $\mathrm{cov}(\phi(X))$ is not invertible. In addition, the key utility of representation theorems in kernel methods is that optimization in a possibly infinite dimensional Hilbert space $\mathcal{H}$ reduces to solving a finite dimensional optimization problem. In the KSIR algorithm developed in [10, 11] the representer theorem has been implicitly used and seems to work well in empirical studies. However, it is not theoretically justifiable, since $T$ is estimated empirically via observed data. Moreover, the computation of eigenfunctions of an empirical estimate of $T$ from observations is often ill-conditioned and results in computational instability. In [11] a low rank approximation of the kernel matrix is used to overcome this instability and to reduce computational complexity. In this paper our aim is to clarify the theoretical subtleties in KSIR and to motivate two types of regularization schemes to overcome the computational difficulties arising in KSIR. Consistency is proven, and the practical advantages of regularization are demonstrated via empirical experiments.

2. Mercer Kernels and Nonlinear e.d.r. Directions

The extension of SIR to use kernels is based on properties of reproducing kernel Hilbert spaces (RKHSs) and in particular Mercer kernels [12].

Given predictor variables $X \in \mathcal{X} \subseteq \mathbb{R}^p$, a Mercer kernel is a continuous, positive semidefinite function $k(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following spectral decomposition:
$$k(x, z) = \sum_{j} \lambda_j \phi_j(x)\phi_j(z), \qquad (4)$$
where $\phi_j$ are the eigenfunctions and $\lambda_j$ are the corresponding nonnegative, nonincreasing eigenvalues. An important property of Mercer kernels is that each kernel $k$ uniquely corresponds to an RKHS as follows:
$$\mathcal{H} = \Big\{ f \;\Big|\; f(x) = \sum_{j \in \Lambda} a_j \phi_j(x) \text{ with } \sum_{j \in \Lambda} \frac{a_j^2}{\lambda_j} < \infty \Big\}, \qquad (5)$$
where the cardinality of $\Lambda = \{ j : \lambda_j > 0 \}$ is the dimension of the RKHS, which may be infinite [12, 13].

Given a Mercer kernel, there exists a unique map or embedding $\phi$ from $\mathcal{X}$ to a Hilbert space defined by the eigenvalues and eigenfunctions of the kernel. The map takes the following form:
$$\phi(x) = \big( \sqrt{\lambda_1}\,\phi_1(x),\; \sqrt{\lambda_2}\,\phi_2(x),\; \ldots,\; \sqrt{\lambda_{|\Lambda|}}\,\phi_{|\Lambda|}(x) \big). \qquad (6)$$
The Hilbert space induced by this map with the standard inner product $k(x, z) = \langle \phi(x), \phi(z) \rangle$ is isomorphic to the RKHS (5), and we will denote both Hilbert spaces as $\mathcal{H}$ [13]. In the case where $k$ is infinite dimensional, $\phi: \mathcal{X} \to \ell^2$.

The random variable $X \in \mathcal{X}$ induces a random element $\phi(X)$ in the RKHS. Throughout this paper we will use Hilbert space valued random variables, so we now recall some basic facts. Let $Z$ be a random element in $\mathcal{H}$ with $\mathbb{E}\|Z\| < \infty$, where $\|\cdot\|$ denotes the norm in $\mathcal{H}$ induced by its inner product $\langle \cdot, \cdot \rangle$. The expectation $\mathbb{E}(Z)$ is defined to be the element in $\mathcal{H}$ satisfying $\langle a, \mathbb{E}(Z) \rangle = \mathbb{E}\langle a, Z \rangle$ for all $a \in \mathcal{H}$. If $\mathbb{E}\|Z\|^2 < \infty$, then the covariance operator of $Z$ is defined as $\mathbb{E}[(Z - \mathbb{E}Z) \otimes (Z - \mathbb{E}Z)]$, where
$$(a \otimes b) f = \langle b, f \rangle\, a \quad \text{for any } f \in \mathcal{H}. \qquad (7)$$
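In a finite dimensional feature space the rank-one operator in (7) is simply an outer product acting on a vector, which is how it appears in the code sketches later in the paper. A two-line NumPy illustration of the identity (our own, purely illustrative):

```python
import numpy as np

a, b, f = np.array([1.0, 2.0]), np.array([0.5, -1.0]), np.array([3.0, 4.0])
# (a ⊗ b) f = <b, f> a, cf. (7); the operator a ⊗ b is the matrix outer(a, b)
assert np.allclose(np.outer(a, b) @ f, np.dot(b, f) * a)
```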

Let $\mathcal{P}$ denote the measure for the random variable $X$. Throughout we assume the following conditions.

Assumption 1.
(1) For all $x \in \mathcal{X}$, $k(x, \cdot)$ is $\mathcal{P}$-measurable.
(2) There exists $M > 0$ such that $k(X, X) \le M$ (a.s.) with respect to $\mathcal{P}$.

Under Assumption 1 the random element $\phi(X)$ has a well-defined mean and a covariance operator because $\|\phi(x)\|^2 = k(x, x)$ is bounded (a.s.). Without loss of generality we assume $\mathbb{E}\phi(X) = 0$, where $0$ is the zero element in $\mathcal{H}$. The boundedness also implies that the covariance operator $\Sigma = \mathbb{E}[\phi(X) \otimes \phi(X)]$ is compact and has the following spectral decomposition:
$$\Sigma = \sum_{i=1}^{\infty} w_i\, e_i \otimes e_i, \qquad (8)$$
where $w_i$ and $e_i \in \mathcal{H}$ are the eigenvalues and eigenfunctions, respectively.

We assume the following model for the relationship between $Y$ and $X$:
$$Y = F\big( \langle \beta_1, \phi(X) \rangle, \ldots, \langle \beta_d, \phi(X) \rangle, \varepsilon \big), \qquad (9)$$
with $\beta_j \in \mathcal{H}$ and the distribution of $\varepsilon$ independent of $X$. This model implies that the response variable $Y$ depends on $X$ only through a $d$-dimensional summary statistic
$$S(X) = \big( \langle \beta_1, \phi(X) \rangle, \ldots, \langle \beta_d, \phi(X) \rangle \big). \qquad (10)$$
Although $S(X)$ is a linear summary statistic in $\mathcal{H}$, it extracts nonlinear features in the space of the original predictor


variables $X$. We call $\{\beta_j\}_{j=1}^{d}$ the nonlinear e.d.r. directions and $\mathcal{S} = \mathrm{span}(\beta_1, \ldots, \beta_d)$ the nonlinear e.d.r. space. The following proposition [10] extends the theoretical foundation of SIR to this nonlinear setting.

Proposition 2. Assume the following linear design condition for $\mathcal{H}$: for any $f \in \mathcal{H}$ there exists a vector $b \in \mathbb{R}^d$ such that
$$\mathbb{E}[\langle f, \phi(X) \rangle \mid S(X)] = b^T S(X), \quad \text{with } S(X) = \big( \langle \beta_1, \phi(X) \rangle, \ldots, \langle \beta_d, \phi(X) \rangle \big)^T. \qquad (11)$$
Then, for the model specified in (9), the inverse regression curve $\mathbb{E}[\phi(X) \mid Y]$ is contained in the span of $(\Sigma\beta_1, \ldots, \Sigma\beta_d)$, where $\Sigma$ is the covariance operator of $\phi(X)$.

Proposition 2 is a straightforward extension of the multivariate case in [1] to a Hilbert space, or a direct application of the functional SIR setting in [14]. Although the linear design condition (11) may be difficult to check in practice, it has been shown that such a condition usually holds approximately in a high-dimensional space [15]. This conforms to the argument in [11] that linearity in a reproducing kernel Hilbert space is less strict than linearity in the Euclidean space. Moreover, it is pointed out in [1, 10] that, even if the linear design condition is violated, the bias of the SIR variates is usually not large.

An immediate consequence of Proposition 2 is that the nonlinear e.d.r. directions are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigendecomposition problem:
$$\Gamma\beta = \lambda\Sigma\beta, \quad \text{where } \Sigma = \mathrm{cov}[\phi(X)], \;\; \Gamma = \mathrm{cov}[\mathbb{E}(\phi(X) \mid Y)], \qquad (12)$$
or, equivalently, from an eigenanalysis of the operator $T = \Sigma^{-1}\Gamma$. In the infinite dimensional case a technical difficulty arises since the operator
$$\Sigma^{-1} = \sum_{i=1}^{\infty} w_i^{-1}\, e_i \otimes e_i \qquad (13)$$
is not defined on the entire Hilbert space $\mathcal{H}$. So, for the operator $T$ to be well defined, we need to show that the range of $\Gamma$ is indeed in the range of $\Sigma$. A similar issue also arose in the analysis of dimension reduction and canonical analysis for functional data [16, 17]. In these analyses extra conditions are needed for operators like $T$ to be well defined. In KSIR this issue is resolved automatically by the linear design condition, and extra conditions are not required, as stated by the following theorem; see Appendix A for the proof.

Theorem 3. Under Assumption 1 and the linear design condition (11), the following hold:

(i) the operator $\Gamma$ is of finite rank $d_\Gamma \le d$. Consequently, it is compact and has the following spectral decomposition:
$$\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i, \qquad (14)$$
where $\tau_i$ and $u_i$ are the eigenvalues and eigenvectors, respectively. Moreover, $u_i \in \mathrm{range}(\Sigma)$ for all $i = 1, \ldots, d_\Gamma$;

(ii) the eigendecomposition problem (12) is equivalent to the eigenanalysis of the operator $T$, which takes the following form:
$$T = \sum_{i=1}^{d_\Gamma} \tau_i\, \Sigma^{-1}(u_i) \otimes u_i. \qquad (15)$$

3. Regularized Kernel Sliced Inverse Regression

The discussion in Section 2 implies that the nonlinear e.d.r. directions can be retrieved by applying the original SIR algorithm in the feature space induced by the Mercer kernel. There are some computational challenges to this idea, such as estimating an infinite dimensional covariance operator and the fact that the feature map is often difficult or impossible to compute for many kernels. We address these issues by working with inner products of the feature map and adding a regularization term to kernel SIR.

3.1. Estimating the Nonlinear e.d.r. Directions. Given $n$ observations $(x_1, y_1), \ldots, (x_n, y_n)$, our objective is to obtain an estimate of the e.d.r. directions $(\beta_1, \ldots, \beta_d)$. We first formulate a procedure almost identical to the standard SIR procedure except that it operates in the feature space $\mathcal{H}$. This highlights the immediate relation between the SIR and KSIR procedures.

(1) Without loss of generality, we assume that the mapped predictor variables are mean zero, that is, $\sum_{i=1}^{n} \phi(x_i) = 0$; otherwise we can subtract $\bar\phi = (1/n)\sum_{i=1}^{n} \phi(x_i)$ from $\phi(x_i)$. The sample covariance is estimated by
$$\hat\Sigma = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \otimes \phi(x_i). \qquad (16)$$

(2) Bin the $Y$ variables into $H$ slices $G_1, \ldots, G_H$ and compute the mean vectors of the corresponding mapped predictor variables for each slice:
$$\hat\psi_h = \frac{1}{n_h} \sum_{i \in G_h} \phi(x_i), \quad h = 1, \ldots, H. \qquad (17)$$
Compute the sample between-group covariance matrix
$$\hat\Gamma = \sum_{h=1}^{H} \frac{n_h}{n}\, \hat\psi_h \otimes \hat\psi_h. \qquad (18)$$


(3) Estimate the SIR directions $\beta_j$ by solving the generalized eigendecomposition problem
$$\hat\Gamma\beta = \lambda\hat\Sigma\beta. \qquad (19)$$

This procedure is computationally impossible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (9) requires not the e.d.r. directions but the projections onto these directions, that is, the $d$ summary statistics
$$v_1 = \langle \beta_1, \phi(x) \rangle, \; \ldots, \; v_d = \langle \beta_d, \phi(x) \rangle, \qquad (20)$$
which we call the KSIR variates. Like other kernel algorithms, the kernel trick enables the KSIR variates to be efficiently computed from only the kernel $k$, not the map $\phi$.

The key quantity in this alternative formulation is the centered Gram matrix $K$ defined by the kernel function $k(\cdot, \cdot)$, where
$$K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j). \qquad (21)$$

Note that the rank of $K$ is less than $n$, so $K$ is always singular. Given the centered Gram matrix $K$, the following generalized eigendecomposition problem can be used to compute the KSIR variates:
$$KJKc = \lambda K^2 c, \qquad (22)$$
where $c$ denotes the $n$-dimensional generalized eigenvector and $J$ denotes an $n \times n$ matrix with $J_{ij} = 1/n_m$ if $i, j$ are in the $m$th group consisting of $n_m$ observations, and zero otherwise. The following proposition establishes the equivalence between the two eigendecomposition problems (22) and (19) in the recovery of the KSIR variates $(v_1, \ldots, v_d)$.

Proposition 4. Given the observations $(x_1, y_1), \ldots, (x_n, y_n)$, let $(\beta_1, \ldots, \beta_d)$ and $(c_1, \ldots, c_d)$ denote the generalized eigenvectors of (19) and (22), respectively. Then for any $x \in \mathcal{X}$ and $j = 1, \ldots, d$ the following holds:
$$v_j = \langle \beta_j, \phi(x) \rangle = c_j^T K_x, \quad K_x = \big( k(x, x_1), \ldots, k(x, x_n) \big)^T, \qquad (23)$$
provided that $\hat\Sigma$ is invertible. When $\hat\Sigma$ is not invertible, the conclusion holds modulo the null space of $\hat\Sigma$.

This result was proven in [10], in which the algorithm was further reduced to solving
$$JKc = \lambda Kc \qquad (24)$$
by canceling $K$ from both sides of (22).
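In matrix form the computation behind (21)-(24) is: center the Gram matrix, build the slice-membership matrix $J$, solve the generalized eigenproblem, and evaluate the variates via (23). A minimal NumPy/SciPy sketch follows; the helper names (`centered_gram`, `slice_matrix`, `ksir`) and the use of `scipy.linalg.eig` for the generalized problem are our own choices, and the sketch omits the centering correction needed when projecting new points.

```python
import numpy as np
from scipy.linalg import eig

def centered_gram(X, kernel):
    """Gram matrix of the centered feature vectors, cf. (21)."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return H @ K @ H

def slice_matrix(labels):
    """J with J_ij = 1/n_m if i and j share slice m, else 0, cf. (22)."""
    labels = np.asarray(labels)
    J = np.zeros((len(labels), len(labels)))
    for m in np.unique(labels):
        idx = np.where(labels == m)[0]
        J[np.ix_(idx, idx)] = 1.0 / len(idx)
    return J

def ksir(K, J, n_directions=2):
    """Solve J K c = lambda K c, cf. (24).  K is singular, so this solve is
    ill-conditioned -- exactly the instability that Section 3.2 regularizes away."""
    evals, evecs = eig(J @ K, K)
    evals = np.real(evals)
    evals[~np.isfinite(evals)] = -np.inf         # drop spurious infinite eigenvalues
    order = np.argsort(evals)[::-1]
    return np.real(evecs[:, order[:n_directions]])
```

A new point $x$ is then projected through (23) as `v = C.T @ K_x`, where `K_x[i] = kernel(x, x_i)`; for brevity the sketch ignores the centering of `K_x` that a full implementation would apply.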

Remark 5. It is important to remark that, when $\hat\Sigma$ is not invertible, Proposition 4 states that the equivalence between (19) and (22) holds modulo the null space of $\hat\Sigma$; that is, it requires the representer theorem, namely that the $\beta_j$ are linear combinations of the $\phi(x_i)$. Without this mandated condition, each eigenvector of (22) produces an eigenvector of (19), while an eigenvector of (19) is not necessarily recovered by (22).

Remark 6. It is necessary to clarify the difference between (12) and (19). In (12) it is natural to assume that $\beta$ is orthogonal to the null space of $\Sigma$. To see this, let $\beta = \beta_0 + \beta^\star$ with $\beta_0$ belonging to the null space and $\beta^\star$ orthogonal to the null space. Then
$$\mathbb{E}[\langle \beta, \phi(X) \rangle^2] = \langle \beta, \Sigma\beta \rangle = \langle \beta^\star, \Sigma\beta^\star \rangle = \mathbb{E}[\langle \beta^\star, \phi(X) \rangle^2], \qquad (25)$$
that is, $\langle \beta, \phi(X) \rangle = \langle \beta^\star, \phi(X) \rangle$ (a.s.). It means that $\beta_0$ does not contribute to the KSIR variates and thus can be set to $0$. However, in (19), if $\beta$ is an eigenvector and $\beta^\star$ is its component orthogonal to the null space of $\hat\Sigma$, the identity $\langle \beta, \phi(x) \rangle = \langle \beta^\star, \phi(x) \rangle$ is in general not true for a new point $x$ different from the observations. Thus, from a theoretical perspective, it is not as natural to assume the representer theorem, although it works well in practice. In this sense the KSIR algorithm based on (22) or (24) does not have a thorough mathematical foundation.

3.2. Regularization and Stability. Apart from the theoretical subtleties, in applications with relatively small samples the eigendecomposition in (22) is often ill-conditioned, resulting in overfitting as well as numerically unstable estimates of the e.d.r. space. This can be addressed either by thresholding eigenvalues of the estimated covariance operator $\hat\Sigma$ or by adding a regularization term to (22) or (24).

We motivate two types of regularization schemes. The first one is the traditional ridge regularization. It is used in both linear SIR and functional SIR [18-20] and solves the eigendecomposition problem
$$\hat\Gamma\beta = \lambda(\hat\Sigma + sI)\beta. \qquad (26)$$
Here and in the sequel $I$ denotes the identity operator and $s$ is a tuning parameter. Assuming the representer theorem, its kernel form is given as
$$JKc = \lambda(K + nsI)c. \qquad (27)$$

Another type of regularization is to regularize (22) directly:
$$KJKc = \lambda(K^2 + n^2 sI)c. \qquad (28)$$
The following proposition, whose proof is in Appendix B, states that solving the generalized eigendecomposition problem (28) is equivalent to finding the eigenvectors of
$$(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma. \qquad (29)$$

Proposition 7. Let $c_j$ be the eigenvectors of (28) and let $\beta_j$ be the eigenvectors of (29). Then the following holds for the regularized KSIR variates:
$$v_j = c_j^T K_x = \langle \beta_j, \phi(x) \rangle, \qquad (30)$$
with $K_x = \big( k(x, x_1), \ldots, k(x, x_n) \big)^T$ as in (23).
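The regularized problems (27) and (28) differ from (24) only in the matrix on the right-hand side of the generalized eigenproblem, so the corresponding sketches (again our own, in the same hypothetical style as above) change only the second argument passed to the eigensolver:

```python
import numpy as np
from scipy.linalg import eig

def ksir_ridge(K, J, s, n_directions=2):
    """Ridge-regularized KSIR: J K c = lambda (K + n s I) c, cf. (27)."""
    n = K.shape[0]
    evals, evecs = eig(J @ K, K + n * s * np.eye(n))
    order = np.argsort(-np.real(evals))
    return np.real(evecs[:, order[:n_directions]])

def ksir_tikhonov(K, J, s, n_directions=2):
    """Tikhonov-regularized KSIR: K J K c = lambda (K^2 + n^2 s I) c, cf. (28)."""
    n = K.shape[0]
    evals, evecs = eig(K @ J @ K, K @ K + n**2 * s * np.eye(n))
    order = np.argsort(-np.real(evals))
    return np.real(evecs[:, order[:n_directions]])
```

For $s > 0$ the right-hand side matrices are positive definite, so the spurious infinite eigenvalues of the unregularized solve disappear.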


This algorithm is termed the Tikhonov regularization. For linear SIR it is shown in [21] that the Tikhonov regularization is more efficient than the ridge regularization.

Besides improving computational stability, regularization also makes the matrix forms (27) and (28) of KSIR interpretable, since it justifies the use of the representer theorem.

Proposition 8. For both the ridge and the Tikhonov regularization schemes of KSIR, the eigenfunctions $\beta_j$ are linear combinations of $\phi(x_i)$, $i = 1, \ldots, n$.

The conclusion follows from the observation that (26) gives $\beta_j = (1/(s\lambda_j))(\hat\Gamma\beta_j - \lambda_j\hat\Sigma\beta_j)$ for the ridge regularization and (28)-(29) give $\beta_j = (1/(s\lambda_j))(\hat\Sigma\hat\Gamma\beta_j - \lambda_j\hat\Sigma^2\beta_j)$ for the Tikhonov regularization; in both cases the right-hand side lies in the span of the $\phi(x_i)$.

To close, we remark that KSIR is computationally advantageous even for the case of linear models when $p \gg n$, since the eigendecomposition problem involves $n \times n$ matrices rather than the $p \times p$ matrices in the standard SIR formulation.

3.3. Consistency of Regularized KSIR. In this subsection we prove the asymptotic consistency of the e.d.r. directions estimated by regularized KSIR and provide conditions under which the rate of convergence is $O_p(n^{-1/4})$. An important observation from the proof is that the rate of convergence of the e.d.r. directions depends on the contribution of the small principal components. The rate can be arbitrarily slow if the e.d.r. space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.

Note that various consistency results are available for linear SIR [22-24]. These results hold only for the finite dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14] a thresholding method is considered which selects a finite number of eigenvectors and uses results from finite rank operators. Their proof of consistency requires stronger and more complicated conditions than ours. The consistency of functional SIR with ridge regularization is proven in [19], but it is of a weaker form than our result. We remark that the consistency results for functional SIR can be improved using an argument similar to the one in this paper.

In the following we state the consistency result for the Tikhonov regularization. A similar result can be proved for the ridge regularization; the details are omitted.

Theorem 9. Assume $\mathbb{E}\,k(X, X)^2 < \infty$, $\lim_{n\to\infty} s(n) = 0$, and $\lim_{n\to\infty} s\sqrt{n} = \infty$. Then
$$\big| \langle \hat\beta_j, \phi(\cdot) \rangle - \langle \beta_j, \phi(\cdot) \rangle \big| = o_p(1), \quad j = 1, \ldots, d_\Gamma, \qquad (31)$$
where $d_\Gamma$ is the rank of $\Gamma$, $\langle \beta_j, \phi(\cdot) \rangle$ is the projection onto the $j$th e.d.r. direction, and $\langle \hat\beta_j, \phi(\cdot) \rangle$ is the projection onto the $j$th e.d.r. direction as estimated by regularized KSIR.

If the e.d.r. directions $\{\beta_j\}_{j=1}^{d_\Gamma}$ depend only on a finite number of eigenvectors of the covariance operator $\Sigma$, the rate of convergence is $O(n^{-1/4})$.

This theorem is a direct corollary of the following theorem, which is proven in Appendix C.

Theorem 10. Define the projection operator and its complement for each $N \ge 1$:
$$\Pi_N = \sum_{i=1}^{N} e_i \otimes e_i, \qquad \Pi_N^{\perp} = I - \Pi_N = \sum_{i=N+1}^{\infty} e_i \otimes e_i, \qquad (32)$$
where $\{e_i\}_{i=1}^{\infty}$ are the eigenvectors of the covariance operator $\Sigma$ as defined in (8), with the corresponding eigenvalues denoted by $w_i$.

Assume $\mathbb{E}\,k(X, X)^2 < \infty$. For each $N \ge 1$ the following holds:
$$\big\| (\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma - T \big\|_{HS} = O_p\Big( \frac{1}{s\sqrt{n}} \Big) + \sum_{j=1}^{d_\Gamma} \tau_j \Big( \frac{s}{w_N^2}\, \| \Pi_N(\xi_j) \| + \| \Pi_N^{\perp}(\xi_j) \| \Big), \qquad (33)$$
where $\xi_j = \Sigma^{-1}u_j$, $\{u_j\}_{j=1}^{d_\Gamma}$ are the eigenvectors of $\Gamma$ as defined in (14), and $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm of a linear operator.

If $s = s(n)$ satisfies $s \to 0$ and $s\sqrt{n} \to \infty$ as $n \to \infty$, then
$$\big\| (\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma - T \big\|_{HS} = o_p(1). \qquad (34)$$

4. Application to Simulated and Real Data

In this section we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons are used to address two questions: (1) does regularization improve the performance of kernel sliced inverse regression, and (2) does the nonlinearity of kernel sliced inverse regression improve the prediction accuracy?

We would like to remark that the assessment of nonlinear dimension reduction methods can be more difficult than that of linear ones. When the feature mapping $\phi$ for an RKHS is not available, we do not know the true e.d.r. directions or subspace. So, in that case, we will use the prediction accuracy to evaluate the goodness of RKSIR.

4.1. Importance of Nonlinearity and Regularization. Our first example illustrates that both the nonlinearity and the regularization of RKSIR can significantly improve prediction accuracy.

The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$, with each one following a normal distribution $X_i \sim N(0, 1)$. A univariate response is given as
$$Y = \big( \sin(X_1) + \sin(X_2) \big)\big( 1 + \sin(X_3) \big) + \varepsilon, \qquad (35)$$
with noise $\varepsilon \sim N(0, 0.1^2)$. We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to compute the e.d.r. directions.

[Figure 1. Dimension reduction for model (35): (a) mean square error for RKSIR with different regularization parameters $\log_{10}(s)$, for $d = 1$ and $d = 2$; (b) mean square error for various dimension reduction methods (RKSIR, SIR, SAVE, PHD) versus the number of e.d.r. directions, where the blue line represents the mean square error without using dimension reduction; (c)-(d) RKSIR variates versus the nonlinear factors $\sin(X_1) + \sin(X_2)$ and $1 + \sin(X_3)$.]

In RKSIR we used an additive Gaussian kernel,
$$k(x, z) = \sum_{j=1}^{p} \exp\Big( -\frac{(x_j - z_j)^2}{2\sigma^2} \Big), \qquad (36)$$
where $\sigma = 2$. After projecting the training samples onto the estimated e.d.r. directions, we train a Gaussian kernel regression model based on the new variates. The mean square error is then computed on 1000 independent test samples. This experiment is repeated 100 times, with all parameters set by cross-validation. The results are summarized in Figure 1.
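For concreteness, here is a sketch of the data generation for model (35) and of the additive Gaussian kernel (36), which can be plugged into the RKSIR sketches above; the sample sizes and $\sigma = 2$ follow the text, while the function names and the random seed are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Draw n observations from model (35) with noise N(0, 0.1^2)."""
    X = rng.standard_normal((n, 10))
    Y = (np.sin(X[:, 0]) + np.sin(X[:, 1])) * (1 + np.sin(X[:, 2])) \
        + 0.1 * rng.standard_normal(n)
    return X, Y

def additive_gaussian_kernel(x, z, sigma=2.0):
    """Additive Gaussian kernel (36): sum of coordinatewise Gaussian kernels."""
    return np.exp(-(x - z) ** 2 / (2 * sigma ** 2)).sum()

X_train, Y_train = simulate(100)   # 100 training samples, as in the experiment
X_test, Y_test = simulate(1000)    # 1000 independent test samples
```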

Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter: KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of the various dimension reduction methods.

RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear e.d.r. directions. This observation agrees with the model in (35). Indeed, Figures 1(c) and 1(d) show that the first two e.d.r. directions from RKSIR estimate the two nonlinear factors well.

4.2. Effect of Regularization. This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.


The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$ and a univariate response specified by
$$Y = X_1 + X_2^2 + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \qquad (37)$$
where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q\Delta Q^T$, with $Q$ a randomly chosen orthogonal matrix and $\Delta = \mathrm{diag}(1^\theta, 2^\theta, \ldots, 10^\theta)$. We will see that increasing the parameter $\theta \in [0, \infty)$ increases the anisotropy of the data, which increases the difficulty of identifying the correct e.d.r. directions.

For this model it is known that SIR will miss the direction along the second variable $X_2$. So we focus on the comparison of KSIR and RKSIR in this example.

If we use a second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$, which corresponds to the feature space
$$\Phi(X) = \big( 1, \; X_i, \; (X_i X_j)_{i \le j} \big), \quad i, j = 1, \ldots, 10, \qquad (38)$$
then $X_1 + X_2^2$ can be captured in one e.d.r. direction. Ideally the first KSIR variate $v = \langle \beta_1, \phi(X) \rangle$ should be equivalent to $X_1 + X_2^2$ modulo shift and scale:
$$v - \mathbb{E}v \propto X_1 + X_2^2 - \mathbb{E}\big( X_1 + X_2^2 \big). \qquad (39)$$

So, for this example, given estimates of the KSIR variates at the $n$ data points, $\{\hat v_i\}_{i=1}^{n} = \{\langle \hat\beta_1, \phi(x_i) \rangle\}_{i=1}^{n}$, the error of the first e.d.r. direction can be measured by the least squares fit of $\hat v$ with respect to $X_1 + X_2^2$:
$$\mathrm{error} = \min_{a, b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} \Big( \hat v_i - \big( a(x_{i1} + x_{i2}^2) + b \big) \Big)^2. \qquad (40)$$
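The error measure (40) is an ordinary least squares fit of the estimated first KSIR variate against the known factor $X_1 + X_2^2$. A small sketch, assuming `v_hat` holds the estimated variates at the sample points (our own helper, not from the paper):

```python
import numpy as np

def edr_error(v_hat, X):
    """Least squares fitting error (40) of the first KSIR variate against X1 + X2^2."""
    target = X[:, 0] + X[:, 1] ** 2
    A = np.column_stack([target, np.ones(len(target))])  # fit v ≈ a*(X1 + X2^2) + b
    coef, *_ = np.linalg.lstsq(A, v_hat, rcond=None)
    residual = v_hat - A @ coef
    return np.mean(residual ** 2)
```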

We drew 200 observations from the model specified in (37), and then we applied the two dimension reduction methods KSIR and RKSIR. The mean and standard errors over 100 repetitions of this procedure are reported in Figure 2. The result shows that KSIR becomes more and more unstable as $\theta$ increases and that the regularization helps to reduce this instability.

[Figure 2. Error in the e.d.r. direction as a function of $\theta$ for KSIR and RKSIR.]

4.3. Importance of Nonlinearity and Regularization in Real Data. When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For the case of multiclass classification it is natural to use SIR and consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example we show that this idea of adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set: the classification of handwritten digits.

The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist) contains 60,000 images of handwritten digits 0, 1, 2, ..., 9 as training data and 10,000 images as test data. Each image consists of $p = 28 \times 28 = 784$ gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.


We compared regularized SIR (RSIR) as in (26), KSIR, and RKSIR on these data to examine the effects of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 e.d.r. directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the e.d.r. directions and used a k-nearest neighbor (kNN) classifier with $k = 5$ to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR we used a Gaussian kernel with the bandwidth parameter set as the median pairwise distance between observations. The regularization parameter was set by cross-validation.
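For classification the slices are simply the class labels, so the pipeline is: compute the Gram matrix with the median-distance bandwidth, run regularized KSIR with one slice per class, project the test points, and classify with kNN. The sketch below follows that recipe on a tiny synthetic stand-in for the digit data, reusing the hypothetical `slice_matrix` and `ksir_tikhonov` helpers from Section 3; it omits the Gram-matrix centering and the test-point centering correction for brevity, and scikit-learn's `KNeighborsClassifier` stands in for the kNN step.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# tiny synthetic stand-in for the digit data (3 classes in 20 dimensions),
# only so that the sketch runs end to end
X_train = rng.standard_normal((90, 20)) + np.repeat(np.arange(3), 30)[:, None]
y_train = np.repeat(np.arange(3), 30)
X_test = rng.standard_normal((30, 20)) + np.repeat(np.arange(3), 10)[:, None]

def rbf_gram(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

D2 = ((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
sigma = np.median(np.sqrt(D2[D2 > 0]))           # median pairwise distance bandwidth

K = rbf_gram(X_train, X_train, sigma)
J = slice_matrix(y_train)                        # one slice per class
C = ksir_tikhonov(K, J, s=1e-3, n_directions=2)  # regularized KSIR eigenvectors c_j

V_train = K @ C                                  # KSIR variates, cf. (23)
V_test = rbf_gram(X_test, X_train, sigma) @ C    # centering correction omitted

knn = KNeighborsClassifier(n_neighbors=5).fit(V_train, y_train)
y_pred = knn.predict(V_test)
```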

The mean and standard deviation of the classification accuracy over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy more. This improvement is dramatic for the digits 2, 3, 5, and 8.

5. Discussion

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework. Therefore, the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.

Table 1. Mean and standard deviations for error rates in the classification of digits.

Digit      RKSIR              KSIR               RSIR               kNN
0          0.0273 (0.0089)    0.0472 (0.0191)    0.0487 (0.0128)    0.0291 (0.0071)
1          0.0150 (0.0049)    0.0177 (0.0051)    0.0292 (0.0113)    0.0052 (0.0012)
2          0.1039 (0.0207)    0.1475 (0.0497)    0.1921 (0.0238)    0.2008 (0.0186)
3          0.0845 (0.0208)    0.1279 (0.0494)    0.1723 (0.0283)    0.1092 (0.0130)
4          0.0784 (0.0240)    0.1044 (0.0461)    0.1327 (0.0327)    0.1617 (0.0213)
5          0.0877 (0.0209)    0.1327 (0.0540)    0.2146 (0.0294)    0.1419 (0.0193)
6          0.0472 (0.0108)    0.0804 (0.0383)    0.0816 (0.0172)    0.0446 (0.0081)
7          0.0887 (0.0169)    0.1119 (0.0357)    0.1354 (0.0172)    0.1140 (0.0125)
8          0.0981 (0.0259)    0.1490 (0.0699)    0.1981 (0.0286)    0.1140 (0.0156)
9          0.0774 (0.0251)    0.1095 (0.0398)    0.1533 (0.0212)    0.2006 (0.0153)
Average    0.0708 (0.0105)    0.1016 (0.0190)    0.1358 (0.0093)    0.1177 (0.0039)

RKHS has also been introduced into supervised dimension reduction in [26], where conditional covariance operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore, their method estimates linear e.d.r. subspaces, while in [10, 11] and in this paper the RKHS is used to model the e.d.r. subspace, which leads to nonlinear dimension reduction.

There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear e.d.r. directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well understood for linear dimension reduction, little is known for nonlinear dimension reduction. We leave them for future research.

There are some interesting connections between KSIR and functional SIR, which was developed by Ferré and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear e.d.r. directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS, or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR are different from the ones for functional SIR.

Appendices

A. Proof of Theorem 3

Under the assumption of Proposition 2, for each $Y = y$,
$$\mathbb{E}[\phi(X) \mid Y = y] \in \mathrm{span}\{ \Sigma\beta_i, \; i = 1, \ldots, d \}. \qquad (A1)$$
Since $\Gamma = \mathrm{cov}[\mathbb{E}(\phi(X) \mid Y)]$, the rank of $\Gamma$ (i.e., the dimension of the image of $\Gamma$) is at most $d$, which implies that $\Gamma$ is compact. Together with the fact that it is symmetric and positive semidefinite, there exist $d_\Gamma$ positive eigenvalues $\{\tau_i\}_{i=1}^{d_\Gamma}$ and eigenvectors $\{u_i\}_{i=1}^{d_\Gamma}$ such that $\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i$.

Recall that, for any $f \in \mathcal{H}$,
$$\Gamma f = \mathbb{E}\big[ \langle \mathbb{E}[\phi(X) \mid Y], f \rangle\, \mathbb{E}[\phi(X) \mid Y] \big] \qquad (A2)$$
also belongs to
$$\mathrm{span}\{ \Sigma\beta_i, \; i = 1, \ldots, d \} \subset \mathrm{range}(\Sigma) \qquad (A3)$$
because of (A1), so
$$u_i = \frac{1}{\tau_i}\, \Gamma u_i \in \mathrm{range}(\Sigma). \qquad (A4)$$
This proves (i).

Since for each $f \in \mathcal{H}$ we have $\Gamma f \in \mathrm{range}(\Sigma)$, the operator $T = \Sigma^{-1}\Gamma$ is well defined over the whole space. Moreover,
$$T f = \Sigma^{-1}\Big( \sum_{i=1}^{d_\Gamma} \tau_i \langle u_i, f \rangle u_i \Big) = \sum_{i=1}^{d_\Gamma} \tau_i \langle u_i, f \rangle\, \Sigma^{-1}(u_i) = \Big( \sum_{i=1}^{d_\Gamma} \tau_i\, \Sigma^{-1}(u_i) \otimes u_i \Big) f. \qquad (A5)$$
This proves (ii).

B. Proof of Proposition 7

We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, for which $d_K$ is infinite and a matrix form does not make sense.

Let $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$. It has the following SVD decomposition:
$$\Phi = \tilde U \tilde D \tilde V^T = [u_1 \cdots u_{d_K}] \begin{bmatrix} D_{d\times d} & 0_{d\times(n-d)} \\ 0_{(d_K-d)\times d} & 0_{(d_K-d)\times(n-d)} \end{bmatrix} \begin{bmatrix} \tilde v_1^T \\ \vdots \\ \tilde v_n^T \end{bmatrix} = U D V^T, \qquad (B1)$$
where $U = [u_1, \ldots, u_d]$, $V = [v_1, \ldots, v_d]$, and $D = D_{d\times d}$ is a diagonal matrix of dimension $d \le n$.

We need to show that the KSIR variates satisfy
$$v_j = c_j^T\big( k(x, x_1), \ldots, k(x, x_n) \big)^T = c_j^T \Phi^T \phi(x) = \langle \Phi c_j, \phi(x) \rangle = \langle \beta_j, \phi(x) \rangle. \qquad (B2)$$
It suffices to prove that, if $(\lambda, c)$ is a solution to (28), then $(\lambda, \beta)$ is also an eigenvalue-eigenvector pair of $(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$, and vice versa, where $c$ and $\beta$ are related by
$$\beta = \Phi c, \qquad c = V D^{-1} U^T \beta. \qquad (B3)$$
Noting the facts $\hat\Sigma = (1/n)\Phi\Phi^T$, $\hat\Gamma = (1/n)\Phi J\Phi^T$, and $K = \Phi^T\Phi = V D^2 V^T$, the argument may be made as follows:
$$\begin{aligned}
\hat\Sigma\hat\Gamma\beta = \lambda(\hat\Sigma^2 + sI)\beta
&\iff \Phi\Phi^T\Phi J\Phi^T\Phi c = \lambda\big( \Phi\Phi^T\Phi\Phi^T\Phi c + n^2 s\,\Phi c \big) \\
&\iff \Phi KJKc = \lambda\,\Phi\big( K^2 + n^2 sI \big)c \\
&\ \big(\iff V V^T KJKc = \lambda\, V V^T\big( K^2 + n^2 sI \big)c\big) \\
&\iff KJKc = \lambda\big( K^2 + n^2 sI \big)c.
\end{aligned} \qquad (B4)$$
Note that the implication in the third step is needed only in the $\Rightarrow$ direction; it is obtained by multiplying both sides by $V D^{-1} U^T$ and using the fact that $U^T U = I_d$. For the last step, since $V^T V = I_d$, we use the following facts:
$$V V^T K = V V^T V D^2 V^T = V D^2 V^T = K, \qquad V V^T c = V V^T V D^{-1} U^T \beta = V D^{-1} U^T \beta = c. \qquad (B5)$$

In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. For the infinite dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}$ defined by $\Phi v = \sum_{i=1}^{n} v_i \phi(x_i)$ for $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}$ to $\mathbb{R}^n$ such that $\Phi^T f = \big( \langle \phi(x_1), f \rangle, \ldots, \langle \phi(x_n), f \rangle \big)^T$ for $f \in \mathcal{H}$. The notions $U$ and $U^T$ are similarly defined.

The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\hat\Sigma$ as a covariance operator. Since the rank of $\hat\Sigma$ is less than $n$, it is compact and has the following representation:
$$\hat\Sigma = \sum_{i=1}^{d_K} \sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d} \sigma_i\, u_i \otimes u_i, \qquad (B6)$$
where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\mathrm{span}(u_1, \ldots, u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1, \ldots, u_d)$ should be considered as an operator from $\mathbb{R}^d$ to $\mathcal{H}$ and $\tau_i \in \mathbb{R}^d$. Denote $\Upsilon = (\tau_1, \ldots, \tau_n)^T \in \mathbb{R}^{n\times d}$. It is easy to check that $\Upsilon^T\Upsilon = \mathrm{diag}(n\sigma_1, \ldots, n\sigma_d)$. Let $D_{d\times d} = \mathrm{diag}(\sqrt{n\sigma_1}, \ldots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD for $\Phi$ as $\Phi = U D V^T$, which is well defined.
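Because the degree-2 polynomial kernel of Section 4.2 has an explicit finite feature map, the equivalence asserted by Proposition 7 (and proved above) can be checked numerically: the nonzero generalized eigenvalues of (28) must coincide with the nonzero eigenvalues of $(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$ built from the explicit features. A small verification sketch, entirely our own construction:

```python
import numpy as np
from scipy.linalg import eigvals

rng = np.random.default_rng(1)
n, s = 40, 1e-2
X = rng.standard_normal((n, 3))
labels = rng.integers(0, 4, size=n)                     # 4 slices

def phi(x):
    """Explicit feature map of k(x, z) = (1 + x'z)^2, so that <phi(x), phi(z)> = k(x, z)."""
    quad = np.outer(x, x)
    iu = np.triu_indices(len(x))
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return np.concatenate([[1.0], np.sqrt(2.0) * x, scale * quad[iu]])

F = np.array([phi(x) for x in X])
F -= F.mean(axis=0)                                      # center the mapped data
K = F @ F.T                                              # centered Gram matrix
J = np.zeros((n, n))                                     # slice matrix, cf. (22)
for m in np.unique(labels):
    idx = np.where(labels == m)[0]
    J[np.ix_(idx, idx)] = 1.0 / len(idx)

Sigma = F.T @ F / n                                      # sample covariance in feature coordinates
Gamma = F.T @ J @ F / n                                  # between-slice covariance, cf. (18)
A = np.linalg.solve(Sigma @ Sigma + s * np.eye(F.shape[1]), Sigma @ Gamma)

ev_matrix = np.sort(np.real(np.linalg.eigvals(A)))[::-1][:3]
ev_kernel = np.sort(np.real(eigvals(K @ J @ K, K @ K + n**2 * s * np.eye(n))))[::-1][:3]
assert np.allclose(ev_matrix, ev_kernel)                 # nonzero spectra agree
```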

C. Proof of Consistency

C.1. Preliminaries. In order to prove Theorems 9 and 10 we use properties of Hilbert-Schmidt operators, covariance operators for Hilbert space valued random variables, and perturbation theory for linear operators. In this subsection we provide a brief introduction to them; for details see [27-29] and the references therein.

Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if
$$\|L\|_{HS}^2 = \sum_{i=1}^{p_{\mathcal{H}}} \| L e_i \|_{\mathcal{H}}^2 < \infty, \qquad (C1)$$
where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{HS}$.

Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:
$$\|SL\|_{HS} \le \|S\|\,\|L\|_{HS}, \qquad \|LS\|_{HS} \le \|L\|_{HS}\,\|S\|, \qquad (C2)$$
where $\|\cdot\|$ denotes the usual operator norm
$$\|L\|^2 = \sup_{f \in \mathcal{H}} \frac{\|Lf\|^2}{\|f\|^2}. \qquad (C3)$$

Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^2 < \infty$. The covariance operator
$$\Sigma = \mathbb{E}[(Z - \mathbb{E}Z) \otimes (Z - \mathbb{E}Z)] \qquad (C4)$$
is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.

A well-known result from the perturbation theory of linear operators states that, if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ with the same rate of convergence as that of the operators.

C.2. Proof of Theorem 10. We will use the following result from [14]:
$$\| \hat\Sigma - \Sigma \|_{HS} = O_p\Big( \frac{1}{\sqrt{n}} \Big), \qquad \| \hat\Gamma - \Gamma \|_{HS} = O_p\Big( \frac{1}{\sqrt{n}} \Big). \qquad (C5)$$
To simplify the notation we denote $\hat T_s = (\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$. Also define
$$T_1 = (\hat\Sigma^2 + sI)^{-1}\Sigma\Gamma, \qquad T_2 = (\Sigma^2 + sI)^{-1}\Sigma\Gamma. \qquad (C6)$$
Then
$$\| \hat T_s - T \|_{HS} \le \| \hat T_s - T_1 \|_{HS} + \| T_1 - T_2 \|_{HS} + \| T_2 - T \|_{HS}. \qquad (C7)$$


For the first term, observe that
$$\| \hat T_s - T_1 \|_{HS} \le \big\| (\hat\Sigma^2 + sI)^{-1} \big\|\, \| \hat\Sigma\hat\Gamma - \Sigma\Gamma \|_{HS} = O_p\Big( \frac{1}{s\sqrt{n}} \Big). \qquad (C8)$$
For the second term, note that
$$T_1 = \sum_{j=1}^{d_\Gamma} \tau_j \big( (\hat\Sigma^2 + sI)^{-1}\Sigma u_j \big) \otimes u_j, \qquad T_2 = \sum_{j=1}^{d_\Gamma} \tau_j \big( (\Sigma^2 + sI)^{-1}\Sigma u_j \big) \otimes u_j. \qquad (C9)$$
Therefore,
$$\| T_1 - T_2 \|_{HS} \le \sum_{j=1}^{d_\Gamma} \tau_j \big\| \big( (\hat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1} \big)\Sigma u_j \big\|. \qquad (C10)$$
Since $u_j \in \mathrm{range}(\Sigma)$, there exists $\xi_j$ such that $u_j = \Sigma\xi_j$. Then
$$\big( (\hat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1} \big)\Sigma u_j = (\hat\Sigma^2 + sI)^{-1}\big( \Sigma^2 - \hat\Sigma^2 \big)(\Sigma^2 + sI)^{-1}\Sigma^2\xi_j, \qquad (C11)$$
which implies
$$\| T_1 - T_2 \|_{HS} \le \sum_{j=1}^{d_\Gamma} \tau_j \big\| (\hat\Sigma^2 + sI)^{-1} \big\|\, \big\| \Sigma^2 - \hat\Sigma^2 \big\|_{HS}\, \big\| (\Sigma^2 + sI)^{-1}\Sigma^2 \big\|\, \| \xi_j \| = O_p\Big( \frac{1}{s\sqrt{n}} \Big). \qquad (C12)$$

For the third term, the following holds:
$$\| T_2 - T \|_{HS}^2 = \sum_{j=1}^{d_\Gamma} \tau_j^2 \big\| \big( (\Sigma^2 + sI)^{-1}\Sigma - \Sigma^{-1} \big) u_j \big\|^2, \qquad (C13)$$
and for each $j = 1, \ldots, d_\Gamma$,
$$\begin{aligned}
\big\| (\Sigma^2 + sI)^{-1}\Sigma u_j - \Sigma^{-1}u_j \big\|
&= \big\| (\Sigma^2 + sI)^{-1}\Sigma^2\xi_j - \xi_j \big\| \\
&= \bigg\| \sum_{i=1}^{\infty} \Big( \frac{w_i^2}{s + w_i^2} - 1 \Big) \langle \xi_j, e_i \rangle e_i \bigg\| \\
&= \bigg( \sum_{i=1}^{\infty} \frac{s^2}{(s + w_i^2)^2} \langle \xi_j, e_i \rangle^2 \bigg)^{1/2} \\
&\le \frac{s}{w_N^2} \bigg( \sum_{i=1}^{N} \langle \xi_j, e_i \rangle^2 \bigg)^{1/2} + \bigg( \sum_{i=N+1}^{\infty} \langle \xi_j, e_i \rangle^2 \bigg)^{1/2} \\
&= \frac{s}{w_N^2}\, \| \Pi_N(\xi_j) \| + \| \Pi_N^{\perp}(\xi_j) \|.
\end{aligned} \qquad (C14)$$

Combining these terms results in (33).

Since $\| \Pi_N^{\perp}(\xi_j) \| \to 0$ as $N \to \infty$, we consequently have
$$\big\| \hat T_s - T \big\|_{HS} = o_p(1) \qquad (C15)$$
if $s \to 0$ and $s\sqrt{n} \to \infty$.

If all the e.d.r. directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N \ge 1$ such that the $u_j$ lie in $\mathcal{S}^\ast = \mathrm{span}\{ \Sigma e_i, \; i = 1, \ldots, N \}$. This implies that
$$\xi_j = \Sigma^{-1}u_j \in \Sigma^{-1}(\mathcal{S}^\ast) \subset \mathrm{span}\{ e_i, \; i = 1, \ldots, N \}. \qquad (C16)$$
Therefore $\Pi_N^{\perp}(\xi_j) = 0$. Letting $s = O(n^{-1/4})$, the rate is $O(n^{-1/4})$.

Acknowledgments

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50-GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or the NIH.

References

[1] K. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316-342, 1991.

[2] N. Duan and K. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505-530, 1991.

[3] R. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328-332, 1991.

[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025-1039, 1992.

[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615-625, 2007.

[6] B. Schölkopf, A. J. Smola, and K. Müller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583-588, Springer, Lausanne, Switzerland, October 1997.

[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.

[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385-2404, 2000.

[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1-48, 2002.

[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590-610, 2008.

[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590-1603, 2009.

[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441-458, pp. 415-446, 1909.

[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Birkhäuser, Basel, Switzerland, 1986.

[14] L. Ferré and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475-488, 2003.

[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867-889, 1993.

[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54-77, 2003.

[17] L. Ferré and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665-683, 2005.

[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169-4175, 2005.

[19] L. Ferré and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807-823, 2006.

[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124-131, 2008.

[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85-98, 2009.

[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040-1061, 1992.

[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141-2171, 1997.

[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727-736, 1995.

[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628-635, 2005.

[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871-1905, 2009.

[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.

[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.

[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259-294, 2007.


space. Nonlinear extensions of some classical linear dimension reduction methods using this approach include kernel principal component analysis [6], kernel Fisher discriminant analysis [8], and kernel independent component analysis [9]. This idea was applied to SIR in [10, 11], resulting in the kernel sliced inverse regression (KSIR) method, which allows for the estimation of nonlinear e.d.r. directions.

There are numeric, algorithmic, and conceptual subtleties to a direct application of this kernel idea to SIR, although it looks quite natural at first glance. In KSIR, the $p$-dimensional data are projected into a Hilbert space $\mathcal{H}$ through a feature map $\phi: \mathcal{X} \to \mathcal{H}$, and the nonlinear features are supposed to be recovered by the eigenfunctions of the following operator:

$T = [\operatorname{cov}(\phi(X))]^{-1}\operatorname{cov}[\mathbb{E}(\phi(X)\mid Y)].$  (3)

However, this operator $T$ is actually not well defined in general, especially when $\mathcal{H}$ is infinite dimensional and the covariance operator $\operatorname{cov}(\phi(X))$ is not invertible. In addition, the key utility of representer theorems in kernel methods is that optimization in a possibly infinite dimensional Hilbert space $\mathcal{H}$ reduces to solving a finite dimensional optimization problem. In the KSIR algorithm developed in [10, 11], the representer theorem has been implicitly used and seems to work well in empirical studies. However, it is not theoretically justifiable, since $T$ is estimated empirically via observed data. Moreover, the computation of eigenfunctions of an empirical estimate of $T$ from observations is often ill-conditioned and results in computational instability. In [11] a low-rank approximation of the kernel matrix is used to overcome this instability and to reduce computational complexity. In this paper, our aim is to clarify the theoretical subtleties in KSIR and to motivate two types of regularization schemes that overcome the computational difficulties arising in KSIR. Consistency is proven, and the practical advantages of regularization are demonstrated via empirical experiments.

2. Mercer Kernels and Nonlinear e.d.r. Directions

The extension of SIR to use kernels is based on properties of reproducing kernel Hilbert spaces (RKHSs) and, in particular, Mercer kernels [12].

Given predictor variables $X \in \mathcal{X} \subseteq \mathbb{R}^p$, a Mercer kernel is a continuous, positive semidefinite function $k(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following spectral decomposition:

$k(x,z) = \sum_{j} \lambda_j \phi_j(x)\phi_j(z),$  (4)

where $\phi_j$ are the eigenfunctions and $\lambda_j$ are the corresponding nonnegative, nonincreasing eigenvalues. An important property of Mercer kernels is that each kernel $k$ uniquely corresponds to an RKHS as follows:

$\mathcal{H} = \Bigl\{ f \;\Big|\; f(x) = \sum_{j\in\Lambda} a_j\phi_j(x) \text{ with } \sum_{j\in\Lambda} \frac{a_j^2}{\lambda_j} < \infty \Bigr\},$  (5)

where the cardinality of $\Lambda = \{j : \lambda_j > 0\}$ is the dimension of the RKHS, which may be infinite [12, 13].

Given a Mercer kernel, there exists a unique map, or embedding, $\phi$ from $\mathcal{X}$ to a Hilbert space defined by the eigenvalues and eigenfunctions of the kernel. The map takes the following form:

$\phi(x) = \bigl(\sqrt{\lambda_1}\phi_1(x), \sqrt{\lambda_2}\phi_2(x), \ldots, \sqrt{\lambda_{|\Lambda|}}\phi_{|\Lambda|}(x)\bigr).$  (6)

The Hilbert space induced by this map with the standard inner product, $k(x,z) = \langle\phi(x),\phi(z)\rangle$, is isomorphic to the RKHS (5), and we will denote both Hilbert spaces as $\mathcal{H}$ [13]. In the case where $k$ is infinite dimensional, $\phi: \mathcal{X} \to \ell^2$.
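To make the correspondence between the kernel and the feature map (6) concrete, the following small sketch (ours, not from the original text; NumPy assumed) checks the identity $k(x,z) = \langle\phi(x),\phi(z)\rangle$ numerically for a kernel whose feature map is finite and explicit, the homogeneous second-order polynomial kernel on $\mathbb{R}^2$.

```python
import numpy as np

def poly2_kernel(x, z):
    # Homogeneous degree-2 polynomial kernel on R^2.
    return float(np.dot(x, z)) ** 2

def poly2_features(x):
    # Explicit finite-dimensional feature map phi with <phi(x), phi(z)> = k(x, z).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)
# Both quantities agree up to floating point error.
print(poly2_kernel(x, z), np.dot(poly2_features(x), poly2_features(z)))
```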

The random variable $X \in \mathcal{X}$ induces a random element $\phi(X)$ in the RKHS. Throughout this paper we will use Hilbert space valued random variables, so we now recall some basic facts. Let $Z$ be a random element in $\mathcal{H}$ with $\mathbb{E}\|Z\| < \infty$, where $\|\cdot\|$ denotes the norm in $\mathcal{H}$ induced by its inner product $\langle\cdot,\cdot\rangle$. The expectation $\mathbb{E}(Z)$ is defined to be the element of $\mathcal{H}$ satisfying $\langle a, \mathbb{E}(Z)\rangle = \mathbb{E}\langle a, Z\rangle$ for all $a \in \mathcal{H}$. If $\mathbb{E}\|Z\|^2 < \infty$, then the covariance operator of $Z$ is defined as $\mathbb{E}[(Z - \mathbb{E}Z)\otimes(Z - \mathbb{E}Z)]$, where

$(a \otimes b)f = \langle b, f\rangle\, a \quad \text{for any } f \in \mathcal{H}.$  (7)
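In a finite-dimensional feature space the tensor product (7) is just an outer product and the covariance operator (8) is an average of outer products. A brief illustrative sketch (ours; a finite-dimensional analogue only, not the operator-valued construction itself):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((500, 3))   # 500 draws of a random vector in R^3
Z -= Z.mean(axis=0)                 # center empirically

# (a ⊗ b) f = <b, f> a reduces to the matrix np.outer(a, b) acting on f.
a, b, f = Z[0], Z[1], Z[2]
assert np.allclose(np.outer(a, b) @ f, np.dot(b, f) * a)

# Empirical covariance operator: average of z ⊗ z, cf. (8) and (16).
Sigma_hat = sum(np.outer(z, z) for z in Z) / len(Z)
w, e = np.linalg.eigh(Sigma_hat)    # eigenvalues w_i and eigenvectors e_i
```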

Let $\mathcal{P}$ denote the distribution of the random variable $X$. Throughout we assume the following conditions.

Assumption 1.
(1) For all $x \in \mathcal{X}$, $k(x,\cdot)$ is $\mathcal{P}$-measurable.
(2) There exists $M > 0$ such that $k(X,X) \le M$ (a.s.) with respect to $\mathcal{P}$.

Under Assumption 1, the random element $\phi(X)$ has a well-defined mean and covariance operator, because $\|\phi(x)\|^2 = k(x,x)$ is bounded (a.s.). Without loss of generality, we assume $\mathbb{E}\phi(X) = 0$, where $0$ is the zero element of $\mathcal{H}$. The boundedness also implies that the covariance operator $\Sigma = \mathbb{E}[\phi(X)\otimes\phi(X)]$ is compact and has the following spectral decomposition:

$\Sigma = \sum_{i=1}^{\infty} w_i\, e_i \otimes e_i,$  (8)

where $w_i$ and $e_i \in \mathcal{H}$ are the eigenvalues and eigenfunctions, respectively.

We assume the following model for the relationship between $Y$ and $X$:

$Y = F(\langle\beta_1,\phi(X)\rangle, \ldots, \langle\beta_d,\phi(X)\rangle, \varepsilon),$  (9)

with $\beta_j \in \mathcal{H}$ and the distribution of $\varepsilon$ independent of $X$. This model implies that the response variable $Y$ depends on $X$ only through a $d$-dimensional summary statistic,

$S(X) = (\langle\beta_1,\phi(X)\rangle, \ldots, \langle\beta_d,\phi(X)\rangle).$  (10)

Although $S(X)$ is a linear summary statistic in $\mathcal{H}$, it extracts nonlinear features in the space of the original predictor variables $X$. We call $\{\beta_j\}_{j=1}^{d}$ the nonlinear e.d.r. directions and $\mathcal{S} = \operatorname{span}(\beta_1, \ldots, \beta_d)$ the nonlinear e.d.r. space. The following proposition [10] extends the theoretical foundation of SIR to this nonlinear setting.

Proposition 2. Assume the following linear design condition for $\mathcal{H}$: for any $f \in \mathcal{H}$ there exists a vector $b \in \mathbb{R}^d$ such that

$\mathbb{E}[\langle f,\phi(X)\rangle \mid S(X)] = b^T S(X), \quad \text{with } S(X) = (\langle\beta_1,\phi(X)\rangle, \ldots, \langle\beta_d,\phi(X)\rangle)^T.$  (11)

Then, for the model specified in (9), the inverse regression curve $\mathbb{E}[\phi(X)\mid Y]$ is contained in the span of $(\Sigma\beta_1, \ldots, \Sigma\beta_d)$, where $\Sigma$ is the covariance operator of $\phi(X)$.

Proposition 2 is a straightforward extension of the multivariate case in [1] to a Hilbert space, or a direct application of the functional SIR setting in [14]. Although the linear design condition (11) may be difficult to check in practice, it has been shown that such a condition usually holds approximately in a high-dimensional space [15]. This conforms to the argument in [11] that linearity in a reproducing kernel Hilbert space is less strict than linearity in Euclidean space. Moreover, it is pointed out in [1, 10] that even if the linear design condition is violated, the bias of the SIR variates is usually not large.

An immediate consequence of Proposition 2 is that the nonlinear e.d.r. directions are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigendecomposition problem:

$\Gamma\beta = \lambda\Sigma\beta, \quad \text{where } \Sigma = \operatorname{cov}[\phi(X)],\ \Gamma = \operatorname{cov}[\mathbb{E}(\phi(X)\mid Y)],$  (12)

or, equivalently, from an eigenanalysis of the operator $T = \Sigma^{-1}\Gamma$. In the infinite dimensional case a technical difficulty arises, since the operator

$\Sigma^{-1} = \sum_{i=1}^{\infty} w_i^{-1}\, e_i \otimes e_i$  (13)

is not defined on the entire Hilbert space $\mathcal{H}$. So, for the operator $T$ to be well defined, we need to show that the range of $\Gamma$ is indeed contained in the range of $\Sigma$. A similar issue also arose in the analysis of dimension reduction and canonical analysis for functional data [16, 17]. In those analyses, extra conditions are needed for operators like $T$ to be well defined. In KSIR this issue is resolved automatically by the linear design condition, and extra conditions are not required, as stated by the following theorem; see Appendix A for the proof.

Theorem 3. Under Assumption 1 and the linear design condition (11), the following hold.

(i) The operator $\Gamma$ is of finite rank $d_\Gamma \le d$. Consequently, it is compact and has the spectral decomposition

$\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i,$  (14)

where $\tau_i$ and $u_i$ are the eigenvalues and eigenvectors, respectively. Moreover, $u_i \in \operatorname{range}(\Sigma)$ for all $i = 1, \ldots, d_\Gamma$.

(ii) The eigendecomposition problem (12) is equivalent to the eigenanalysis of the operator $T$, which takes the form

$T = \sum_{i=1}^{d_\Gamma} \tau_i\, \Sigma^{-1}(u_i) \otimes u_i.$  (15)

3. Regularized Kernel Sliced Inverse Regression

The discussion in Section 2 implies that nonlinear e.d.r. directions can be retrieved by applying the original SIR algorithm in the feature space induced by the Mercer kernel. There are some computational challenges to this idea, such as estimating an infinite dimensional covariance operator and the fact that the feature map is often difficult or impossible to compute for many kernels. We address these issues by working with inner products of the feature map and adding a regularization term to kernel SIR.

3.1. Estimating the Nonlinear e.d.r. Directions

Given $n$ observations $(x_1,y_1), \ldots, (x_n,y_n)$, our objective is to obtain an estimate of the e.d.r. directions $(\beta_1, \ldots, \beta_d)$. We first formulate a procedure almost identical to the standard SIR procedure, except that it operates in the feature space $\mathcal{H}$. This highlights the immediate relation between the SIR and KSIR procedures.

(1) Without loss of generality, we assume that the mapped predictor variables are mean zero, that is, $\sum_{i=1}^{n}\phi(x_i) = 0$; otherwise we can subtract $\bar\phi = (1/n)\sum_{i=1}^{n}\phi(x_i)$ from each $\phi(x_i)$. The sample covariance is estimated by

$\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) \otimes \phi(x_i).$  (16)

(2) Bin the $Y$ values into $H$ slices $G_1, \ldots, G_H$ and compute the mean vector of the mapped predictor variables for each slice,

$\psi_h = \frac{1}{n_h}\sum_{i\in G_h} \phi(x_i), \quad h = 1, \ldots, H.$  (17)

Compute the sample between-group covariance matrix

$\hat\Gamma = \sum_{h=1}^{H} \frac{n_h}{n}\, \psi_h \otimes \psi_h.$  (18)

(3) Estimate the SIR directions $\beta_j$ by solving the generalized eigendecomposition problem

$\hat\Gamma\beta = \lambda\hat\Sigma\beta.$  (19)

(A finite-dimensional sketch of these three steps is given below.)
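When the feature map is explicit and finite dimensional, steps (1)-(3) can be carried out literally. The following sketch (ours) does so for a user-supplied `feature_map`; the small `ridge` term is our addition to keep the generalized eigenproblem well posed and anticipates the regularization of Section 3.2.

```python
import numpy as np
from scipy.linalg import eigh

def sir_in_feature_space(X, y, feature_map, n_slices=10, ridge=1e-8):
    """SIR carried out on explicitly mapped features, cf. steps (1)-(3)."""
    Phi = np.array([feature_map(x) for x in X])      # n x q matrix of mapped points
    Phi = Phi - Phi.mean(axis=0)                     # step (1): center
    n, q = Phi.shape
    Sigma_hat = Phi.T @ Phi / n                      # (16)

    # Step (2): slice y and average the mapped predictors within each slice.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    Gamma_hat = np.zeros((q, q))
    for G in slices:
        psi = Phi[G].mean(axis=0)
        Gamma_hat += (len(G) / n) * np.outer(psi, psi)   # (18)

    # Step (3): generalized eigenproblem (19); a tiny ridge keeps Sigma_hat definite.
    evals, evecs = eigh(Gamma_hat, Sigma_hat + ridge * np.eye(q))
    return evals[::-1], evecs[:, ::-1]               # leading directions first
```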

This procedure is computationally impossible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (9) requires not the e.d.r. directions themselves but the projections onto these directions, that is, the $d$ summary statistics

$v_1 = \langle\beta_1, \phi(x)\rangle, \ldots, v_d = \langle\beta_d, \phi(x)\rangle,$  (20)

which we call the KSIR variates. Like other kernel algorithms, the kernel trick enables the KSIR variates to be computed efficiently from the kernel $k$ alone, not the map $\phi$.

The key quantity in this alternative formulation is the centered Gram matrix $K$ defined by the kernel function $k(\cdot,\cdot)$, where

$K_{ij} = \langle\phi(x_i), \phi(x_j)\rangle = k(x_i, x_j).$  (21)

Note that the rank of $K$ is less than $n$, so $K$ is always singular.

Given the centered Gram matrix $K$, the following generalized eigendecomposition problem can be used to compute the KSIR variates:

$KJKc = \lambda K^2 c,$  (22)

where $c$ denotes the $n$-dimensional generalized eigenvector and $J$ denotes the $n \times n$ matrix with $J_{ij} = 1/n_m$ if $i$ and $j$ are in the $m$th group, consisting of $n_m$ observations, and zero otherwise. The following proposition establishes the equivalence between the two eigendecomposition problems (22) and (19) in the recovery of the KSIR variates $(v_1, \ldots, v_d)$.

Proposition 4. Given the observations $(x_1,y_1), \ldots, (x_n,y_n)$, let $(\beta_1, \ldots, \beta_d)$ and $(c_1, \ldots, c_d)$ denote the generalized eigenvectors of (19) and (22), respectively. Then, for any $x \in \mathcal{X}$ and $j = 1, \ldots, d$, the following holds:

$v_j = \langle\beta_j, \phi(x)\rangle = c_j^T K_x, \qquad K_x = (k(x,x_1), \ldots, k(x,x_n))^T,$  (23)

provided that $\hat\Sigma$ is invertible. When $\hat\Sigma$ is not invertible, the conclusion holds modulo the null space of $\hat\Sigma$.

This result was proven in [10], in which the algorithm was further reduced to solving

$JKc = \lambda Kc$  (24)

by canceling $K$ from both sides of (22).
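For concreteness, here is a sketch (ours) of the two matrices entering (22) and (24): the Gram matrix centered in feature space and the slice matrix $J$ with $J_{ij} = 1/n_m$ when $i$ and $j$ fall in the same slice. The function names and the choice of centering via $H = I - \mathbf{1}\mathbf{1}^T/n$ are ours.

```python
import numpy as np

def centered_gram(X, kernel):
    """Gram matrix K_ij = k(x_i, x_j), centered in feature space: H K H with H = I - 11^T/n."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def slice_matrix(slice_labels):
    """J_ij = 1/n_m if observations i and j belong to the m-th slice, and 0 otherwise."""
    labels = np.asarray(slice_labels)
    J = np.zeros((len(labels), len(labels)))
    for m in np.unique(labels):
        idx = np.where(labels == m)[0]
        J[np.ix_(idx, idx)] = 1.0 / len(idx)
    return J

# With K and J in hand, (24) reads J K c = lambda K c; in practice this eigenproblem
# is ill-conditioned because K is singular, which motivates the regularized forms below.
```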

Remark 5. It is important to note that, when $\hat\Sigma$ is not invertible, Proposition 4 states that the equivalence between (19) and (22) holds modulo the null space of $\hat\Sigma$; that is, it requires the representer theorem, namely that the $\beta_j$ are linear combinations of the $\phi(x_i)$. Without this mandated condition, we will see that each eigenvector of (22) produces an eigenvector of (19), while an eigenvector of (19) is not necessarily recovered by (22).

Remark 6. It is necessary to clarify the difference between (12) and (19). In (12) it is natural to assume that $\beta$ is orthogonal to the null space of $\Sigma$. To see this, let $\beta = \beta_0 + \beta^{\star}$, with $\beta_0$ belonging to the null space and $\beta^{\star}$ orthogonal to the null space. Then

$\mathbb{E}[\langle\beta,\phi(X)\rangle^2] = \langle\beta, \Sigma\beta\rangle = \langle\beta^{\star}, \Sigma\beta^{\star}\rangle = \mathbb{E}[\langle\beta^{\star},\phi(X)\rangle^2],$  (25)

that is, $\langle\beta,\phi(X)\rangle = \langle\beta^{\star},\phi(X)\rangle$ (a.s.). It means that $\beta_0$ does not contribute to the KSIR variates and thus can be set to $0$. However, in (19), if $\beta$ is an eigenvector and $\beta^{\star}$ is its component orthogonal to the null space of $\hat\Sigma$, the identity $\langle\beta,\phi(x)\rangle = \langle\beta^{\star},\phi(x)\rangle$ is in general not true for a new point $x$ different from the observations. Thus, from a theoretical perspective, it is not as natural to assume the representer theorem, although it works well in practice. In this sense, the KSIR algorithm based on (22) or (24) does not have a thorough mathematical foundation.

3.2. Regularization and Stability

Beyond these theoretical subtleties, in applications with relatively small samples the eigendecomposition in (22) is often ill-conditioned, resulting in overfitting as well as numerically unstable estimates of the e.d.r. space. This can be addressed either by thresholding the eigenvalues of the estimated covariance $\hat\Sigma$ or by adding a regularization term to (22) or (24).

We motivate two types of regularization schemes. The first is the traditional ridge regularization, used in both linear SIR and functional SIR [18–20], which solves the eigendecomposition problem

$\hat\Gamma\beta = \lambda(\hat\Sigma + sI)\beta.$  (26)

Here and in the sequel, $I$ denotes the identity operator (or matrix) and $s$ is a tuning parameter. Assuming the representer theorem, its kernel form is given by

$JKc = \lambda(K + nsI)c.$  (27)

Another type of regularization is to regularize (22) directly:

$KJKc = \lambda(K^2 + n^2 sI)c.$  (28)

The following proposition, whose proof is in Appendix B, states that solving the generalized eigendecomposition problem (28) is equivalent to finding the eigenvectors of

$(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma.$  (29)

Proposition 7. Let $c_j$ be the eigenvectors of (28) and let $\beta_j$ be the eigenvectors of (29). Then the following holds for the regularized KSIR variates:

$v_j = c_j^T\,(k(x,x_1), \ldots, k(x,x_n))^T = \langle\beta_j, \phi(x)\rangle.$  (30)

This algorithm is termed Tikhonov regularization. For linear SIR, it is shown in [21] that Tikhonov regularization is more efficient than ridge regularization.
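A minimal NumPy/SciPy sketch (ours) of the Tikhonov form (28): it takes a centered Gram matrix and a slice matrix (built, e.g., as in the earlier sketch), solves the symmetric generalized eigenproblem, and evaluates the variates of new points through (30).

```python
import numpy as np
from scipy.linalg import eigh

def rksir_fit(K, J, s, d):
    """Solve K J K c = lambda (K^2 + n^2 s I) c, cf. (28); return the top-d eigenvectors c_j."""
    n = K.shape[0]
    A = K @ J @ K                           # symmetric positive semidefinite
    B = K @ K + (n ** 2) * s * np.eye(n)    # symmetric positive definite for s > 0
    evals, C = eigh(A, B)                   # eigenvalues in ascending order
    return C[:, ::-1][:, :d]                # columns c_1, ..., c_d

def rksir_variates(C, K_new):
    """KSIR variates of new points, cf. (30): rows of K_new are (k(x, x_1), ..., k(x, x_n)).
    Assumes K_new is built with the same centering convention as K."""
    return K_new @ C
```

By Proposition 8, the corresponding directions $\beta_j = \Phi c_j$ lie in the span of the mapped data, which is what makes this finite $n \times n$ computation exact.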

Besides computational stability, regularization also makes the matrix forms of KSIR, (27) and (28), interpretable through a justifiable representer theorem.

Proposition 8. For both the ridge and the Tikhonov regularization schemes of KSIR, the eigenfunctions $\beta_j$ are linear combinations of $\phi(x_i)$, $i = 1, \ldots, n$.

The conclusion follows from the observation that $\beta_j = (1/(s\lambda_j))(\hat\Gamma\beta_j - \lambda_j\hat\Sigma\beta_j)$ for the ridge regularization and $\beta_j = (1/(s\lambda_j))(\hat\Sigma\hat\Gamma\beta_j - \lambda_j\hat\Sigma^2\beta_j)$ for the Tikhonov regularization, and both right-hand sides lie in the span of $\{\phi(x_i)\}_{i=1}^{n}$.

To close, we remark that KSIR is computationally advantageous even for the case of linear models when $p \gg n$, due to the fact that the eigendecomposition problem involves $n \times n$ matrices rather than the $p \times p$ matrices of the standard SIR formulation.

3.3. Consistency of Regularized KSIR

In this subsection we prove the asymptotic consistency of the e.d.r. directions estimated by regularized KSIR and provide conditions under which the rate of convergence is $O_p(n^{-1/4})$. An important observation from the proof is that the rate of convergence of the e.d.r. directions depends on the contribution of the small principal components. The rate can be arbitrarily slow if the e.d.r. space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.

Note that various consistency results are available for linear SIR [22–24]. These results hold only for the finite dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14] a thresholding method is considered, which selects a finite number of eigenvectors and uses results from finite rank operators; the proof of consistency there requires stronger and more complicated conditions than ours. Consistency for functional SIR with ridge regularization is proven in [19], but in a weaker form than our result. We remark that the consistency results for functional SIR can be improved using an argument similar to the one in this paper.

In the following we state the consistency result for Tikhonov regularization. A similar result can be proved for ridge regularization; the details are omitted.

Theorem 9. Assume $\mathbb{E}[k(X,X)^2] < \infty$, $\lim_{n\to\infty} s(n) = 0$, and $\lim_{n\to\infty} s\sqrt{n} = \infty$. Then

$\bigl|\langle\hat\beta_j, \phi(\cdot)\rangle - \langle\beta_j, \phi(\cdot)\rangle\bigr| = o_p(1), \quad j = 1, \ldots, d_\Gamma,$  (31)

where $d_\Gamma$ is the rank of $\Gamma$, $\langle\beta_j,\phi(\cdot)\rangle$ is the projection onto the $j$th e.d.r. direction, and $\langle\hat\beta_j,\phi(\cdot)\rangle$ is the projection onto the $j$th e.d.r. direction as estimated by regularized KSIR.

If the e.d.r. directions $\{\beta_j\}_{j=1}^{d_\Gamma}$ depend only on a finite number of eigenvectors of the covariance operator $\Sigma$, the rate of convergence is $O(n^{-1/4})$.

This theorem is a direct corollary of the following theorem, which is proven in Appendix C.

Theorem 10. Define the projection operator and its complement, for each $N \ge 1$,

$\Pi_N = \sum_{i=1}^{N} e_i \otimes e_i, \qquad \Pi_N^{\perp} = I - \Pi_N = \sum_{i=N+1}^{\infty} e_i \otimes e_i,$  (32)

where $\{e_i\}_{i=1}^{\infty}$ are the eigenvectors of the covariance operator $\Sigma$ as defined in (8), with the corresponding eigenvalues denoted by $w_i$.

Assume $\mathbb{E}[k(X,X)^2] < \infty$. For each $N \ge 1$ the following holds:

$\bigl\|(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma - T\bigr\|_{HS} = O_p\!\left(\frac{1}{s\sqrt{n}}\right) + \sum_{j=1}^{d_\Gamma}\left(\frac{s}{w_N^2}\bigl\|\Pi_N(h_j)\bigr\| + \bigl\|\Pi_N^{\perp}(h_j)\bigr\|\right),$  (33)

where $h_j = \Sigma^{-1}u_j$, $\{u_j\}_{j=1}^{d_\Gamma}$ are the eigenvectors of $\Gamma$ as defined in (14), and $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm of a linear operator.

If $s = s(n)$ satisfies $s \to 0$ and $s\sqrt{n} \to \infty$ as $n \to \infty$, then

$\bigl\|(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma - T\bigr\|_{HS} = o_p(1).$  (34)

4. Application to Simulated and Real Data

In this section we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons are used to address two questions: (1) does regularization improve the performance of kernel sliced inverse regression, and (2) does the nonlinearity of kernel sliced inverse regression improve prediction accuracy?

We would like to remark that the assessment of nonlinear dimension reduction methods can be more difficult than that of linear ones. When the feature mapping $\phi$ for an RKHS is not available, we do not know the true e.d.r. directions or subspace, so in that case we use prediction accuracy to evaluate the goodness of RKSIR.

4.1. Importance of Nonlinearity and Regularization

Our first example illustrates that both the nonlinearity and the regularization of RKSIR can significantly improve prediction accuracy.

The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$, with each one following a normal distribution, $X_i \sim N(0,1)$. A univariate response is given as

$Y = (\sin(X_1) + \sin(X_2))(1 + \sin(X_3)) + \varepsilon,$  (35)

with noise $\varepsilon \sim N(0, 0.1^2)$. We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to

compute the e.d.r. directions. In RKSIR we used an additive Gaussian kernel,

$k(x,z) = \sum_{j=1}^{d} \exp\!\left(-\frac{(x_j - z_j)^2}{2\sigma^2}\right),$  (36)

where $\sigma = 2$. After projecting the training samples onto the estimated e.d.r. directions, we train a Gaussian kernel regression model based on the new variates. Then the mean square error is computed on 1000 independent test samples. This experiment is repeated 100 times, with all parameters set by cross-validation. The results are summarized in Figure 1.

Figure 1: Dimension reduction for model (35). (a) Mean square error for RKSIR ($d = 1$ and $d = 2$) as a function of the regularization parameter $\log_{10}(s)$. (b) Mean square error for various dimension reduction methods (RKSIR, SIR, SAVE, PHD) versus the number of e.d.r. directions; the blue line represents the mean square error without dimension reduction. (c)-(d) RKSIR variates for the first and second nonlinear e.d.r. directions versus the nonlinear factors $\sin(X_1)+\sin(X_2)$ and $1+\sin(X_3)$.
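To reproduce this experiment at the data level, a sketch (ours) that draws samples from model (35) and evaluates the additive Gaussian kernel (36) with $\sigma = 2$; the kernel regression and cross-validation steps are omitted.

```python
import numpy as np

def simulate_model_35(n, sigma_noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 10))                  # X_i ~ N(0, 1)
    y = (np.sin(X[:, 0]) + np.sin(X[:, 1])) * (1.0 + np.sin(X[:, 2]))
    y += sigma_noise * rng.standard_normal(n)         # epsilon ~ N(0, 0.1^2)
    return X, y

def additive_gaussian_kernel(x, z, sigma=2.0):
    # k(x, z) = sum_j exp(-(x_j - z_j)^2 / (2 sigma^2)), cf. (36).
    return float(np.sum(np.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))))

X_train, y_train = simulate_model_35(100)
# The resulting kernel can be plugged into the centered-Gram/RKSIR sketches above.
```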

Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter: KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of various dimension reduction methods.

RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear e.d.r. directions. This observation agrees with the model in (35). Indeed, Figures 1(c) and 1(d) show that the first two e.d.r. directions from RKSIR estimate the two nonlinear factors well.

4.2. Effect of Regularization

This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.

The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$ and a univariate response specified by

$Y = X_1 + X_2^2 + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2),$  (37)

where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q\Delta Q^T$, with $Q$ a randomly chosen orthogonal matrix and $\Delta = \operatorname{diag}(1^{\theta}, 2^{\theta}, \ldots, 10^{\theta})$. We will see that increasing the parameter $\theta \in [0,\infty)$ increases the anisotropy of the data, which increases the difficulty of identifying the correct e.d.r. directions.

For this model it is known that SIR will miss the direction along the second variable, $X_2$, so we focus on the comparison of KSIR and RKSIR in this example.

If we use a second-order polynomial kernel, $k(x,z) = (1 + x^Tz)^2$, which corresponds to the feature space

$\Phi(X) = \{1,\ X_i,\ (X_iX_j)_{i\le j}\}, \quad i, j = 1, \ldots, 10,$  (38)

then $X_1 + X_2^2$ can be captured in one e.d.r. direction. Ideally, the first KSIR variate $v = \langle\beta_1,\phi(X)\rangle$ should be equivalent to $X_1 + X_2^2$ modulo shift and scale:

$v - \mathbb{E}v \propto X_1 + X_2^2 - \mathbb{E}(X_1 + X_2^2).$  (39)

So, for this example, given the estimated KSIR variates at the $n$ data points, $\{\hat v_i\}_{i=1}^{n} = \{\langle\hat\beta_1,\phi(x_i)\rangle\}_{i=1}^{n}$, the error of the first e.d.r. direction can be measured by the least squares fit of $\hat v$ with respect to $X_1 + X_2^2$:

$\mathrm{error} = \min_{a,b\in\mathbb{R}} \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat v_i - (a(x_{i1} + x_{i2}^2) + b)\bigr)^2.$  (40)

We drew 200 observations from the model specified in (37) and then applied the two dimension reduction methods, KSIR and RKSIR. The means and standard errors over 100 repetitions of this procedure are reported in Figure 2. The result shows that KSIR becomes more and more unstable as $\theta$ increases, and regularization helps to reduce this instability.
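The anisotropic design and the error criterion (40) are straightforward to reproduce; in the sketch below (ours), `v_hat` stands for the first estimated KSIR variate at the training points.

```python
import numpy as np

def simulate_model_37(n, theta, seed=0):
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))   # random orthogonal matrix
    Delta = np.diag(np.arange(1, 11, dtype=float) ** theta)
    Sigma_X = Q @ Delta @ Q.T                            # covariance Q Delta Q^T
    X = rng.multivariate_normal(np.zeros(10), Sigma_X, size=n)
    y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
    return X, y

def edr_error(v_hat, X):
    # (40): least-squares fit of the first KSIR variate against X_1 + X_2^2.
    target = X[:, 0] + X[:, 1] ** 2
    A = np.column_stack([target, np.ones(len(target))])  # fit a * target + b
    coef, _, _, _ = np.linalg.lstsq(A, v_hat, rcond=None)
    return np.mean((v_hat - A @ coef) ** 2)
```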

4.3. Importance of Nonlinearity and Regularization in Real Data

When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For multiclass classification it is natural to use SIR and consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example we show that this idea of adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set: the classification of handwritten digits.

The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist/) contains 60,000 images of handwritten digits 0, 1, 2, ..., 9 as training data and 10,000 images as test data. Each image consists of $p = 28 \times 28 = 784$ gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.

Figure 2: Error in the estimated e.d.r. direction as a function of $\theta$, for KSIR and RKSIR.

We compared regularized SIR (RSIR) as in (26), KSIR, and RKSIR on these data to examine the effect of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 e.d.r. directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the e.d.r. directions and used a $k$-nearest neighbor (kNN) classifier with $k = 5$ to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR we used a Gaussian kernel with the bandwidth parameter set to the median pairwise distance between observations. The regularization parameter was set by cross-validation.
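A sketch (ours) of the evaluation protocol: the median-distance bandwidth heuristic, projection of training and test points onto the estimated variates via (30), and the 5-nearest-neighbor classifier. Scikit-learn is assumed to be available, and `C` denotes the matrix of eigenvectors $c_j$ from the regularized KSIR sketch above; centering of the Gram blocks is omitted for brevity.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from sklearn.neighbors import KNeighborsClassifier

def median_bandwidth(X):
    # Gaussian kernel bandwidth set to the median pairwise distance between observations.
    return float(np.median(pdist(X)))

def gaussian_gram(A, B, sigma):
    # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)).
    return np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * sigma ** 2))

def knn_error_after_rksir(X_train, y_train, X_test, y_test, C, sigma):
    V_train = gaussian_gram(X_train, X_train, sigma) @ C   # training variates, cf. (30)
    V_test = gaussian_gram(X_test, X_train, sigma) @ C     # test variates
    knn = KNeighborsClassifier(n_neighbors=5).fit(V_train, y_train)
    return 1.0 - knn.score(V_test, y_test)
```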

The mean and standard deviation of the classification error rates over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy further. This improvement is dramatic for digits 2, 3, 5, and 8.

5. Discussion

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework; therefore the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.

Table 1: Mean and standard deviations of the error rates in the classification of digits.

Digit     RKSIR             KSIR              RSIR              kNN
0         0.0273 (0.0089)   0.0472 (0.0191)   0.0487 (0.0128)   0.0291 (0.0071)
1         0.0150 (0.0049)   0.0177 (0.0051)   0.0292 (0.0113)   0.0052 (0.0012)
2         0.1039 (0.0207)   0.1475 (0.0497)   0.1921 (0.0238)   0.2008 (0.0186)
3         0.0845 (0.0208)   0.1279 (0.0494)   0.1723 (0.0283)   0.1092 (0.0130)
4         0.0784 (0.0240)   0.1044 (0.0461)   0.1327 (0.0327)   0.1617 (0.0213)
5         0.0877 (0.0209)   0.1327 (0.0540)   0.2146 (0.0294)   0.1419 (0.0193)
6         0.0472 (0.0108)   0.0804 (0.0383)   0.0816 (0.0172)   0.0446 (0.0081)
7         0.0887 (0.0169)   0.1119 (0.0357)   0.1354 (0.0172)   0.1140 (0.0125)
8         0.0981 (0.0259)   0.1490 (0.0699)   0.1981 (0.0286)   0.1140 (0.0156)
9         0.0774 (0.0251)   0.1095 (0.0398)   0.1533 (0.0212)   0.2006 (0.0153)
Average   0.0708 (0.0105)   0.1016 (0.0190)   0.1358 (0.0093)   0.1177 (0.0039)

An RKHS has also been introduced into supervised dimension reduction in [26], where conditional covariance operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore, their method estimates linear e.d.r. subspaces, while in [10, 11] and in this paper the RKHS is used to model the e.d.r. subspace, which leads to nonlinear dimension reduction.

There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear e.d.r. directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well understood in linear dimension reduction, little is known for nonlinear dimension reduction. We leave them for future research.

There are some interesting connections between KSIR and functional SIR, which was developed by Ferré and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear e.d.r. directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR are different from the ones for functional SIR.

Appendices

A. Proof of Theorem 3

Under the assumption of Proposition 2, for each $Y = y$,

$\mathbb{E}[\phi(X) \mid Y = y] \in \operatorname{span}\{\Sigma\beta_i,\ i = 1, \ldots, d\}.$  (A.1)

Since $\Gamma = \operatorname{cov}[\mathbb{E}(\phi(X)\mid Y)]$, the rank of $\Gamma$ (i.e., the dimension of the image of $\Gamma$) is at most $d$, which implies that $\Gamma$ is compact. Together with the fact that it is symmetric and positive semidefinite, there exist $d_\Gamma$ positive eigenvalues $\{\tau_i\}_{i=1}^{d_\Gamma}$ and eigenvectors $\{u_i\}_{i=1}^{d_\Gamma}$ such that $\Gamma = \sum_{i=1}^{d_\Gamma}\tau_i u_i \otimes u_i$.

Recall that, for any $f \in \mathcal{H}$,

$\Gamma f = \mathbb{E}\bigl[\langle\mathbb{E}[\phi(X)\mid Y], f\rangle\,\mathbb{E}[\phi(X)\mid Y]\bigr]$  (A.2)

also belongs to

$\operatorname{span}\{\Sigma\beta_i,\ i = 1, \ldots, d\} \subset \operatorname{range}(\Sigma)$  (A.3)

because of (A.1), so

$u_i = \frac{1}{\tau_i}\,\Gamma u_i \in \operatorname{range}(\Sigma).$  (A.4)

This proves (i).

Since $\Gamma f \in \operatorname{range}(\Sigma)$ for each $f \in \mathcal{H}$, the operator $T = \Sigma^{-1}\Gamma$ is well defined on the whole space. Moreover,

$Tf = \Sigma^{-1}\Bigl(\sum_{i=1}^{d_\Gamma}\tau_i\langle u_i, f\rangle u_i\Bigr) = \sum_{i=1}^{d_\Gamma}\tau_i\langle u_i, f\rangle\,\Sigma^{-1}(u_i) = \Bigl(\sum_{i=1}^{d_\Gamma}\tau_i\,\Sigma^{-1}(u_i)\otimes u_i\Bigr)f.$  (A.5)

This proves (ii).

B. Proof of Proposition 7

We first prove the proposition for matrices, to simplify the notation; we then extend the result to operators, where $d_K$ (the dimension of $\mathcal{H}_K$) is infinite and a matrix form does not make sense.

Let $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$. It has the SVD

$\Phi = [u_1 \cdots u_{d_K}]\begin{pmatrix} D_{d\times d} & 0_{d\times(n-d)} \\ 0_{(d_K-d)\times d} & 0_{(d_K-d)\times(n-d)} \end{pmatrix}\begin{pmatrix} v_1^T \\ \vdots \\ v_n^T \end{pmatrix} = UDV^T,$  (B.1)

where $U = [u_1, \ldots, u_d]$, $V = [v_1, \ldots, v_d]$, and $D = D_{d\times d}$ is a diagonal matrix of dimension $d \le n$.

We need to show that the KSIR variates satisfy

$v_j = c_j^T(k(x,x_1), \ldots, k(x,x_n))^T = c_j^T\Phi^T\phi(x) = \langle\Phi c_j, \phi(x)\rangle = \langle\beta_j, \phi(x)\rangle.$  (B.2)

It suffices to prove that if $(\lambda, c)$ is a solution of (28), then $(\lambda, \beta)$ is an eigenvalue-eigenvector pair of $(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$, and vice versa, where $c$ and $\beta$ are related by

$\beta = \Phi c, \qquad c = VD^{-1}U^T\beta.$  (B.3)

Noting the facts $\hat\Sigma = (1/n)\Phi\Phi^T$, $\hat\Gamma = (1/n)\Phi J\Phi^T$, and $K = \Phi^T\Phi = VD^2V^T$, the argument may be made as follows:

$\hat\Sigma\hat\Gamma\beta = \lambda(\hat\Sigma^2 + sI)\beta$
$\iff \Phi\Phi^T\Phi J\Phi^T\Phi c = \lambda\bigl(\Phi\Phi^T\Phi\Phi^T\Phi c + n^2 s\,\Phi c\bigr)$
$\iff \Phi KJKc = \lambda\,\Phi(K^2 + n^2 sI)c$
$\bigl(\iff VV^TKJKc = \lambda\,VV^T(K^2 + n^2 sI)c\bigr)$
$\iff KJKc = \lambda(K^2 + n^2 sI)c.$  (B.4)

Note that the implication in the third step is needed only in the $\Rightarrow$ direction; it is obtained by multiplying both sides by $VD^{-1}U^T$ and using the fact that $U^TU = I_d$. For the last step, since $V^TV = I_d$, we use the following facts:

$VV^TK = VV^TVD^2V^T = VD^2V^T = K, \qquad VV^Tc = VV^TVD^{-1}U^T\beta = VD^{-1}U^T\beta = c.$  (B.5)

In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. In the infinite dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^{n} v_i\phi(x_i)$ for $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$ such that $\Phi^Tf = (\langle\phi(x_1), f\rangle_K, \ldots, \langle\phi(x_n), f\rangle_K)^T$ for $f \in \mathcal{H}_K$. The notations $U$ and $U^T$ are defined similarly.

The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\hat\Sigma$ as a covariance operator. Since the rank of $\hat\Sigma$ is less than $n$, it is compact and has the representation

$\hat\Sigma = \sum_{i=1}^{d_K}\sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d}\sigma_i\, u_i \otimes u_i,$  (B.6)

where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\operatorname{span}(u_1, \ldots, u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1, \ldots, u_d)$ should be considered as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i \in \mathbb{R}^d$. Denote $\Upsilon = (\tau_1, \ldots, \tau_n)^T \in \mathbb{R}^{n\times d}$. It is easy to check that $\Upsilon^T\Upsilon = \operatorname{diag}(n\sigma_1, \ldots, n\sigma_d)$. Let $D_{d\times d} = \operatorname{diag}(\sqrt{n\sigma_1}, \ldots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD for $\Phi$ as $\Phi = UDV^T$, which is well defined.

C. Proof of Consistency

C.1. Preliminaries

In order to prove Theorems 9 and 10, we use properties of Hilbert-Schmidt operators, covariance operators for Hilbert space valued random variables, and perturbation theory for linear operators. In this subsection we provide a brief introduction to them; for details see [27–29] and references therein.

Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$ (possibly infinite), a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if

$\|L\|_{HS}^2 = \sum_{i=1}^{p_{\mathcal{H}}}\|Le_i\|_{\mathcal{H}}^2 < \infty,$  (C.1)

where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{HS}$. Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:

$\|SL\|_{HS} \le \|S\|\,\|L\|_{HS}, \qquad \|LS\|_{HS} \le \|L\|_{HS}\,\|S\|,$  (C.2)

where $\|\cdot\|$ denotes the operator norm,

$\|L\|^2 = \sup_{f\in\mathcal{H}}\frac{\|Lf\|^2}{\|f\|^2}.$  (C.3)

Let $Z$ be a random element taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^2 < \infty$. The covariance operator

$\Sigma = \mathbb{E}[(Z - \mathbb{E}Z)\otimes(Z - \mathbb{E}Z)]$  (C.4)

is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.

A well-known result from perturbation theory for linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ at the same rate as the operators converge.

C.2. Proof of Theorem 10

We will use the following result from [14]:

$\|\hat\Sigma - \Sigma\|_{HS} = O_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad \|\hat\Gamma - \Gamma\|_{HS} = O_p\!\left(\frac{1}{\sqrt{n}}\right).$  (C.5)

To simplify the notation, we denote $\hat T_s = (\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$. Also define

$T_1 = (\hat\Sigma^2 + sI)^{-1}\Sigma\Gamma, \qquad T_2 = (\Sigma^2 + sI)^{-1}\Sigma\Gamma.$  (C.6)

Then

$\|\hat T_s - T\|_{HS} \le \|\hat T_s - T_1\|_{HS} + \|T_1 - T_2\|_{HS} + \|T_2 - T\|_{HS}.$  (C.7)

For the first term, observe that

$\|\hat T_s - T_1\|_{HS} \le \|(\hat\Sigma^2 + sI)^{-1}\|\,\|\hat\Sigma\hat\Gamma - \Sigma\Gamma\|_{HS} = O_p\!\left(\frac{1}{s\sqrt{n}}\right).$  (C.8)

For the second term, note that

$T_1 = \sum_{j=1}^{d_\Gamma}\tau_j\,\bigl((\hat\Sigma^2 + sI)^{-1}\Sigma u_j\bigr)\otimes u_j, \qquad T_2 = \sum_{j=1}^{d_\Gamma}\tau_j\,\bigl((\Sigma^2 + sI)^{-1}\Sigma u_j\bigr)\otimes u_j.$  (C.9)

Therefore,

$\|T_1 - T_2\|_{HS} \le \sum_{j=1}^{d_\Gamma}\tau_j\,\bigl\|\bigl((\hat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\bigr)\Sigma u_j\bigr\|.$  (C.10)

Since $u_j \in \operatorname{range}(\Sigma)$, there exists $h_j$ such that $u_j = \Sigma h_j$. Then

$\bigl((\hat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\bigr)\Sigma u_j = (\hat\Sigma^2 + sI)^{-1}(\Sigma^2 - \hat\Sigma^2)(\Sigma^2 + sI)^{-1}\Sigma^2 h_j,$  (C.11)

which implies

$\|T_1 - T_2\|_{HS} \le \sum_{j=1}^{d_\Gamma}\tau_j\,\|(\hat\Sigma^2 + sI)^{-1}\|\,\|\hat\Sigma^2 - \Sigma^2\|_{HS}\,\|(\Sigma^2 + sI)^{-1}\Sigma^2\|\,\|h_j\| = O_p\!\left(\frac{1}{s\sqrt{n}}\right).$  (C.12)

For the third term, the following holds:

$\|T_2 - T\|_{HS}^2 = \sum_{j=1}^{d_\Gamma}\tau_j^2\,\bigl\|\bigl((\Sigma^2 + sI)^{-1}\Sigma - \Sigma^{-1}\bigr)u_j\bigr\|^2,$  (C.13)

and for each $j = 1, \ldots, d_\Gamma$,

$\bigl\|(\Sigma^2 + sI)^{-1}\Sigma u_j - \Sigma^{-1}u_j\bigr\| \le \bigl\|(\Sigma^2 + sI)^{-1}\Sigma^2 h_j - h_j\bigr\| = \Bigl\|\sum_{i=1}^{\infty}\Bigl(\frac{w_i^2}{s + w_i^2} - 1\Bigr)\langle h_j, e_i\rangle e_i\Bigr\| = \Bigl(\sum_{i=1}^{\infty}\frac{s^2}{(s + w_i^2)^2}\langle h_j, e_i\rangle^2\Bigr)^{1/2} \le \frac{s}{w_N^2}\Bigl(\sum_{i=1}^{N}\langle h_j, e_i\rangle^2\Bigr)^{1/2} + \Bigl(\sum_{i=N+1}^{\infty}\langle h_j, e_i\rangle^2\Bigr)^{1/2} = \frac{s}{w_N^2}\bigl\|\Pi_N(h_j)\bigr\| + \bigl\|\Pi_N^{\perp}(h_j)\bigr\|.$  (C.14)

Combining these terms results in (33).

Since $\|\Pi_N^{\perp}(h_j)\| \to 0$ as $N \to \infty$, we consequently have

$\|\hat T_s - T\|_{HS} = o_p(1)$  (C.15)

if $s \to 0$ and $s\sqrt{n} \to \infty$.

If all the e.d.r. directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N \ge 1$ such that each $u_j$ belongs to $\mathcal{S}^{\ast} = \operatorname{span}\{\Sigma e_i,\ i = 1, \ldots, N\}$ (recall from the proof of Theorem 3 that $u_j \in \operatorname{span}\{\Sigma\beta_1, \ldots, \Sigma\beta_d\}$). This implies that

$h_j = \Sigma^{-1}u_j \in \Sigma^{-1}(\mathcal{S}^{\ast}) \subset \operatorname{span}\{e_i,\ i = 1, \ldots, N\}.$  (C.16)

Therefore, $\Pi_N^{\perp}(h_j) = 0$. Taking $s = O(n^{-1/4})$, the rate is $O_p(n^{-1/4})$.

Acknowledgments

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50-GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or the NIH.

References

[1] K.-C. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316–342, 1991.
[2] N. Duan and K.-C. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505–530, 1991.
[3] R. D. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328–332, 1991.
[4] K.-C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025–1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615–625, 2007.
[6] B. Schölkopf, A. J. Smola, and K. Müller, "Kernel principal component analysis," in Proceedings of Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583–588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590–1603, 2009.
[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441–458, pp. 415–446, 1909.
[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Birkhäuser, Basel, Switzerland, 1986.
[14] L. Ferré and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475–488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867–889, 1993.
[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54–77, 2003.
[17] L. Ferré and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665–683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169–4175, 2005.
[19] L. Ferré and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807–823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124–131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85–98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040–1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141–2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727–736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628–635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871–1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.


Abstract and Applied Analysis 3

variables119883We call 120573119895119889

119895=1the nonlinear edr directions and

S = span(1205731 120573

119889) the nonlinear edr spaceThe following

proposition [10] extends the theoretical foundation of SIR tothis nonlinear setting

Proposition 2 Assume the following linear design conditionfor H that for any 119891 isin H there exists a vector 119887 isin R119889 suchthat

E [⟨119891 120601 (119883)⟩ | 119878 (119883)] = 119887119879119878 (119883)

with 119878 (119883) = (⟨1205731 120601 (119883)⟩ ⟨120573119889 120601 (119883)⟩)119879

(11)

Then for the model specified in (9) the inverse regression curveE[120601(119883) | 119884] is contained in the span of (Σ120573

1 Σ120573

119889) where

Σ is the covariance operator of 120601(119883)

Proposition 2 is a straightforward extension of the multi-variate case in [1] to a Hilbert space or a direct application ofthe functional SIR setting in [14] Although the linear designcondition (11) may be difficult to check in practice it has beenshown that such a condition usually holds approximately in ahigh-dimensional space [15] This conforms to the argumentin [11] that the linearity in a reproducing kernel Hilbertspace is less strict than the linearity in the Euclidean spaceMoreover it is pointed out in [1 10] that even if the lineardesign condition is violated the bias of SIR variate is usuallynot large

An immediate consequence of Proposition 2 is thatnonlinear edr directions are the eigenvectors correspondingto the largest eigenvalues of the following generalized eigen-decomposition problem

Γ120573 = 120582Σ120573

where Σ = cov [120601 (119883)] Γ = cov [E (120601 (119883) | 119884)] (12)

or equivalently from an eigenanalysis of the operator 119879 =

Σminus1Γ In the infinite dimensional case a technical difficulty

arises since the operator

Σminus1=

infin

sum

119894=1

119908minus1

119894119890119894otimes 119890119894 (13)

is not defined on the entire Hilbert space H So for theoperator 119879 to be well defined we need to show that therange of Γ is indeed in the range of Σ A similar issue alsoarose in the analysis of dimension reduction and canonicalanalysis for functional data [16 17] In these analyses extraconditions are needed for operators like 119879 to be well definedIn KSIR this issue is resolved automatically by the lineardesign condition and extra conditions are not required asstated by the following Theorem see Appendix A for theproof

Theorem 3 Under Assumption 1 and the linear design condi-tion (11) the following hold

(i) the operator Γ is of finite rank 119889Γle 119889 Consequently

it is compact and has the following spectral decomposi-tion

Γ =

119889Γ

sum

119894=1

120591119894119906119894otimes 119906119894 (14)

where 120591119894and 119906

119894are the eigenvalues and eigenvectors

respectively Moreover 119906119894isin range(Σ) for all 119894 = 1

119889Γ

(ii) the eigendecomposition problem (12) is equivalent tothe eigenanalysis of the operator 119879 which takes thefollowing form

119879 =

119889Γ

sum

119894=1

120591119894119906119894otimes Σminus1(119906119894) (15)

3 Regularized Kernel SlicedInverse Regression

Thediscussion in Section 2 implies that nonlinear edr direc-tions can be retrieved by applying the original SIR algorithmin the feature space induced by the Mercer kernel There aresome computational challenges to this idea such as estimatingan infinite dimensional covariance operator and the fact thatthe feature map is often difficult or impossible to compute formany kernels We address these issues by working with innerproducts of the feature map and adding a regularization termto kernel SIR

31 Estimating the Nonlinear edr Directions Given 119899 obser-vations (119909

1 1199101) (119909

119899 119910119899) our objective is to obtain

an estimate of the edr directions (1205731 120573

119889) We first

formulate a procedure almost identical to the standard SIRprocedure except that it operates in the feature spaceH Thishighlights the immediate relation between the SIR and KSIRprocedures

(1) Without loss of generality we assume that themappedpredictor variables aremean zero that issum119899

119894=1120601(119909119894) =

0 for otherwise we can subtract 120601 = (1119899)sum119899119894=1120601(119909119894)

from 120601(119909119894) The sample covariance is estimated by

Σ =1

119899

119899

sum

119894=1

120601 (119909119894) otimes 120601 (119909

119894) (16)

(2) Bin the119884 variables into119867 slices1198661 119866

119867and com-

pute mean vectors of the corresponding mapped pre-dictor variables for each slice

120595ℎ=1

119899ℎ

sum

119894isin119866ℎ

120601 (119909119894) ℎ = 1 119867 (17)

Compute the sample between-group covariance mat-rix

Γ =

119867

sum

ℎ=1

119899ℎ

119899120595ℎotimes 120595ℎ (18)

4 Abstract and Applied Analysis

(3) Estimate the SIR directions $\beta_j$ by solving the generalized eigendecomposition problem

$$\hat{\Gamma}\beta = \lambda\hat{\Sigma}\beta. \tag{19}$$

This procedure is computationally impossible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (9) requires not the e.d.r. directions themselves but the projections onto these directions, that is, the $d$ summary statistics

$$v_1 = \langle\beta_1, \phi(x)\rangle, \ \dots, \ v_d = \langle\beta_d, \phi(x)\rangle, \tag{20}$$

which we call the KSIR variates. Like other kernel algorithms, the kernel trick enables the KSIR variates to be computed efficiently from only the kernel $k$, not the map $\phi$.

The key quantity in this alternative formulation is the centered Gram matrix $K$ defined by the kernel function $k(\cdot, \cdot)$, where

$$K_{ij} = \langle\phi(x_i), \phi(x_j)\rangle = k(x_i, x_j). \tag{21}$$

Note that the rank of $K$ is less than $n$, so $K$ is always singular. Given the centered Gram matrix $K$, the following generalized eigendecomposition problem can be used to compute the KSIR variates:

$$K J K c = \lambda K^2 c, \tag{22}$$

where $c$ denotes the $n$-dimensional generalized eigenvector and $J$ denotes the $n \times n$ matrix with $J_{ij} = 1/n_m$ if $i$ and $j$ are in the $m$th group consisting of $n_m$ observations, and zero otherwise. The following proposition establishes the equivalence between the two eigendecomposition problems (22) and (19) in the recovery of the KSIR variates $(v_1, \dots, v_d)$.

Proposition 4. Given the observations $(x_1, y_1), \dots, (x_n, y_n)$, let $(\beta_1, \dots, \beta_d)$ and $(c_1, \dots, c_d)$ denote the generalized eigenvectors of (19) and (22), respectively. Then, for any $x \in \mathcal{X}$ and $j = 1, \dots, d$, the following holds:

$$v_j = \langle\beta_j, \phi(x)\rangle = c_j^T K_x, \qquad K_x = \big(k(x, x_1), \dots, k(x, x_n)\big)^T, \tag{23}$$

provided that $\hat{\Sigma}$ is invertible. When $\hat{\Sigma}$ is not invertible, the conclusion holds modulo the null space of $\hat{\Sigma}$.

This result was proven in [10], in which the algorithm was further reduced to solving

$$J K c = \lambda K c \tag{24}$$

by canceling $K$ from both sides of (22).
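To make this computation concrete, the following is a minimal sketch that builds the centered Gram matrix and the slice-averaging matrix J and solves the unregularized problem (24). Python with NumPy/SciPy is assumed purely for illustration, and the helper names (center_gram, slice_matrix, ksir_unregularized) are ours, not part of the original presentation:

import numpy as np
from scipy.linalg import eig

def center_gram(K):
    # double-center the Gram matrix: K_c = H K H with H = I - (1/n) 1 1^T
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def slice_matrix(slice_labels):
    # J_ij = 1/n_m if observations i and j fall in the same slice m, and 0 otherwise
    slice_labels = np.asarray(slice_labels)
    n = len(slice_labels)
    J = np.zeros((n, n))
    for m in np.unique(slice_labels):
        idx = np.where(slice_labels == m)[0]
        J[np.ix_(idx, idx)] = 1.0 / len(idx)
    return J

def ksir_unregularized(K, slice_labels, d):
    # solve J K c = lambda K c, eq. (24); the centered K is singular, so this
    # problem is ill conditioned, which is what the regularization of Section 3.2 addresses
    Kc, J = center_gram(K), slice_matrix(slice_labels)
    vals, vecs = eig(J @ Kc, Kc)
    keep = np.isfinite(vals)
    vals, vecs = vals[keep], vecs[:, keep]
    order = np.argsort(-vals.real)
    return vecs[:, order[:d]].real      # columns are c_1, ..., c_d

# KSIR variate at a point x, eq. (23): v_j = c_j^T K_x with
# K_x = (k(x, x_1), ..., k(x, x_n))^T centered consistently with K.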

Remark 5. It is important to remark that, when $\hat{\Sigma}$ is not invertible, Proposition 4 states that the equivalence between (19) and (22) holds modulo the null space of $\hat{\Sigma}$, that is, under the requirement of the representer theorem that the $\beta_j$ are linear combinations of the $\phi(x_i)$. Without this mandated condition, we will see that each eigenvector of (22) produces an eigenvector of (19), while an eigenvector of (19) is not necessarily recovered by (22).

Remark 6. It is necessary to clarify the difference between (12) and (19). In (12) it is natural to assume that $\beta$ is orthogonal to the null space of $\Sigma$. To see this, let $\beta = \beta_0 + \beta^{\star}$ with $\beta_0$ belonging to the null space and $\beta^{\star}$ orthogonal to the null space. Then

$$\mathbb{E}\big[\langle\beta, \phi(x)\rangle^2\big] = \langle\beta, \Sigma\beta\rangle = \langle\beta^{\star}, \Sigma\beta^{\star}\rangle = \mathbb{E}\big[\langle\beta^{\star}, \phi(x)\rangle^2\big], \tag{25}$$

that is, $\langle\beta, \phi(x)\rangle = \langle\beta^{\star}, \phi(x)\rangle$ (a.s.). It means that $\beta_0$ does not contribute to the KSIR variates and thus can be set to $0$. However, in (19), if $\beta$ is an eigenvector and $\beta^{\star}$ is its component orthogonal to the null space of $\hat{\Sigma}$, the identity $\langle\beta, \phi(x)\rangle = \langle\beta^{\star}, \phi(x)\rangle$ is in general not true for a new point $x$ that is different from the observations. Thus, from a theoretical perspective, it is not as natural to assume the representer theorem, although it works well in practice. In this sense the KSIR algorithm based on (22) or (24) does not have a thorough mathematical foundation.

3.2. Regularization and Stability. Apart from these theoretical subtleties, in applications with relatively small samples the eigendecomposition in (22) is often ill-conditioned, resulting in overfitting as well as numerically unstable estimates of the e.d.r. space. This can be addressed either by thresholding the eigenvalues of the estimated covariance matrix $\hat{\Sigma}$ or by adding a regularization term to (22) or (24).

We motivate two types of regularization schemes. The first one is the traditional ridge regularization, used in both linear SIR and functional SIR [18–20], which solves the eigendecomposition problem

$$\hat{\Gamma}\beta = \lambda(\hat{\Sigma} + sI)\beta. \tag{26}$$

Here and in the sequel, $I$ denotes the identity operator and $s$ is a tuning parameter. Assuming the representer theorem, its kernel form is given as

$$J K c = \lambda(K + nsI)c. \tag{27}$$

Another type of regularization is to regularize (22) directly:

$$K J K c = \lambda(K^2 + n^2 sI)c. \tag{28}$$

The following proposition, whose proof is given in Appendix B, states that solving the generalized eigendecomposition problem (28) is equivalent to finding the eigenvectors of

$$(\hat{\Sigma}^2 + sI)^{-1}\hat{\Sigma}\hat{\Gamma}. \tag{29}$$

Proposition 7. Let $c_j$ be the eigenvectors of (28) and let $\beta_j$ be the eigenvectors of (29). Then the following holds for the regularized KSIR variates:

$$v_j = c_j^T\big(k(x, x_1), \dots, k(x, x_n)\big)^T = \langle\beta_j, \phi(x)\rangle. \tag{30}$$


This algorithm is termed the Tikhonov regularization. For linear SIR it is shown in [21] that the Tikhonov regularization is more efficient than the ridge regularization.

Besides improving computational stability, regularization also makes the matrix forms (27) and (28) of KSIR interpretable, by justifying the representer theorem.

Proposition 8. For both the ridge and the Tikhonov regularization schemes of KSIR, the eigenfunctions $\beta_j$ are linear combinations of $\phi(x_i)$, $i = 1, \dots, n$.

The conclusion follows from the observation that $\beta_j = (1/(s\lambda_j))(\hat{\Gamma}\beta_j - \lambda_j\hat{\Sigma}\beta_j)$ for the ridge regularization and $\beta_j = (1/(s\lambda_j))(\hat{\Sigma}\hat{\Gamma}\beta_j - \lambda_j\hat{\Sigma}^2\beta_j)$ for the Tikhonov regularization, and in both cases the right-hand side is a linear combination of $\phi(x_1), \dots, \phi(x_n)$.

To close, we remark that KSIR is computationally advantageous even in the case of linear models when $p \gg n$, because the eigendecomposition problem involves $n \times n$ matrices rather than the $p \times p$ matrices of the standard SIR formulation.
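As a sketch of how the regularized problems (27) and (28) replace (24) in computation, the fragment below reuses the hypothetical helpers center_gram and slice_matrix from the earlier sketch; it is an illustrative reconstruction under the same NumPy/SciPy assumption, not the authors' implementation:

import numpy as np
from scipy.linalg import eig

def rksir(K, slice_labels, d, s, scheme="tikhonov"):
    # ridge (27):    J K c = lambda (K + n s I) c
    # Tikhonov (28): K J K c = lambda (K^2 + n^2 s I) c
    n = K.shape[0]
    Kc, J = center_gram(K), slice_matrix(slice_labels)
    if scheme == "ridge":
        A, B = J @ Kc, Kc + n * s * np.eye(n)
    else:
        A, B = Kc @ J @ Kc, Kc @ Kc + (n ** 2) * s * np.eye(n)
    vals, vecs = eig(A, B)              # B is positive definite for s > 0
    order = np.argsort(-vals.real)
    return vecs[:, order[:d]].real      # coefficient vectors c_1, ..., c_d

The design choice mirrors the text: for $s > 0$ the right-hand side matrix is well conditioned, which is precisely what stabilizes the eigendecomposition.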

3.3. Consistency of Regularized KSIR. In this subsection we prove the asymptotic consistency of the e.d.r. directions estimated by regularized KSIR and provide conditions under which the rate of convergence is $O_p(n^{-1/4})$. An important observation from the proof is that the rate of convergence of the e.d.r. directions depends on the contribution of the small principal components: the rate can be arbitrarily slow if the e.d.r. space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.

Note that various consistency results are available for linear SIR [22–24]. These results hold only in the finite dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14] a thresholding method is considered, which selects a finite number of eigenvectors and uses results from finite rank operators; their proof of consistency requires stronger and more complicated conditions than ours. The consistency of functional SIR with ridge regularization is proven in [19], but it is of a weaker form than our result. We remark that the consistency results for functional SIR can be improved using an argument similar to the one in this paper.

In the following we state the consistency result for the Tikhonov regularization. A similar result can be proved for the ridge regularization; the details are omitted.

Theorem 9. Assume $\mathbb{E}\,k(X, X)^2 < \infty$, $\lim_{n\to\infty} s(n) = 0$, and $\lim_{n\to\infty} s\sqrt{n} = \infty$. Then

$$\big|\langle\hat{\beta}_j, \phi(\cdot)\rangle - \langle\beta_j, \phi(\cdot)\rangle\big| = o_p(1), \quad j = 1, \dots, d_\Gamma, \tag{31}$$

where $d_\Gamma$ is the rank of $\Gamma$, $\langle\beta_j, \phi(\cdot)\rangle$ is the projection onto the $j$th e.d.r. direction, and $\langle\hat{\beta}_j, \phi(\cdot)\rangle$ is the projection onto the $j$th e.d.r. direction as estimated by regularized KSIR.

If the e.d.r. directions $\{\beta_j\}_{j=1}^{d_\Gamma}$ depend only on a finite number of eigenvectors of the covariance operator $\Sigma$, the rate of convergence is $O_p(n^{-1/4})$.

This theorem is a direct corollary of the following theorem, which is proven in Appendix C.

Theorem 10. Define the projection operator and its complement, for each $N \ge 1$,

$$\Pi_N = \sum_{i=1}^{N} e_i \otimes e_i, \qquad \Pi_N^{\perp} = I - \Pi_N = \sum_{i=N+1}^{\infty} e_i \otimes e_i, \tag{32}$$

where $\{e_i\}_{i=1}^{\infty}$ are the eigenvectors of the covariance operator $\Sigma$ as defined in (8), with the corresponding eigenvalues denoted by $w_i$.

Assume $\mathbb{E}\,k(X, X)^2 < \infty$. For each $N \ge 1$ the following holds:

$$\Big\|(\hat{\Sigma}^2 + sI)^{-1}\hat{\Sigma}\hat{\Gamma} - T\Big\|_{HS} = O_p\!\left(\frac{1}{s\sqrt{n}}\right) + \sum_{j=1}^{d_\Gamma}\left(\frac{s}{w_N^2}\,\big\|\Pi_N(\ell_j)\big\| + \big\|\Pi_N^{\perp}(\ell_j)\big\|\right), \tag{33}$$

where $\ell_j = \Sigma^{-1}u_j$, $\{u_j\}_{j=1}^{d_\Gamma}$ are the eigenvectors of $\Gamma$ as defined in (14), and $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm of a linear operator.

If $s = s(n)$ satisfies $s \to 0$ and $s\sqrt{n} \to \infty$ as $n \to \infty$, then

$$\Big\|(\hat{\Sigma}^2 + sI)^{-1}\hat{\Sigma}\hat{\Gamma} - T\Big\|_{HS} = o_p(1). \tag{34}$$

4. Application to Simulated and Real Data

In this section we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons are used to address two questions: (1) does regularization improve the performance of kernel sliced inverse regression, and (2) does the nonlinearity of kernel sliced inverse regression improve prediction accuracy?

We would like to remark that the assessment of nonlinear dimension reduction methods can be more difficult than that of linear ones. When the feature mapping $\phi$ for an RKHS is not available, we do not know the true e.d.r. directions or subspace, so in that case we use the prediction accuracy to evaluate the goodness of RKSIR.

4.1. Importance of Nonlinearity and Regularization. Our first example illustrates that both the nonlinearity and the regularization of RKSIR can significantly improve prediction accuracy.

The regression model has ten predictor variables $X = (X_1, \dots, X_{10})$, each following a standard normal distribution $X_i \sim N(0, 1)$. A univariate response is given as

$$Y = \big(\sin(X_1) + \sin(X_2)\big)\big(1 + \sin(X_3)\big) + \varepsilon, \tag{35}$$

with noise $\varepsilon \sim N(0, 0.1^2)$. We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to compute the e.d.r. directions.


Figure 1: Dimension reduction for model (35). (a) Mean square error for RKSIR as a function of the regularization parameter $\log_{10}(s)$, for $d = 1$ and $d = 2$; (b) mean square error versus the number of e.d.r. directions for RKSIR, SIR, SAVE, and PHD, where the blue line represents the mean square error without dimension reduction; (c)-(d) RKSIR variates for the first and second nonlinear e.d.r. directions plotted against $\sin(X_1) + \sin(X_2)$ and $1 + \sin(X_3)$, respectively.

In RKSIR we used an additive Gaussian kernel,

$$k(x, z) = \sum_{j=1}^{d} \exp\!\left(-\frac{(x_j - z_j)^2}{2\sigma^2}\right), \tag{36}$$

where $\sigma = 2$ and the sum runs over the coordinates of the predictors. After projecting the training samples onto the estimated e.d.r. directions, we train a Gaussian kernel regression model on the new variates. The mean square error is then computed on 1000 independent test samples. This experiment is repeated 100 times, with all parameters set by cross-validation. The results are summarized in Figure 1.

Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter; KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of the various dimension reduction methods.

RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear e.d.r. directions, which agrees with the model in (35). Indeed, Figures 1(c) and 1(d) show that the first two e.d.r. directions from RKSIR estimate the two nonlinear factors well.

4.2. Effect of Regularization. This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.


The regression model has ten predictor variables $X = (X_1, \dots, X_{10})$ and a univariate response specified by

$$Y = X_1 + X_2^2 + \varepsilon, \qquad \varepsilon \sim N(0, 0.1^2), \tag{37}$$

where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q \Delta Q^T$, with $Q$ a randomly chosen orthogonal matrix and $\Delta = \operatorname{diag}(1^\theta, 2^\theta, \dots, 10^\theta)$. Increasing the parameter $\theta \in [0, \infty)$ increases the anisotropy of the data, which in turn increases the difficulty of identifying the correct e.d.r. directions.
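The anisotropic design can be simulated as sketched below, again under the Python/NumPy assumption; the random orthogonal matrix Q is obtained here from a QR factorization of a Gaussian matrix:

import numpy as np

def simulate_model_37(n, theta, rng):
    # Sigma_X = Q diag(1^theta, ..., 10^theta) Q^T, as in model (37)
    Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
    scales = np.arange(1, 11, dtype=float) ** theta
    Z = rng.standard_normal((n, 10)) * np.sqrt(scales)   # N(0, diag(scales))
    X = Z @ Q.T                                          # cov(X) = Q diag(scales) Q^T
    Y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
    return X, Y

X, Y = simulate_model_37(200, theta=1.5, rng=np.random.default_rng(0))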

For this model it is known that SIR will miss the direction along the second variable, $X_2$, so we focus on the comparison of KSIR and RKSIR in this example. If we use the second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$, which corresponds to the feature space

$$\Phi(X) = \big(1,\ X_i,\ (X_i X_j)_{i \le j}\big), \quad i, j = 1, \dots, 10, \tag{38}$$

then $X_1 + X_2^2$ can be captured in one e.d.r. direction. Ideally the first KSIR variate $v = \langle\beta_1, \phi(X)\rangle$ should be equivalent to $X_1 + X_2^2$ modulo shift and scale:

$$v - \mathbb{E}v \propto X_1 + X_2^2 - \mathbb{E}\big(X_1 + X_2^2\big). \tag{39}$$

So, for this example, given the estimated KSIR variates at the $n$ data points, $\{v_i\}_{i=1}^{n} = \{\langle\beta_1, \phi(x_i)\rangle\}_{i=1}^{n}$, the error of the first e.d.r. direction can be measured by the least squares fit of $v$ with respect to $X_1 + X_2^2$:

$$\text{error} = \min_{a, b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} \Big(v_i - \big(a(x_{i1} + x_{i2}^2) + b\big)\Big)^2. \tag{40}$$

We drew 200 observations from the model specified in (37) and then applied the two dimension reduction methods, KSIR and RKSIR. The means and standard errors over 100 repetitions of this procedure are reported in Figure 2. The results show that KSIR becomes more and more unstable as $\theta$ increases and that regularization helps to reduce this instability.

4.3. Importance of Nonlinearity and Regularization in Real Data. When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For multiclass classification it is natural to use SIR and to consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example we show that this idea of adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set: the classification of handwritten digits.

The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist) contains 60,000 images of handwritten digits 0, 1, 2, ..., 9 as training data and 10,000 images as test data. Each image consists of $p = 28 \times 28 = 784$ gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.

Figure 2: Error in the e.d.r. direction as a function of the parameter $\theta$, for KSIR and RKSIR.

We compared regularized SIR (RSIR) as in (26), KSIR, and RKSIR on this data set to examine the effects of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 e.d.r. directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the e.d.r. directions and used a k-nearest neighbor (kNN) classifier with $k = 5$ to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR we used a Gaussian kernel with the bandwidth parameter set to the median pairwise distance between observations. The regularization parameter was set by cross-validation.
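The median-distance bandwidth for the Gaussian kernel on the digit data can be computed as sketched below (a Python/NumPy illustration with hypothetical names, not the authors' code); the projected test variates would then be passed to any kNN classifier with k = 5:

import numpy as np

def gaussian_kernel_median_bandwidth(X, Z=None):
    # Gaussian kernel whose bandwidth is the median pairwise distance among the rows of X
    Z = X if Z is None else Z
    xx = (X ** 2).sum(axis=1)
    zz = (Z ** 2).sum(axis=1)
    sq_xz = np.maximum(xx[:, None] + zz[None, :] - 2.0 * X @ Z.T, 0.0)
    sq_xx = np.maximum(xx[:, None] + xx[None, :] - 2.0 * X @ X.T, 0.0)
    sigma = np.median(np.sqrt(sq_xx)[np.triu_indices(len(X), k=1)])
    return np.exp(-sq_xz / (2.0 * sigma ** 2))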

The means and standard deviations of the classification error rates over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy further. This improvement is dramatic for the digits 2, 3, 5, and 8.

5. Discussion

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework; therefore, the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.

Table 1: Means and standard deviations of error rates in the classification of digits.

Digit      RKSIR              KSIR               RSIR               kNN
0          0.0273 (0.0089)    0.0472 (0.0191)    0.0487 (0.0128)    0.0291 (0.0071)
1          0.0150 (0.0049)    0.0177 (0.0051)    0.0292 (0.0113)    0.0052 (0.0012)
2          0.1039 (0.0207)    0.1475 (0.0497)    0.1921 (0.0238)    0.2008 (0.0186)
3          0.0845 (0.0208)    0.1279 (0.0494)    0.1723 (0.0283)    0.1092 (0.0130)
4          0.0784 (0.0240)    0.1044 (0.0461)    0.1327 (0.0327)    0.1617 (0.0213)
5          0.0877 (0.0209)    0.1327 (0.0540)    0.2146 (0.0294)    0.1419 (0.0193)
6          0.0472 (0.0108)    0.0804 (0.0383)    0.0816 (0.0172)    0.0446 (0.0081)
7          0.0887 (0.0169)    0.1119 (0.0357)    0.1354 (0.0172)    0.1140 (0.0125)
8          0.0981 (0.0259)    0.1490 (0.0699)    0.1981 (0.0286)    0.1140 (0.0156)
9          0.0774 (0.0251)    0.1095 (0.0398)    0.1533 (0.0212)    0.2006 (0.0153)
Average    0.0708 (0.0105)    0.1016 (0.0190)    0.1358 (0.0093)    0.1177 (0.0039)

An RKHS has also been introduced into supervised dimension reduction in [26], where conditional covariance operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore, their method estimates linear e.d.r. subspaces, while in [10, 11] and in this paper the RKHS is used to model the e.d.r. subspace, which leads to nonlinear dimension reduction.

There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear e.d.r. directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well studied for linear dimension reduction, little is known for nonlinear dimension reduction; we leave them for future research.

There are some interesting connections between KSIR and functional SIR, which was developed by Ferré and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear e.d.r. directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR are different from those for functional SIR.

Appendices

A. Proof of Theorem 3

Under the assumptions of Proposition 2, for each $Y = y$,

$$\mathbb{E}[\phi(X) \mid Y = y] \in \operatorname{span}\{\Sigma\beta_i,\; i = 1, \dots, d\}. \tag{A.1}$$

Since $\Gamma = \operatorname{cov}[\mathbb{E}(\phi(X) \mid Y)]$, the rank of $\Gamma$ (i.e., the dimension of the image of $\Gamma$) is at most $d$, which implies that $\Gamma$ is compact. Together with the fact that it is self-adjoint and positive semidefinite, there exist $d_\Gamma$ positive eigenvalues $\{\tau_i\}_{i=1}^{d_\Gamma}$ and eigenvectors $\{u_i\}_{i=1}^{d_\Gamma}$ such that $\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i$.

Recall that, for any $f \in \mathcal{H}$,

$$\Gamma f = \mathbb{E}\big[\langle\mathbb{E}[\phi(X) \mid Y], f\rangle\, \mathbb{E}[\phi(X) \mid Y]\big] \tag{A.2}$$

also belongs to

$$\operatorname{span}\{\Sigma\beta_i,\; i = 1, \dots, d\} \subset \operatorname{range}(\Sigma) \tag{A.3}$$

because of (A.1), so

$$u_i = \frac{1}{\tau_i}\,\Gamma u_i \in \operatorname{range}(\Sigma). \tag{A.4}$$

This proves (i).

Since $\Gamma f \in \operatorname{range}(\Sigma)$ for each $f \in \mathcal{H}$, the operator $T = \Sigma^{-1}\Gamma$ is well defined over the whole space. Moreover,

$$T f = \Sigma^{-1}\left(\sum_{i=1}^{d_\Gamma} \tau_i \langle u_i, f\rangle u_i\right) = \sum_{i=1}^{d_\Gamma} \tau_i \langle u_i, f\rangle\, \Sigma^{-1}(u_i) = \left(\sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes \Sigma^{-1}(u_i)\right) f. \tag{A.5}$$

This proves (ii).

B. Proof of Proposition 7

We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, where $d_K$ is infinite and a matrix form does not make sense.

Let $\Phi = [\phi(x_1), \dots, \phi(x_n)]$. It has the singular value decomposition

$$\Phi = [u_1 \cdots u_{d_K}] \begin{bmatrix} D_{d\times d} & 0_{d\times(n-d)} \\ 0_{(d_K-d)\times d} & 0_{(d_K-d)\times(n-d)} \end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix} = U D V^T, \tag{B.1}$$

where $U = [u_1, \dots, u_d]$, $V = [v_1, \dots, v_d]$, and $D = D_{d\times d}$ is a diagonal matrix of dimension $d \le n$.


We need to show that the KSIR variates satisfy

$$v_j = c_j^T\big(k(x, x_1), \dots, k(x, x_n)\big)^T = c_j^T \Phi^T \phi(x) = \langle\Phi c_j, \phi(x)\rangle = \langle\beta_j, \phi(x)\rangle. \tag{B.2}$$

It suffices to prove that if $(\lambda, c)$ is a solution of (28), then $(\lambda, \beta)$ is an eigenvalue-eigenvector pair of $(\hat{\Sigma}^2 + sI)^{-1}\hat{\Sigma}\hat{\Gamma}$, and vice versa, where $c$ and $\beta$ are related by

$$\beta = \Phi c, \qquad c = V D^{-1} U^T \beta. \tag{B.3}$$

Noting the facts $\hat{\Sigma} = (1/n)\Phi\Phi^T$, $\hat{\Gamma} = (1/n)\Phi J \Phi^T$, and $K = \Phi^T\Phi = V D^2 V^T$, the argument may be made as follows:

$$\begin{aligned}
\hat{\Sigma}\hat{\Gamma}\beta = \lambda(\hat{\Sigma}^2 + sI)\beta
&\iff \Phi\Phi^T\Phi J \Phi^T\Phi c = \lambda\big(\Phi\Phi^T\Phi\Phi^T\Phi c + n^2 s\, \Phi c\big) \\
&\iff \Phi K J K c = \lambda\, \Phi(K^2 + n^2 s I) c \\
&\ \big({\iff}\ V V^T K J K c = \lambda\, V V^T (K^2 + n^2 s I) c\big) \\
&\iff K J K c = \lambda (K^2 + n^2 s I) c.
\end{aligned} \tag{B.4}$$

Note that the implication in the third step is needed only in the $\Rightarrow$ direction; it is obtained by multiplying both sides by $V D^{-1} U^T$ and using the fact that $U^T U = I_d$. For the last step, since $V^T V = I_d$, we use the following facts:

$$V V^T K = V V^T V D^2 V^T = V D^2 V^T = K, \qquad V V^T c = V V^T V D^{-1} U^T \beta = V D^{-1} U^T \beta = c. \tag{B.5}$$

In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. In the infinite dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^{n} v_i \phi(x_i)$ for $v = (v_1, \dots, v_n)^T \in \mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$ such that $\Phi^T f = (\langle\phi(x_1), f\rangle_K, \dots, \langle\phi(x_n), f\rangle_K)^T$ for $f \in \mathcal{H}_K$. The notions $U$ and $U^T$ are defined similarly.

The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\hat{\Sigma}$ as a covariance operator. Since the rank of $\hat{\Sigma}$ is at most $n$, it is compact and has the representation

$$\hat{\Sigma} = \sum_{i=1}^{d_K} \sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d} \sigma_i\, u_i \otimes u_i, \tag{B.6}$$

where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\operatorname{span}(u_1, \dots, u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1, \dots, u_d)$ should be considered as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i \in \mathbb{R}^d$. Denote $\Upsilon = (\tau_1, \dots, \tau_n)^T \in \mathbb{R}^{n\times d}$. It is easy to check that $\Upsilon^T\Upsilon = \operatorname{diag}(n\sigma_1, \dots, n\sigma_d)$. Let $D_{d\times d} = \operatorname{diag}(\sqrt{n\sigma_1}, \dots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD of $\Phi$ as $\Phi = U D V^T$, which is well defined.

C. Proof of Consistency

C.1. Preliminaries. In order to prove Theorems 9 and 10, we use the properties of Hilbert-Schmidt operators, of covariance operators for Hilbert space valued random variables, and of the perturbation theory for linear operators. In this subsection we provide a brief introduction to them; for details see [27–29] and the references therein.

Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if

$$\|L\|_{HS}^2 = \sum_{i=1}^{p_{\mathcal{H}}} \|L e_i\|_{\mathcal{H}}^2 < \infty, \tag{C.1}$$

where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{HS}$.

Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:

$$\|SL\|_{HS} \le \|S\|\,\|L\|_{HS}, \qquad \|LS\|_{HS} \le \|L\|_{HS}\,\|S\|, \tag{C.2}$$

where $\|\cdot\|$ denotes the operator norm,

$$\|L\|^2 = \sup_{f \in \mathcal{H}} \frac{\|Lf\|^2}{\|f\|^2}. \tag{C.3}$$

Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^2 < \infty$. The covariance operator

$$\Sigma = \mathbb{E}[(Z - \mathbb{E}Z) \otimes (Z - \mathbb{E}Z)] \tag{C.4}$$

is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.

A well-known result from the perturbation theory of linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$, with the same rate of convergence as that of the operators.

C.2. Proof of Theorem 10. We will use the following result from [14]:

$$\|\hat{\Sigma} - \Sigma\|_{HS} = O_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad \|\hat{\Gamma} - \Gamma\|_{HS} = O_p\!\left(\frac{1}{\sqrt{n}}\right). \tag{C.5}$$

To simplify the notation, we denote $\hat{T}_s = (\hat{\Sigma}^2 + sI)^{-1}\hat{\Sigma}\hat{\Gamma}$. Also define

$$T_1 = (\hat{\Sigma}^2 + sI)^{-1}\Sigma\Gamma, \qquad T_2 = (\Sigma^2 + sI)^{-1}\Sigma\Gamma. \tag{C.6}$$

Then

$$\|\hat{T}_s - T\|_{HS} \le \|\hat{T}_s - T_1\|_{HS} + \|T_1 - T_2\|_{HS} + \|T_2 - T\|_{HS}. \tag{C.7}$$


For the first term, observe that

$$\|\hat{T}_s - T_1\|_{HS} \le \big\|(\hat{\Sigma}^2 + sI)^{-1}\big\|\, \big\|\hat{\Sigma}\hat{\Gamma} - \Sigma\Gamma\big\|_{HS} = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \tag{C.8}$$

For the second term, note that

$$T_1 = \sum_{j=1}^{d_\Gamma} \tau_j \big((\hat{\Sigma}^2 + sI)^{-1}\Sigma u_j\big) \otimes u_j, \qquad T_2 = \sum_{j=1}^{d_\Gamma} \tau_j \big((\Sigma^2 + sI)^{-1}\Sigma u_j\big) \otimes u_j. \tag{C.9}$$

Therefore

$$\|T_1 - T_2\|_{HS} \le \sum_{j=1}^{d_\Gamma} \tau_j \Big\|\big((\hat{\Sigma}^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\big)\Sigma u_j\Big\|. \tag{C.10}$$

Since $u_j \in \operatorname{range}(\Sigma)$, there exists $\ell_j$ such that $u_j = \Sigma\ell_j$. Then

$$\big((\hat{\Sigma}^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\big)\Sigma u_j = (\hat{\Sigma}^2 + sI)^{-1}(\Sigma^2 - \hat{\Sigma}^2)(\Sigma^2 + sI)^{-1}\Sigma^2\ell_j, \tag{C.11}$$

which implies

$$\|T_1 - T_2\|_{HS} \le \sum_{j=1}^{d_\Gamma} \tau_j \big\|(\hat{\Sigma}^2 + sI)^{-1}\big\|\, \big\|\hat{\Sigma}^2 - \Sigma^2\big\|_{HS}\, \big\|(\Sigma^2 + sI)^{-1}\Sigma^2\big\|\, \|\ell_j\| = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \tag{C.12}$$

For the third term, the following holds:

$$\|T_2 - T\|_{HS}^2 = \sum_{j=1}^{d_\Gamma} \tau_j^2 \Big\|\big((\Sigma^2 + sI)^{-1}\Sigma - \Sigma^{-1}\big)u_j\Big\|^2, \tag{C.13}$$

and for each $j = 1, \dots, d_\Gamma$,

$$\begin{aligned}
\big\|(\Sigma^2 + sI)^{-1}\Sigma u_j - \Sigma^{-1}u_j\big\|
&\le \big\|(\Sigma^2 + sI)^{-1}\Sigma^2\ell_j - \ell_j\big\| \\
&= \left\|\sum_{i=1}^{\infty}\left(\frac{w_i^2}{s + w_i^2} - 1\right)\langle\ell_j, e_i\rangle e_i\right\| \\
&= \left(\sum_{i=1}^{\infty} \frac{s^2}{(s + w_i^2)^2}\langle\ell_j, e_i\rangle^2\right)^{1/2} \\
&\le \frac{s}{w_N^2}\left(\sum_{i=1}^{N}\langle\ell_j, e_i\rangle^2\right)^{1/2} + \left(\sum_{i=N+1}^{\infty}\langle\ell_j, e_i\rangle^2\right)^{1/2} \\
&= \frac{s}{w_N^2}\,\big\|\Pi_N(\ell_j)\big\| + \big\|\Pi_N^{\perp}(\ell_j)\big\|.
\end{aligned} \tag{C.14}$$

Combining these terms results in (33).

Since $\|\Pi_N^{\perp}(\ell_j)\| \to 0$ as $N \to \infty$, we consequently have

$$\|\hat{T}_s - T\|_{HS} = o_p(1) \tag{C.15}$$

if $s \to 0$ and $s\sqrt{n} \to \infty$.

If all the e.d.r. directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N \ge 1$ such that every $\beta_i$ lies in $\operatorname{span}\{e_1, \dots, e_N\}$; hence every $u_j$ lies in $S^{\ast} = \operatorname{span}\{\Sigma e_1, \dots, \Sigma e_N\}$. This implies that

$$\ell_j = \Sigma^{-1}u_j \in \Sigma^{-1}(S^{\ast}) \subset \operatorname{span}\{e_1, \dots, e_N\}. \tag{C.16}$$

Therefore $\Pi_N^{\perp}(\ell_j) = 0$, and taking $s = O(n^{-1/4})$ gives the rate $O_p(n^{-1/4})$.

Acknowledgments

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50-GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or the NIH.

References

[1] K.-C. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316–342, 1991.

[2] N. Duan and K.-C. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505–530, 1991.

[3] R. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328–332, 1991.

[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025–1039, 1992.

[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615–625, 2007.

[6] B. Schölkopf, A. J. Smola, and K. Müller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583–588, Springer, Lausanne, Switzerland, October 1997.

[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.

[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.

[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.

[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.

[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590–1603, 2009.

[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441–458, pp. 415–446, 1909.

[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Birkhäuser, Basel, Switzerland, 1986.

[14] L. Ferré and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475–488, 2003.

[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867–889, 1993.

[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54–77, 2003.

[17] L. Ferré and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665–683, 2005.

[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169–4175, 2005.

[19] L. Ferré and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807–823, 2006.

[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124–131, 2008.

[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85–98, 2009.

[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040–1061, 1992.

[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141–2171, 1997.

[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727–736, 1995.

[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628–635, 2005.

[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871–1905, 2009.

[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.

[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.

[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.


Page 4: Research Article Kernel Sliced Inverse Regression ...downloads.hindawi.com/journals/aaa/2013/540725.pdf · high-dimensional space [ ]. is conforms to the argument in [ ] that the

4 Abstract and Applied Analysis

(3) Estimate the SIR directions 120573119895by solving the general-

ized eigendecomposition problem

Γ120573 = 120582Σ120573 (19)

This procedure is computationally impossible if theRKHS is infinite dimensional or the feature map cannot becomputed (which is the usual case) However the modelgiven in (9) requires not the edr directions but the projec-tion onto these directions that is the 119889 summary statistics

V1= ⟨1205731 120601 (119909)⟩ V119889 = ⟨120573119889 120601 (119909)⟩ (20)

which we call the KSIR variates Like other kernel algorithmsthe kernel trick enables KSIR variates to be efficiently com-puted from only the kernel 119896 not the map 120601

The key quantity in this alternative formulation is thecentred Grammatrix119870 defined by the kernel function 119896(sdot sdot)where

119870119894119895= ⟨120601 (119909

119894) 120601 (119909

119895)⟩ = 119896 (119909

119894 119909119895) (21)

Note that the rank of119870 is less than 119899 so119870 is always singularGiven the centered Gram matrix 119870 the following gener-

alized eigendecomposition problem can be used to computethe KSIR variates

119870119869119870119888 = 1205821198702119888 (22)

where 119888 denotes the 119899-dimensional generalized eigenvectorand 119869 denotes an 119899 times 119899 matrix with 119869

119894119895= 1119899

119898if 119894 119895

are in the 119898th group consisting of 119899119898

observations andzero otherwise The following proposition establishes theequivalence between two eigendecomposition problems (22)and (19) in the recovery of KSIR variates (V

1 V

119889)

Proposition 4 Given the observations (1199091 1199101) (119909

119899 119910119899)

let (1205731 120573

119889) and (119888

1 119888

119889) denote the generalized eigenvec-

tors of (22) and (19) respectively Then for any 119909 isin X and119895 = 1 119889 the following holds

V119895= ⟨120573119895 120601 (119909)⟩ = 119888

119879

119895119870119909

119870119909= (119896 (119909 119909

1) 119896 (119909 119909

119899))119879

(23)

provided that Σ is invertible When Σ is not invertible theconclusion holds modulo the null space of Σ

This result was proven in [10] in which the algorithmwasfurther reduced to solving

119869119870119888 = 120582119870119888 (24)

by canceling 119870 from both sides of (22)

Remark 5 It is important to remark that when Σ is not invert-ible Proposition 4 states that the equivalence between (19)and (22) holds modulo the null space of Σ that is a requi-rement of the representer theorem that is 120573

119895are linear com-

binations of 120601(119909119894) Without this mandated condition we will

see that each eigenvector of (22) produces an eigenvector of(19) while the eigenvector of (19) is not necessarily recoveredby (22)

Remark 6 It is necessary to clarify the difference between (12)and (19) In (12) it is natural to assume that 120573 is orthogonalto the null space of Σ To see this let 120573 = 1205730 + 120573⋆ with 1205730belonging to the null space and 120573⋆ orthogonal to the nullspace Then

E [⟨120573 120601 (119909)⟩2] = ⟨120573 Σ120573⟩ = ⟨120573

⋆ Σ120573⋆⟩ = E [⟨120573

⋆ 120601 (119909)⟩

2]

(25)

that is ⟨120573 120601(119909)⟩ = ⟨120573⋆ 120601(119909)⟩ (as) It means that 1205730does

not contribute to the KSIR variates and thus could be setas 0 However in (19) if 120573 is an eigenvector and 120573⋆ is itsorthogonal component relative to the null space of Σ theidentity ⟨120573 120601(119909)⟩ = ⟨120573⋆ 120601(119909)⟩ is in general not true for anew point 119909 which is different from the observations Thusfrom a theoretical perspective it is not as natural to assumethe representer theorem although it works well in practiceIn this sense the KSIR algorithm based on (22) or (24) doesnot have a thorough mathematical foundation

32 Regularization and Stability Except for the theoreticalsubtleties in applications with relatively small samples theeigendecomposition in (22) is often ill-conditioned resultingin overfitting as well as numerically unstable estimates ofthe edr space This can be addressed by either thresholdingeigenvalues of the estimated covariancematrix Σ or by addinga regularization term to (22) or (24)

We motivate two types of regularization schemes Thefirst one is the traditional ridge regularization It is used inboth linear SIR and functional SIR [18ndash20] which solves theeigendecomposition problem

Γ120573 = 120582 (Σ + 119904119868) 120573 (26)

Here and in the sequel 119868 denotes the identity operator and 119904is a tuning parameter Assuming the representer theorem itskernel form is given as

119869119870119888 = 120582 (119870 + 119899119904119868) 119888 (27)

Another type of regularization is to regularize (22) dire-ctly

119870119869119870119888 = 120582 (1198702+ 1198992119904119868) 119888 (28)

The following proposition whose proof is in Appendix Bstates that solving the generalized eigendecomposition prob-lem (28) is equivalent to finding the eigenvectors of

(Σ2+ 119904119868)minus1

ΣΓ (29)

Proposition 7 Let 119888119895be the eigenvectors of (28) and let 120573

119895

be the eigenvectors of (29) Then the following holds for theregularized KSIR variates

V119895= 119888119879

119895[119896 (119909 119909

1) 119896 (119909 119909

119899)] = ⟨120573

119895 120601 (119909)⟩ (30)

Abstract and Applied Analysis 5

This algorithm is termed as the Tikhonov regularizationFor linear SIR it is shown in [21] that the Tikhonov regular-ization is more efficient than the ridge regularization

Except for the computational stability regularization alsomakes the matrix forms of KSIR (27) and (28) interpretableby justifiable representer theorem

Proposition 8 For both ridge and the Tikhonov regulariza-tion scheme of KSIR the eigenfunctions 120573

119895are linear combina-

tions of 120601(119909119894) 119894 = 1 119899

The conclusion follows from the observation that 120573119895=

(1120582119895)(Γ120573119895minus Σ120573119895) for the ridge regularization and 120573

119895=

(1120582119895)(ΣΓ120573

119895minus Σ2120573119895) for the Tikhonov regularization

To close we remark that KSIR is computationally advan-tageous even for the case of linear models when 119901 ≫ 119899 dueto the fact that the eigendecomposition problem is for 119899 times 119899matrices rather than the 119901 times 119901 matrices in the standard SIRformulation

33 Consistency of Regularized KSIR In this subsection weprove the asymptotic consistency of the edr directionsestimated by regularized KSIR and provide conditions underwhich the rate of convergence is 119874

119901(119899minus14) An important

observation from the proof is that the rate of convergence ofthe edr directions depends on the contribution of the smallprincipal components The rate can be arbitrarily slow if theedr space depends heavily on eigenvectors corresponding tosmall eigenvalues of the covariance operator

Note that various consistency results are available forlinear SIR [22ndash24] These results hold only for the finitedimensional setting and cannot be adapted to KSIR wherethe RKHS is often infinite dimensional Consistency of func-tional SIR has also been studied before In [14] a thresholdingmethod is considered which selects a finite number ofeigenvectors and uses results fromfinite rank operatorsTheirproof of consistency requires stronger and more complicatedconditions than oursThe consistency for functional SIR withridge regularization is proven in [19] but it is of aweaker formthan our result We remark that the consistency results forfunctional SIR can be improved using a similar argument inthis paper

In the following we state the consistency results for theTikhonov regularization A similar result can be proved forthe ridge regularization while the details are omitted

Theorem 9 Assume E119896(119883119883)2 lt infin lim119899rarrinfin

119904(119899) = 0 andlim119899rarrinfin

119904radic119899 = infin then

10038161003816100381610038161003816⟨120573119895 120601 (sdot)⟩ minus ⟨120573119895 120601 (sdot)⟩

10038161003816100381610038161003816= 119900119901 (1) 119895 = 1 119889

Γ (31)

where 119889Γis the rank of Γ ⟨120573

119895 120601(sdot)⟩ is the projection onto the

119895th edr and ⟨120573119895 120601(sdot)⟩ is the projection onto the 119895th edr as

estimated by regularized KSIRIf the edr directions 120573

119895119889Γ

119895=1depend only on a finite

number of eigenvectors of the covariance operator Σ the rateof convergence is 119874(119899minus14)

This theorem is a direct corollary of the following theoremwhich is proven in Appendix C

Theorem 10 Define the projection operator and its comple-ment for each119873 ge 1

Π119873=

119873

sum

119894=1

119890119894otimes 119890119894 Π

perp

119873= 119868 minus Π

119873=

infin

sum

119894=119873+1

119890119894otimes 119890119894 (32)

where 119890119894infin

119894=1are the eigenvectors of the covariance operator Σ

as defined in (8) with the corresponding eigenvalues denotedby 119908119894

Assume E119896(119883119883)2 lt infin For each 119873 ge 1 the followingholds10038171003817100381710038171003817(Σ2+ 119904119868)minus1ΣΓ minus 119879

10038171003817100381710038171003817119867119878

= 119874119901(1

119904radic119899) +

119889Γ

sum

119895=1

(119904

1199082119873

10038171003817100381710038171003817Π119873(119895)10038171003817100381710038171003817+10038171003817100381710038171003817Πperp

119873(119895)10038171003817100381710038171003817)

(33)

where 119895= Σminus1119906119895and 119906

119895119889Γ

119894=1are the eigenvectors of Γ as

defined in (14) and sdot 119867119878

denotes the Hilbert-Schmidt normof a linear operator

If 119904 = 119904(119899) satisfy 119904 rarr 0 and 119904radic119899 rarr infin as 119899 rarr infin then10038171003817100381710038171003817(Σ2+ 119904119868)minus1ΣΓ minus 119879

10038171003817100381710038171003817119867119878= 119900119901 (1) (34)

4 Application to Simulated and Real Data

In this section we compare regularized kernel sliced inverseregression (RKSIR) with several other SIR-related dimensionreductionmethodsThe comparisons are used to address twoquestions (1) does regularization improve the performanceof kernel sliced inverse regression and (2)does the nonlinear-ity of kernel sliced inverse regression improve the predictionaccuracy

We would like to remark that the assessment of nonlineardimension reduction methods could be more difficult thanthat of linear ones When the feature mapping 120601 for an RKHSis not available we do not know the true edr directions orsubspace So in that case we will use the prediction accuracyto evaluate the goodness of RKSIR

41 Importance of Nonlinearity and Regularization Our firstexample illustrates that both the nonlinearity and regulariza-tion of RKSIR can significantly improve prediction accuracy

The regression model has ten predictor variables 119883 =

(1198831 119883

10) with each one following a normal distribution

119883119894sim 119873(0 1) A univariate response is given as

119884 = (sin (1198831) + sin (119883

2)) (1 + sin (119883

3)) + 120576 (35)

with noise 120576 sim 119873(0 012) We compare the effectiveness

of the linear dimension reduction methods SIR SAVE andPHD with RKSIR by examining the predictive accuracy of anonlinear kernel regression model on the reduced space Wegenerate 100 training samples and apply the abovemethods to

6 Abstract and Applied Analysis

minus6 minus5 minus4 minus3 minus2 minus1 0 1 202

03

04

05

06

07

08

09

1M

ean

squa

re er

ror

d = 1

d = 2

log10

(s)

(a)

1 2 3 4 5 6 7 8 9 100

02

04

06

08

1

12

14

Number of edr directions

Mea

n sq

uare

erro

r

RKSIRSIR

SAVEPHD

(b)

minus2 minus15 minus1 minus05 0 05 1 15 2minus2

minus15

minus1

minus05

0

05

1

15

2

KSIR variates for the first nonlinear edr direction

sin(X

1)+

sin(X

2)

(c)

2

18

16

14

12

1

08

06

04

02

0minus15 minus1 minus05 0 05 1 15

KSIR variates for the second nonlinear edr direction

1+

sin(X

3)

(d)

Figure 1 Dimension reduction for model (35) (a) mean square error for RKSIR with different regularization parameters (b) mean squareerror for various dimension reduction methods The blue line represents the mean square error without using dimension reduction (c)-(d)RKSIR variates versus nonlinear factors

compute the edr directions In RKSIR we used an additiveGaussian kernel as follows

119896 (119909 119911) =

119889

sum

119895=1

exp(minus(119909119895minus 119911119895)2

21205902) (36)

where 120590 = 2 After projecting the training samples on theestimated edr directions we train a Gaussian kernel regres-sion model based on the new variatesThen the mean squareerror is computed on 1000 independent test samples Thisexperiment is repeated 100 times with all parameters set bycross-validation The results are summarized in Figure 1

Figure 1(a) displays the accuracy for RKSIR as a functionof the regularization parameter illustrating the importance of

selecting regularization parameters KSIRwithout regulariza-tion performs much worse than the RKSIR Figure 1(b) dis-plays the prediction accuracy of various dimension reductionmethods

RKSIR outperforms all the linear dimension reductionmethods which illustrates the power of nonlinearity intro-duced in RKSIR It also suggests that there are essentially twononlinear edr directions This observation seems to agreewith the model in (35) Indeed Figures 1(c) and 1(d) showthat the first two edr directions from RKSIR estimate thetwo nonlinear factors well

42 Effect of Regularization This example illustrates theeffect of regularization on the performance of KSIR as afunction of the anisotropy of the predictors

Abstract and Applied Analysis 7

The regression model has ten predictor variables 119883 =

(1198831 119883

10) and a univariate response specified by

119884 = 1198831+ 1198832

2+ 120576 120576 sim 119873 (0 01

2) (37)

where 119883 sim 119873(0 Σ119883) and Σ

119883= 119876Δ119876 with 119876 a randomly

chosen orthogonal matrix and Δ = diag(1120579 2120579 10120579) Wewill see that increasing the parameter 120579 isin [0infin) increasesthe anisotropy of the data that increases the difficulty ofidentifying the correct edr directions

For thismodel it is known that SIRwillmiss the directionalong the second variable119883

2 So we focus on the comparison

of KSIR and RKSIR in this exampleIf we use a second-order polynomial kernel 119896(119909 119911) =

(1 + 119909119879119911)2 that corresponds to the feature space

Φ (119883) = 1119883119894 (119883119894119883119895)119894le119895 119894 119895 = 1 10 (38)

then 1198831+ 1198832

2can be captured in one edr direction Ideally

the first KSIR variate V = ⟨1205731 120601(119883)⟩ should be equivalent to

1198831+ 1198832

2modulo shift and scale

V minus EV prop 1198831+ 1198832

2minus E (119883

1+ 1198832

2) (39)

So for this example given estimates of KSIR variates at the 119899data points V

119894119899

119894=1= ⟨1205731 120601(119909119894)⟩119899

119894=1 the error of the first edr

direction can bemeasured by the least squares fitting of Vwithrespect to (119883

1+ 1198832

2)

error = min119886119887isinR

1

119899

119899

sum

119894=1

(V119894minus (119886 (119909

1198941+ 1199092

1198942) + 119887))

2

(40)

We drew 200 observations from the model specifiedin (37) and then we applied the two dimension reductionmethods KSIR and RKSIR The mean and standard errorsof 100 repetitions of this procedure are reported in Figure 2The result shows that KSIR becomes more andmore unstableas 120579 increases and the regularization helps to reduce thisinstability

43 Importance of Nonlinearity and Regularization in RealData When SIR is applied to classification problems it isequivalent to a Fisher discriminant analysis For the case ofmulticlass classification it is natural to use SIR and considereach class as a slice Kernel forms of Fisher discriminantanalysis (KFDA) [8] have been used to construct nonlineardiscriminant surfaces and the regularization has improvedperformance of KFDA [25] In this example we show thatthis idea of adding a nonlinearity and a regularization termimproves predictive accuracy in a realmulticlass classificationdata set the classification of handwritten digits

The MNIST data set (Y LeCun httpyannlecuncomexdbmnist) contains 60 000 images of handwritten digits0 1 2 9 as training data and 10 000 images as test dataEach image consists of 119901 = 28 times 28 = 784 gray-scalepixel intensities It is commonly believed that there is clearnonlinear structure in this 784-dimensional space

minus05 0 05 1 15 2 250

100

200

300

400

500

600

700

120579 parameter

Erro

r rat

e

KSIRRKSIR

Figure 2 Error in edr as a function of 120579

We compared regularized SIR (RSIR) as in (26) KSIRand RKSIR on this data to examine the effect of regulariza-tion and nonlinearity Each draw of the training set consistedof 100 observations of each digit We then computed thetop 10 edr directions using these 1000 observations and10 slices one for each digit We projected the 10 000 testobservations onto the edr directions and used a k-nearestneighbor (kNN) classifier with 119896 = 5 to classify the test dataThe accuracy of the kNN classifier without dimensionreduction was used as a baseline For KSIR and RKSIR weused a Gaussian kernel with the bandwith parameter setas the median pairwise distance between observations Theregularization parameter was set by cross-validation

The mean and standard deviation of the classification error rates over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy further. This improvement is dramatic for digits 2, 3, 5, and 8.

5. Discussion

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework, so the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.

Table 1: Mean and standard deviations of error rates in the classification of digits.

Digit    RKSIR             KSIR              RSIR              kNN
0        0.0273 (0.0089)   0.0472 (0.0191)   0.0487 (0.0128)   0.0291 (0.0071)
1        0.0150 (0.0049)   0.0177 (0.0051)   0.0292 (0.0113)   0.0052 (0.0012)
2        0.1039 (0.0207)   0.1475 (0.0497)   0.1921 (0.0238)   0.2008 (0.0186)
3        0.0845 (0.0208)   0.1279 (0.0494)   0.1723 (0.0283)   0.1092 (0.0130)
4        0.0784 (0.0240)   0.1044 (0.0461)   0.1327 (0.0327)   0.1617 (0.0213)
5        0.0877 (0.0209)   0.1327 (0.0540)   0.2146 (0.0294)   0.1419 (0.0193)
6        0.0472 (0.0108)   0.0804 (0.0383)   0.0816 (0.0172)   0.0446 (0.0081)
7        0.0887 (0.0169)   0.1119 (0.0357)   0.1354 (0.0172)   0.1140 (0.0125)
8        0.0981 (0.0259)   0.1490 (0.0699)   0.1981 (0.0286)   0.1140 (0.0156)
9        0.0774 (0.0251)   0.1095 (0.0398)   0.1533 (0.0212)   0.2006 (0.0153)
Average  0.0708 (0.0105)   0.1016 (0.0190)   0.1358 (0.0093)   0.1177 (0.0039)

RKHS has also been introduced into supervised dimension reduction in [26], where conditional covariance operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Their method therefore estimates linear e.d.r. subspaces, whereas in [10, 11] and in this paper the RKHS is used to model the e.d.r. subspace itself, which leads to nonlinear dimension reduction.

There are several open issues in the regularized kernel SIR method, such as the selection of kernels, regularization parameters, and the number of dimensions. A direct assessment of the nonlinear e.d.r. directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well studied for linear dimension reduction, little is known for nonlinear dimension reduction. We leave them for future research.

There are some interesting connections between KSIR and functional SIR, which was developed by Ferre and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear e.d.r. directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS, or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR are different from those for functional SIR.

Appendices

A. Proof of Theorem 3

Under the assumption of Proposition 2, for each $Y = y$,
$$\mathbb{E}[\phi(X) \mid Y = y] \in \operatorname{span}\{\Sigma\beta_i,\; i = 1, \ldots, d\}. \qquad (A.1)$$
Since $\Gamma = \operatorname{cov}[\mathbb{E}(\phi(X) \mid Y)]$, the rank of $\Gamma$ (i.e., the dimension of the image of $\Gamma$) is at most $d$, which implies that $\Gamma$ is compact. Together with the fact that it is symmetric and positive semidefinite, there exist $d_\Gamma$ positive eigenvalues $\{\tau_i\}_{i=1}^{d_\Gamma}$ and eigenvectors $\{u_i\}_{i=1}^{d_\Gamma}$ such that $\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i$.

Recall that, for any $f \in \mathcal{H}$,
$$\Gamma f = \mathbb{E}\,\bigl\langle \mathbb{E}[\phi(X) \mid Y], f\bigr\rangle\, \mathbb{E}[\phi(X) \mid Y] \qquad (A.2)$$
also belongs to
$$\operatorname{span}\{\Sigma\beta_i,\; i = 1, \ldots, d\} \subset \operatorname{range}(\Sigma) \qquad (A.3)$$
because of (A.1), so
$$u_i = \frac{1}{\tau_i}\,\Gamma u_i \in \operatorname{range}(\Sigma). \qquad (A.4)$$
This proves (i).

Since $\Gamma f \in \operatorname{range}(\Sigma)$ for each $f \in \mathcal{H}$, the operator $T = \Sigma^{-1}\Gamma$ is well defined over the whole space. Moreover,
$$T f = \Sigma^{-1}\Bigl(\sum_{i=1}^{d_\Gamma} \langle u_i, f\rangle\, u_i\Bigr) = \sum_{i=1}^{d_\Gamma} \langle u_i, f\rangle\, \Sigma^{-1}(u_i) = \Bigl(\sum_{i=1}^{d_\Gamma} \Sigma^{-1}(u_i) \otimes u_i\Bigr) f. \qquad (A.5)$$
This proves (ii).

B. Proof of Proposition 7

We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, for which $d_K$ is infinite and a matrix form does not make sense.

Let $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$. It has the following SVD decomposition:
$$\Phi = [u_1 \cdots u_{d_K}]
\begin{bmatrix} D_{d\times d} & 0_{d\times(n-d)} \\ 0_{(d_K-d)\times d} & 0_{(d_K-d)\times(n-d)} \end{bmatrix}
\begin{bmatrix} v_1^{T} \\ \vdots \\ v_n^{T} \end{bmatrix}
= U D V^{T}, \qquad (B.1)$$
where $U = [u_1, \ldots, u_d]$, $V = [v_1, \ldots, v_d]$, and $D = D_{d\times d}$ is a diagonal matrix of dimension $d \le n$.


We need to show that the KSIR variates satisfy
$$v_j = c_j^{T}[k(x, x_1), \ldots, k(x, x_n)]^{T} = c_j^{T}\Phi^{T}\phi(x) = \langle \Phi c_j, \phi(x)\rangle = \langle \beta_j, \phi(x)\rangle. \qquad (B.2)$$
It suffices to prove that if $(\lambda, c)$ is a solution to (28), then $(\lambda, \beta)$ is also an eigenvalue-eigenvector pair of $(\Sigma^{2} + \gamma I)^{-1}\Sigma\Gamma$, and vice versa, where $c$ and $\beta$ are related by
$$\beta = \Phi c, \qquad c = V D^{-1} U^{T}\beta. \qquad (B.3)$$
Noting the facts $\Sigma = (1/n)\Phi\Phi^{T}$, $\Gamma = (1/n)\Phi J \Phi^{T}$, and $K = \Phi^{T}\Phi = V D^{2} V^{T}$, the argument may be made as follows:
$$\begin{aligned}
\Sigma\Gamma\beta = \lambda(\Sigma^{2} + \gamma I)\beta
&\Longleftrightarrow \Phi\Phi^{T}\Phi J \Phi^{T}\Phi c = \lambda\bigl(\Phi\Phi^{T}\Phi\Phi^{T}\Phi c + n^{2}\gamma\,\Phi c\bigr) \\
&\Longleftrightarrow \Phi K J K c = \lambda\,\Phi(K^{2} + n^{2}\gamma I)c \\
&\bigl(\Longleftrightarrow V V^{T} K J K c = \lambda\, V V^{T}(K^{2} + n^{2}\gamma I)c\bigr) \\
&\Longleftrightarrow K J K c = \lambda(K^{2} + n^{2}\gamma I)c.
\end{aligned} \qquad (B.4)$$
Note that the implication in the third step is needed only in the $\Rightarrow$ direction; it is obtained by multiplying both sides by $V D^{-1} U^{T}$ and using the fact $U^{T}U = I_d$. For the last step, since $V^{T}V = I_d$, we use the following facts:
$$V V^{T} K = V V^{T} V D^{2} V^{T} = V D^{2} V^{T} = K, \qquad
V V^{T} c = V V^{T} V D^{-1} U^{T}\beta = V D^{-1} U^{T}\beta = c. \qquad (B.5)$$

In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^{T}$, and the SVD of $\Phi$ when $d_K$ is infinite. For the infinite-dimensional case, $\Phi$ is an operator from $\mathbb{R}^{n}$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^{n} v_i\phi(x_i)$ for $v = (v_1, \ldots, v_n)^{T} \in \mathbb{R}^{n}$, and $\Phi^{T}$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^{n}$ such that $\Phi^{T}f = (\langle\phi(x_1), f\rangle_K, \ldots, \langle\phi(x_n), f\rangle_K)^{T}$ for $f \in \mathcal{H}_K$. The notions $U$ and $U^{T}$ are defined similarly.

The above formulation of $\Phi$ and $\Phi^{T}$ coincides with the definition of $\Sigma$ as a covariance operator. Since the rank of $\Sigma$ is less than $n$, it is compact and has the following representation:
$$\Sigma = \sum_{i=1}^{d_K}\sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d}\sigma_i\, u_i \otimes u_i, \qquad (B.6)$$
where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\operatorname{span}(u_1, \ldots, u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1, \ldots, u_d)$ should be regarded as an operator from $\mathbb{R}^{d}$ to $\mathcal{H}_K$ and $\tau_i \in \mathbb{R}^{d}$. Denote $\Upsilon = (\tau_1, \ldots, \tau_n)^{T} \in \mathbb{R}^{n\times d}$. It is easy to check that $\Upsilon^{T}\Upsilon = \operatorname{diag}(n\sigma_1, \ldots, n\sigma_d)$. Let $D_{d\times d} = \operatorname{diag}(\sqrt{n\sigma_1}, \ldots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD for $\Phi$ as $\Phi = U D V^{T}$, which is well defined.
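The finite-dimensional equivalence established in (B.4) is easy to verify numerically: for any explicit feature matrix $\Phi$ and any symmetric matrix in place of $J$, the top generalized eigenvector of the kernelized problem maps, via $\beta = \Phi c$, to an eigenvector of the feature-space problem with the same eigenvalue. The following is our own sketch under these assumptions, with a random $\Phi$ and a random symmetric positive semidefinite stand-in for the slice matrix $J$:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, d_K, gamma = 30, 8, 1e-2

Phi = rng.standard_normal((d_K, n))     # columns play the role of phi(x_i)
K = Phi.T @ Phi                         # Gram matrix K = Phi^T Phi
A = rng.standard_normal((n, n))
J = A @ A.T                             # symmetric PSD surrogate for the slice matrix

Sigma = Phi @ Phi.T / n                 # (1/n) Phi Phi^T
Gamma = Phi @ J @ Phi.T / n             # (1/n) Phi J Phi^T

# kernelized generalized eigenproblem: K J K c = lambda (K^2 + n^2 gamma I) c
lam_all, C = eigh(K @ J @ K, K @ K + n**2 * gamma * np.eye(n))
lam, c = lam_all[-1], C[:, -1]          # top eigenpair
beta = Phi @ c                          # corresponding feature-space direction

# feature-space problem: Sigma Gamma beta = lambda (Sigma^2 + gamma I) beta
lhs = Sigma @ Gamma @ beta
rhs = lam * (Sigma @ Sigma + gamma * np.eye(d_K)) @ beta
print(np.allclose(lhs, rhs))            # True: the equivalence in (B.4)
```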

C. Proof of Consistency

C.1. Preliminaries. In order to prove Theorems 9 and 10, we use properties of Hilbert-Schmidt operators, covariance operators for Hilbert space valued random variables, and the perturbation theory for linear operators. In this subsection we provide a brief introduction to them; for details see [27–29] and the references therein.

Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if
$$\|L\|_{\mathrm{HS}}^{2} = \sum_{i=1}^{p_{\mathcal{H}}}\|L e_i\|_{\mathcal{H}}^{2} < \infty, \qquad (C.1)$$
where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{\mathrm{HS}}$. Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:
$$\|SL\|_{\mathrm{HS}} \le \|S\|\,\|L\|_{\mathrm{HS}}, \qquad \|LS\|_{\mathrm{HS}} \le \|L\|_{\mathrm{HS}}\,\|S\|, \qquad (C.2)$$
where $\|\cdot\|$ denotes the usual operator norm,
$$\|L\|^{2} = \sup_{f\in\mathcal{H}}\frac{\|Lf\|^{2}}{\|f\|^{2}}. \qquad (C.3)$$

Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^{2} < \infty$. The covariance operator
$$\Sigma = \mathbb{E}[(Z - \mathbb{E}Z)\otimes(Z - \mathbb{E}Z)] \qquad (C.4)$$
is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.

A well-known result from perturbation theory for linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ at the same rate as the convergence of the operators.
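In finite dimensions the Hilbert-Schmidt norm reduces to the Frobenius norm, and the inequality in (C.2) can be checked directly. A small numerical illustration of ours, taking the standard basis as the orthonormal basis in (C.1):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6
L = rng.standard_normal((p, p))
S = rng.standard_normal((p, p))

# (C.1): sum of ||L e_i||^2 over the standard basis equals the Frobenius norm squared
hs = np.sqrt(sum(np.linalg.norm(L[:, i]) ** 2 for i in range(p)))
assert np.isclose(hs, np.linalg.norm(L, 'fro'))

# (C.2): ||S L||_HS <= ||S|| ||L||_HS, with ||S|| the operator (spectral) norm
assert np.linalg.norm(S @ L, 'fro') <= np.linalg.norm(S, 2) * np.linalg.norm(L, 'fro') + 1e-12
```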

C.2. Proof of Theorem 10. We will use the following result from [14]:
$$\|\hat\Sigma - \Sigma\|_{\mathrm{HS}} = O_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad \|\hat\Gamma - \Gamma\|_{\mathrm{HS}} = O_p\!\left(\frac{1}{\sqrt{n}}\right). \qquad (C.5)$$
To simplify the notation, we denote $\hat T_s = (\hat\Sigma^{2} + sI)^{-1}\hat\Sigma\hat\Gamma$. Also define
$$T_1 = (\hat\Sigma^{2} + sI)^{-1}\Sigma\Gamma, \qquad T_2 = (\Sigma^{2} + sI)^{-1}\Sigma\Gamma. \qquad (C.6)$$
Then
$$\|\hat T_s - T\|_{\mathrm{HS}} \le \|\hat T_s - T_1\|_{\mathrm{HS}} + \|T_1 - T_2\|_{\mathrm{HS}} + \|T_2 - T\|_{\mathrm{HS}}. \qquad (C.7)$$


For the first term, observe that
$$\|\hat T_s - T_1\|_{\mathrm{HS}} \le \bigl\|(\hat\Sigma^{2} + sI)^{-1}\bigr\|\,\|\hat\Sigma\hat\Gamma - \Sigma\Gamma\|_{\mathrm{HS}} = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \qquad (C.8)$$
For the second term, note that
$$T_1 = \sum_{j=1}^{d_\Gamma}\tau_j\bigl((\hat\Sigma^{2} + sI)^{-1}\Sigma u_j\bigr)\otimes u_j, \qquad
T_2 = \sum_{j=1}^{d_\Gamma}\tau_j\bigl((\Sigma^{2} + sI)^{-1}\Sigma u_j\bigr)\otimes u_j. \qquad (C.9)$$
Therefore,
$$\|T_1 - T_2\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma}\tau_j\bigl\|\bigl((\hat\Sigma^{2} + sI)^{-1} - (\Sigma^{2} + sI)^{-1}\bigr)\Sigma u_j\bigr\|. \qquad (C.10)$$
Since $u_j \in \operatorname{range}(\Sigma)$, there exists $\eta_j$ such that $u_j = \Sigma\eta_j$. Then
$$\bigl((\hat\Sigma^{2} + sI)^{-1} - (\Sigma^{2} + sI)^{-1}\bigr)\Sigma u_j
= (\hat\Sigma^{2} + sI)^{-1}\bigl(\Sigma^{2} - \hat\Sigma^{2}\bigr)(\Sigma^{2} + sI)^{-1}\Sigma^{2}\eta_j, \qquad (C.11)$$
which implies
$$\|T_1 - T_2\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma}\tau_j\bigl\|(\hat\Sigma^{2} + sI)^{-1}\bigr\|\,\bigl\|\hat\Sigma^{2} - \Sigma^{2}\bigr\|_{\mathrm{HS}}\,\bigl\|(\Sigma^{2} + sI)^{-1}\Sigma^{2}\bigr\|\,\|\eta_j\| = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \qquad (C.12)$$
For the third term, the following holds:
$$\|T_2 - T\|_{\mathrm{HS}}^{2} = \sum_{j=1}^{d_\Gamma}\tau_j^{2}\bigl\|\bigl((\Sigma^{2} + sI)^{-1}\Sigma - \Sigma^{-1}\bigr)u_j\bigr\|^{2}, \qquad (C.13)$$
and for each $j = 1, \ldots, d_\Gamma$,
$$\begin{aligned}
\bigl\|(\Sigma^{2} + sI)^{-1}\Sigma u_j - \Sigma^{-1}u_j\bigr\|
&\le \bigl\|(\Sigma^{2} + sI)^{-1}\Sigma^{2}\eta_j - \eta_j\bigr\| \\
&= \Bigl\|\sum_{i=1}^{\infty}\Bigl(\frac{w_i^{2}}{s + w_i^{2}} - 1\Bigr)\langle\eta_j, e_i\rangle e_i\Bigr\| \\
&= \Bigl(\sum_{i=1}^{\infty}\frac{s^{2}}{(s + w_i^{2})^{2}}\langle\eta_j, e_i\rangle^{2}\Bigr)^{1/2} \\
&\le \frac{s}{w_N^{2}}\Bigl(\sum_{i=1}^{N}\langle\eta_j, e_i\rangle^{2}\Bigr)^{1/2} + \Bigl(\sum_{i=N+1}^{\infty}\langle\eta_j, e_i\rangle^{2}\Bigr)^{1/2} \\
&= \frac{s}{w_N^{2}}\,\|\Pi_N(\eta_j)\| + \|\Pi_N^{\perp}(\eta_j)\|. \qquad (C.14)
\end{aligned}$$
Combining these terms results in (33).

Since $\|\Pi_N^{\perp}(\eta_j)\| \to 0$ as $N \to \infty$, we consequently have
$$\|\hat T_s - T\|_{\mathrm{HS}} = o_p(1) \qquad (C.15)$$
if $s \to 0$ and $s\sqrt{n} \to \infty$.

If all the e.d.r. directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N > 1$ such that $\mathcal{S}^{*} = \operatorname{span}\{\Sigma e_i,\; i = 1, \ldots, N\}$. This implies that
$$\eta_j = \Sigma^{-1}u_j \in \Sigma^{-1}(\mathcal{S}^{*}) \subset \operatorname{span}\{e_i,\; i = 1, \ldots, N\}. \qquad (C.16)$$
Therefore $\Pi_N^{\perp}(\eta_j) = 0$. Taking $s = O(n^{-1/4})$, the rate is $O(n^{-1/4})$.

Acknowledgments

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or the NIH.

References

[1] K. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316–342, 1991.
[2] N. Duan and K. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505–530, 1991.
[3] R. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328–332, 1991.
[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025–1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615–625, 2007.
[6] B. Scholkopf, A. J. Smola, and K. Muller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583–588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Scholkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590–1603, 2009.


[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441–458, pp. 415–446, 1909.
[13] H. Konig, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Basel, Switzerland, 1986.
[14] L. Ferre and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475–488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867–889, 1993.
[16] G. He, H.-G. Muller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54–77, 2003.
[17] L. Ferre and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665–683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169–4175, 2005.
[19] L. Ferre and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807–823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124–131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85–98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040–1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141–2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727–736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628–635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871–1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: Research Article Kernel Sliced Inverse Regression ...downloads.hindawi.com/journals/aaa/2013/540725.pdf · high-dimensional space [ ]. is conforms to the argument in [ ] that the

Abstract and Applied Analysis 5

This algorithm is termed as the Tikhonov regularizationFor linear SIR it is shown in [21] that the Tikhonov regular-ization is more efficient than the ridge regularization

Except for the computational stability regularization alsomakes the matrix forms of KSIR (27) and (28) interpretableby justifiable representer theorem

Proposition 8 For both ridge and the Tikhonov regulariza-tion scheme of KSIR the eigenfunctions 120573

119895are linear combina-

tions of 120601(119909119894) 119894 = 1 119899

The conclusion follows from the observation that 120573119895=

(1120582119895)(Γ120573119895minus Σ120573119895) for the ridge regularization and 120573

119895=

(1120582119895)(ΣΓ120573

119895minus Σ2120573119895) for the Tikhonov regularization

To close we remark that KSIR is computationally advan-tageous even for the case of linear models when 119901 ≫ 119899 dueto the fact that the eigendecomposition problem is for 119899 times 119899matrices rather than the 119901 times 119901 matrices in the standard SIRformulation

33 Consistency of Regularized KSIR In this subsection weprove the asymptotic consistency of the edr directionsestimated by regularized KSIR and provide conditions underwhich the rate of convergence is 119874

119901(119899minus14) An important

observation from the proof is that the rate of convergence ofthe edr directions depends on the contribution of the smallprincipal components The rate can be arbitrarily slow if theedr space depends heavily on eigenvectors corresponding tosmall eigenvalues of the covariance operator

Note that various consistency results are available forlinear SIR [22ndash24] These results hold only for the finitedimensional setting and cannot be adapted to KSIR wherethe RKHS is often infinite dimensional Consistency of func-tional SIR has also been studied before In [14] a thresholdingmethod is considered which selects a finite number ofeigenvectors and uses results fromfinite rank operatorsTheirproof of consistency requires stronger and more complicatedconditions than oursThe consistency for functional SIR withridge regularization is proven in [19] but it is of aweaker formthan our result We remark that the consistency results forfunctional SIR can be improved using a similar argument inthis paper

In the following we state the consistency results for theTikhonov regularization A similar result can be proved forthe ridge regularization while the details are omitted

Theorem 9 Assume E119896(119883119883)2 lt infin lim119899rarrinfin

119904(119899) = 0 andlim119899rarrinfin

119904radic119899 = infin then

10038161003816100381610038161003816⟨120573119895 120601 (sdot)⟩ minus ⟨120573119895 120601 (sdot)⟩

10038161003816100381610038161003816= 119900119901 (1) 119895 = 1 119889

Γ (31)

where 119889Γis the rank of Γ ⟨120573

119895 120601(sdot)⟩ is the projection onto the

119895th edr and ⟨120573119895 120601(sdot)⟩ is the projection onto the 119895th edr as

estimated by regularized KSIRIf the edr directions 120573

119895119889Γ

119895=1depend only on a finite

number of eigenvectors of the covariance operator Σ the rateof convergence is 119874(119899minus14)

This theorem is a direct corollary of the following theoremwhich is proven in Appendix C

Theorem 10 Define the projection operator and its comple-ment for each119873 ge 1

Π119873=

119873

sum

119894=1

119890119894otimes 119890119894 Π

perp

119873= 119868 minus Π

119873=

infin

sum

119894=119873+1

119890119894otimes 119890119894 (32)

where 119890119894infin

119894=1are the eigenvectors of the covariance operator Σ

as defined in (8) with the corresponding eigenvalues denotedby 119908119894

Assume E119896(119883119883)2 lt infin For each 119873 ge 1 the followingholds10038171003817100381710038171003817(Σ2+ 119904119868)minus1ΣΓ minus 119879

10038171003817100381710038171003817119867119878

= 119874119901(1

119904radic119899) +

119889Γ

sum

119895=1

(119904

1199082119873

10038171003817100381710038171003817Π119873(119895)10038171003817100381710038171003817+10038171003817100381710038171003817Πperp

119873(119895)10038171003817100381710038171003817)

(33)

where 119895= Σminus1119906119895and 119906

119895119889Γ

119894=1are the eigenvectors of Γ as

defined in (14) and sdot 119867119878

denotes the Hilbert-Schmidt normof a linear operator

If 119904 = 119904(119899) satisfy 119904 rarr 0 and 119904radic119899 rarr infin as 119899 rarr infin then10038171003817100381710038171003817(Σ2+ 119904119868)minus1ΣΓ minus 119879

10038171003817100381710038171003817119867119878= 119900119901 (1) (34)

4 Application to Simulated and Real Data

In this section we compare regularized kernel sliced inverseregression (RKSIR) with several other SIR-related dimensionreductionmethodsThe comparisons are used to address twoquestions (1) does regularization improve the performanceof kernel sliced inverse regression and (2)does the nonlinear-ity of kernel sliced inverse regression improve the predictionaccuracy

We would like to remark that the assessment of nonlineardimension reduction methods could be more difficult thanthat of linear ones When the feature mapping 120601 for an RKHSis not available we do not know the true edr directions orsubspace So in that case we will use the prediction accuracyto evaluate the goodness of RKSIR

41 Importance of Nonlinearity and Regularization Our firstexample illustrates that both the nonlinearity and regulariza-tion of RKSIR can significantly improve prediction accuracy

The regression model has ten predictor variables 119883 =

(1198831 119883

10) with each one following a normal distribution

119883119894sim 119873(0 1) A univariate response is given as

119884 = (sin (1198831) + sin (119883

2)) (1 + sin (119883

3)) + 120576 (35)

with noise 120576 sim 119873(0 012) We compare the effectiveness

of the linear dimension reduction methods SIR SAVE andPHD with RKSIR by examining the predictive accuracy of anonlinear kernel regression model on the reduced space Wegenerate 100 training samples and apply the abovemethods to

6 Abstract and Applied Analysis

minus6 minus5 minus4 minus3 minus2 minus1 0 1 202

03

04

05

06

07

08

09

1M

ean

squa

re er

ror

d = 1

d = 2

log10

(s)

(a)

1 2 3 4 5 6 7 8 9 100

02

04

06

08

1

12

14

Number of edr directions

Mea

n sq

uare

erro

r

RKSIRSIR

SAVEPHD

(b)

minus2 minus15 minus1 minus05 0 05 1 15 2minus2

minus15

minus1

minus05

0

05

1

15

2

KSIR variates for the first nonlinear edr direction

sin(X

1)+

sin(X

2)

(c)

2

18

16

14

12

1

08

06

04

02

0minus15 minus1 minus05 0 05 1 15

KSIR variates for the second nonlinear edr direction

1+

sin(X

3)

(d)

Figure 1 Dimension reduction for model (35) (a) mean square error for RKSIR with different regularization parameters (b) mean squareerror for various dimension reduction methods The blue line represents the mean square error without using dimension reduction (c)-(d)RKSIR variates versus nonlinear factors

compute the edr directions In RKSIR we used an additiveGaussian kernel as follows

119896 (119909 119911) =

119889

sum

119895=1

exp(minus(119909119895minus 119911119895)2

21205902) (36)

where 120590 = 2 After projecting the training samples on theestimated edr directions we train a Gaussian kernel regres-sion model based on the new variatesThen the mean squareerror is computed on 1000 independent test samples Thisexperiment is repeated 100 times with all parameters set bycross-validation The results are summarized in Figure 1

Figure 1(a) displays the accuracy for RKSIR as a functionof the regularization parameter illustrating the importance of

selecting regularization parameters KSIRwithout regulariza-tion performs much worse than the RKSIR Figure 1(b) dis-plays the prediction accuracy of various dimension reductionmethods

RKSIR outperforms all the linear dimension reductionmethods which illustrates the power of nonlinearity intro-duced in RKSIR It also suggests that there are essentially twononlinear edr directions This observation seems to agreewith the model in (35) Indeed Figures 1(c) and 1(d) showthat the first two edr directions from RKSIR estimate thetwo nonlinear factors well

42 Effect of Regularization This example illustrates theeffect of regularization on the performance of KSIR as afunction of the anisotropy of the predictors

Abstract and Applied Analysis 7

The regression model has ten predictor variables 119883 =

(1198831 119883

10) and a univariate response specified by

119884 = 1198831+ 1198832

2+ 120576 120576 sim 119873 (0 01

2) (37)

where 119883 sim 119873(0 Σ119883) and Σ

119883= 119876Δ119876 with 119876 a randomly

chosen orthogonal matrix and Δ = diag(1120579 2120579 10120579) Wewill see that increasing the parameter 120579 isin [0infin) increasesthe anisotropy of the data that increases the difficulty ofidentifying the correct edr directions

For thismodel it is known that SIRwillmiss the directionalong the second variable119883

2 So we focus on the comparison

of KSIR and RKSIR in this exampleIf we use a second-order polynomial kernel 119896(119909 119911) =

(1 + 119909119879119911)2 that corresponds to the feature space

Φ (119883) = 1119883119894 (119883119894119883119895)119894le119895 119894 119895 = 1 10 (38)

then 1198831+ 1198832

2can be captured in one edr direction Ideally

the first KSIR variate V = ⟨1205731 120601(119883)⟩ should be equivalent to

1198831+ 1198832

2modulo shift and scale

V minus EV prop 1198831+ 1198832

2minus E (119883

1+ 1198832

2) (39)

So for this example given estimates of KSIR variates at the 119899data points V

119894119899

119894=1= ⟨1205731 120601(119909119894)⟩119899

119894=1 the error of the first edr

direction can bemeasured by the least squares fitting of Vwithrespect to (119883

1+ 1198832

2)

error = min119886119887isinR

1

119899

119899

sum

119894=1

(V119894minus (119886 (119909

1198941+ 1199092

1198942) + 119887))

2

(40)

We drew 200 observations from the model specifiedin (37) and then we applied the two dimension reductionmethods KSIR and RKSIR The mean and standard errorsof 100 repetitions of this procedure are reported in Figure 2The result shows that KSIR becomes more andmore unstableas 120579 increases and the regularization helps to reduce thisinstability

43 Importance of Nonlinearity and Regularization in RealData When SIR is applied to classification problems it isequivalent to a Fisher discriminant analysis For the case ofmulticlass classification it is natural to use SIR and considereach class as a slice Kernel forms of Fisher discriminantanalysis (KFDA) [8] have been used to construct nonlineardiscriminant surfaces and the regularization has improvedperformance of KFDA [25] In this example we show thatthis idea of adding a nonlinearity and a regularization termimproves predictive accuracy in a realmulticlass classificationdata set the classification of handwritten digits

The MNIST data set (Y LeCun httpyannlecuncomexdbmnist) contains 60 000 images of handwritten digits0 1 2 9 as training data and 10 000 images as test dataEach image consists of 119901 = 28 times 28 = 784 gray-scalepixel intensities It is commonly believed that there is clearnonlinear structure in this 784-dimensional space

minus05 0 05 1 15 2 250

100

200

300

400

500

600

700

120579 parameter

Erro

r rat

e

KSIRRKSIR

Figure 2 Error in edr as a function of 120579

We compared regularized SIR (RSIR) as in (26) KSIRand RKSIR on this data to examine the effect of regulariza-tion and nonlinearity Each draw of the training set consistedof 100 observations of each digit We then computed thetop 10 edr directions using these 1000 observations and10 slices one for each digit We projected the 10 000 testobservations onto the edr directions and used a k-nearestneighbor (kNN) classifier with 119896 = 5 to classify the test dataThe accuracy of the kNN classifier without dimensionreduction was used as a baseline For KSIR and RKSIR weused a Gaussian kernel with the bandwith parameter setas the median pairwise distance between observations Theregularization parameter was set by cross-validation

The mean and standard deviation of the classificationaccuracy over 100 iterations of this procedure are reportedin Table 1 The first interesting observation is that the lin-ear dimension reduction does not capture discriminativeinformation as the classification accuracywithout dimensionreduction is better Nonlinearity does increase classifica-tion accuracy and coupling regularization with nonlinearityincreases accuracymoreThis improvement is dramatic for 23 5 and 8

5 Discussion

The interest in manifold learning and nonlinear dimensionreduction in both statistics and machine learning has led toa variety of statistical models and algorithms However mostof these methods are developed in the unsupervised learningframework Therefore the estimated dimensions may notbe optimal for the regression models Our work incorpo-rates nonlinearity and regularization to inverse regressionapproaches and results in a robust response driven nonlineardimension reduction method

RKHS has also been introduced into supervised dimen-sion reduction in [26] where the conditional covariance

8 Abstract and Applied Analysis

Table 1 Mean and standard deviations for error rates in classification of digits

Digit RKSIR KSIR RSIR kNN0 00273 (00089) 00472 (00191) 00487 (00128) 00291 (00071)1 00150 (00049) 00177 (00051) 00292 (00113) 00052 (00012)2 01039 (00207) 01475 (00497) 01921 (00238) 02008 (00186)3 00845 (00208) 01279 (00494) 01723 (00283) 01092 (00130)4 00784 (00240) 01044 (00461) 01327 (00327) 01617 (00213)5 00877 (00209) 01327 (00540) 02146 (00294) 01419 (00193)6 00472 (00108) 00804 (00383) 00816 (00172) 00446 (00081)7 00887 (00169) 01119 (00357) 01354 (00172) 01140 (00125)8 00981 (00259) 01490 (00699) 01981 (00286) 01140 (00156)9 00774 (00251) 01095 (00398) 01533 (00212) 02006 (00153)Average 00708 (00105) 01016 (00190) 01358 (00093) 01177 (00039)

operators on such kernel spaces are used to characterize theconditional independence between linear projections and theresponse variable Therefore their method estimates linearedr subspaces while in [10 11] and in this paper RKHS isused to model the edr subspace which leads to nonlineardimension reduction

There are several open issues in regularized kernel SIRmethod such as the selection of kernels regularizationparameters and number of dimensions A direct assessmentof the nonlinear edr directions is expected to reduce thecomputational burden in procedures based on cross valida-tion While these are well established in linear dimensionreduction however little is known for nonlinear dimensionreduction We would like to leave them for future research

There are some interesting connections between KSIRand functional SIR which are developed by Ferre and hiscoauthors in a series of papers [14 17 19] In functional SIRthe observable data are functions and the goal is to find linearedr directions for functional data analysis In KSIR theobservable data are typically not functions but mapped into afunction space in order to characterize nonlinear structuresThis suggests that computations involved in functional SIRcan be simplified by a parametrization with respect to anRKHS or using a linear kernel in the parametrized functionspace On the other hand from a theoretical point of viewKSIR can be viewed as a special case of functional SIRalthough our current theoretical results on KSIR are differentfrom the ones for functional SIR

Appendices

A Proof of Theorem 3

Under the assumption of Proposition 2 for each 119884 = 119910

E [120601 (119883) | 119884 = 119910] isin span Σ120573119894 119894 = 1 119889 (A1)

Since Γ = cov[E(120601(119883) | 119884)] the rank of Γ (ie thedimension of the image of Γ) is less than 119889 which impliesthat Γ is compact With the fact that it is symmetric andsemipositive there exist 119889

Γpositive eigenvalues 120591

119894119889Γ

119894=1and

eigenvectors 119906119894119889Γ

119894=1 such that Γ = sum119889Γ

119894=1120591119894119906119894otimes 119906119894

Recall that for any 119891 isinH

Γ119891 = E ⟨E [120601 (119883) | 119884] 119891⟩E [120601 (119883) | 119884] (A2)

also belongs to

span Σ120573119894 119894 = 1 119889 sub range (Σ) (A3)

because of (A1) so

119906119894=1

120591119894

Γ119906119894isin range (Σ) (A4)

This proves (i)Since for each 119891 isin H Γ119891 isin range (Σminus1) the operator

119879 = Σminus1Γ is well defined over the whole space Moreover

119879119891 = Σminus1(

119889Γ

sum

119894=1

⟨119906119894 119891⟩ 119906119894)

=

119889Γ

sum

119894=1

⟨119906119894 119891⟩ Σ

minus1(119906119894) = (

119889Γ

sum

119894=1

Σminus1(119906119894) otimes 119906119894)119891

(A5)

This proves (ii)

B Proof of Proposition 7

We first prove the proposition for matrices to simplify thennotation we then extend the result to the operators where119889119870is infinite and a matrix form does not make senseLet Φ = [120601(119909

1) 120601(119909

119899)] It has the following SVD

decomposition

Φ = 119880119863119881119879

= [1199061sdot sdot sdot 119906119889119870] [

119863119889times119889

0119889times(119899minus119889)

0(119889119870minus119889)times119889

0(119889119870minus119889)times(119899minus119889)

][[

[

V1198791

V119879119899

]]

]

= 119880119863119881119879

(B1)

where 119880 = [1199061 119906

119889] 119881 = [V

1 V

119889] and 119863 = 119863

119889times119889is a

diagonal matrix of dimension 119889 le 119899

Abstract and Applied Analysis 9

We need to show the KSIR variates

V119895= 119888119879

119895[119896 (119909 119909

1) 119896 (119909 119909

119899)]

= 119888119879

119895Φ119879Φ (119909) = ⟨Φ119888119895 120601 (119909)⟩ = ⟨120573119895 120601 (119909)⟩

(B2)

It suffices to prove that if (120582 119888) is a solution to (28) then (120582 120573)is also a pair of eigenvalue and eigenvector of (Σ2 + 120574119868)minus1ΣΓand vice versa where 119888 and 120573 are related by

120573 = Φ119888 119888 = 119881119863119880119879

120573 (B3)

Noting that facts Σ = (1119899)ΦΦ119879 Γ = (1119899)Φ119869Φ119879 and 119870 =Φ119879Φ = 119881119863

2

119881119879 the argument may be made as follows

ΣΓ120573 = 120582 (Σ2+ 120574119868) 120573

lArrrArr ΦΦ119879Φ119869Φ119879Φ119888 = 120582 (ΦΦ

119879ΦΦ119879Φ119888 + 119899

2120574Φ119888)

lArrrArr Φ119870119869119870119888 = 120582Φ (1198702+ 1198992120574) 119888

(lArrrArr 119881119881119879

119870119869119870119888 = 120582119881119881119879

(1198702+ 1198992120574119868) 119888)

lArrrArr 119870119869119870119888 = 120582 (1198702+ 1198992120574119868) 119888

(B4)

Note that the implication in the third step is necessary only intherArr direction which is obtained by multiplying both sides119881119863minus1

119880119879 and using the facts119880119879119880 = 119868

119889 For the last step since

119881119879

119881 = 119868119889 we use the following facts

119881119881119879

119870 = 119881119881119879

1198811198632

119881119879

= 1198811198632

119881119879

= 119870

119881119881119879

119888 = 119881119881119879

119881119863minus1

119880119879

120573 = 119881119863minus1

119880119879

120573 = 119888

(B5)

In order for this result to hold rigorously when the RKHSis infinite dimensional we need to formally define Φ Φ119879and the SVD of Φ when 119889

119870is infinite For the infinite

dimensional case Φ is an operator from R119899 to H119870defined

by ΦV = sum119899119894=1

V119894120601(119909119894) for V = (V

1 V

119899)119879isin R119899 and Φ119879

is its adjoint an operator from H119870to R119899 such that Φ119879119891 =

(⟨120601(1199091) 119891⟩119870 ⟨120601(119909

1) 119891⟩119870)119879 for 119891 isin H

119870 The notions 119880

and 119880119879 are similarly definedThe above formulation ofΦ andΦ119879 coincides the defini-

tion of Σ as a covariance operator Since the rank of Σ is lessthan 119899 it is compact and has the following representation

Σ =

119889119870

sum

119894=1

119894119906119894otimes 119906119894=

119889

sum

119894=1

120590119894119906119894otimes 119906119894 (B6)

where 119889 le 119899 is the rank and 1205901ge 1205902ge sdot sdot sdot ge 120590

119889gt 120590119889+1=

sdot sdot sdot = 0 This implies that each 120601(119909119894) lies in span(119906

1 119906

119889)

and hence we can write 120601(119909119894) = 119880120591

119894 where 119880 = (119906

1 119906

119889)

should be considered as an operator fromR119889 toH119870and 120591119894isin

R119889 Denote Υ = (1205911 120591

119899)119879isin R119899times119889 It is easy to check that

Υ119879Υ = diag(119899120590

1 119899120590

119889) Let119863

119889times119889= diag(radic1198991205901 radic119899120590119889)

and119881 = Υ119863minus1Then we obtain the SVD forΦ asΦ = 119880119863119881119879

which is well defined

C Proof of Consistency

C1 Preliminaries In order to prove Theorems 9 and 10we use the properties of the Hilbert-Schmidt operatorscovariance operators for the Hilbert space valued randomvariables and the perturbation theory for linear operators Inthis subsection we provide a brief introduction to them Fordetails see [27ndash29] and references therein

Given a separable Hilbert space H of dimension 119901Ha linear operator 119871 on H is said to belong to the Hilbert-Schmidt class if

1198712

HS =

119901H

sum

119894=1

100381710038171003817100381711987111989011989410038171003817100381710038172

Hlt infin (C1)

where 119890119894 is an orthonormal basisTheHilbert-Schmidt class

forms a new Hilbert space with norm sdot HSGiven a bounded operator 119878 onH the operators 119878119871 and

119871119878 both belong to theHilbert-Schmidt class and the followingholds

119878119871HS le 119878 119871HS 119871119878HS le 119871HS 119878 (C2)

where sdot denotes the default operator norm

1198712= sup119891isinH

100381710038171003817100381711987111989110038171003817100381710038172

100381710038171003817100381711989110038171003817100381710038172 (C3)

Let 119885 be a random vector taking values in H satisfyingE1198852lt infin The covariance operator

Σ = E [(119885 minus E119885) otimes (119885 minus E119885)] (C4)

is self-adjoint positive compact and belongs to the Hilbert-Schmidt class

A well-known result from perturbation theory for linearoperators states that if a set of linear operators 119879

119899converges

to 119879 in the Hilbert-Schmidt norm and the eigenvalues of 119879are nondegenerate then the eigenvalues and eigenvectors of119879119899converge to those of 119879 with same rate or convergence as

the convergence of the operators

C2 Proof of Theorem 10 We will use the following resultfrom [14]

10038171003817100381710038171003817Σ minus Σ

10038171003817100381710038171003817HS = 119874119901 (1

radic119899)

10038171003817100381710038171003817Γ minus Γ

10038171003817100381710038171003817HS = 119874119901 (1

radic119899)

(C5)

To simplify the notion we denote 119904= (Σ2+ 119904119868)minus1

ΣΓ Alsodefine

1198791= (Σ2+ 119904119868)minus1

ΣΓ 1198792= (Σ2+ 119904119868)minus1

ΣΓ (C6)

Then10038171003817100381710038171003817119904minus 11987910038171003817100381710038171003817HS

le10038171003817100381710038171003817119904minus 1198791

10038171003817100381710038171003817HS +10038171003817100381710038171198791 minus 1198792

1003817100381710038171003817HS +10038171003817100381710038171198792 minus 119879

1003817100381710038171003817HS

(C7)

10 Abstract and Applied Analysis

For the first term observe that10038171003817100381710038171003817119904minus 1198791

10038171003817100381710038171003817HSle100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1100381710038171003817100381710038171003817

10038171003817100381710038171003817ΣΓ minus ΣΓ

10038171003817100381710038171003817HS= 119874119901(1

119904radic119899)

(C8)For the second term note that

1198791=

119889Γ

sum

119895=1

120591119895((Σ2+ 119904119868)minus1

Σ119906119895) otimes 119906119895

1198792=

119889Γ

sum

119895=1

120591119895((Σ2+ 119904119868)minus1

119906119895) otimes 119906119895

(C9)

Therefore

10038171003817100381710038171198791 minus 11987921003817100381710038171003817HS =

119889Γ

sum

119895=1

120591119895

100381710038171003817100381710038171003817((Σ2+ 119904119868)minus1

minus (Σ2+ 119904119868)minus1

) Σ119906119895

100381710038171003817100381710038171003817

(C10)Since 119906

119895isin range(Σ) there exists

119895such that 119906

119895= Σ119895 Then

((Σ2+ 119904119868)minus1

minus (Σ2+ 119904119868)minus1

) Σ119906119895

= (Σ2+ 119904119868)minus1

(Σ2minus Σ2) (Σ2+ 119904119868)minus1

Σ2119895

(C11)

which implies

10038171003817100381710038171198791 minus 11987921003817100381710038171003817 le

119889Γ

sum

119895=1

120591119895

100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1100381710038171003817100381710038171003817

10038171003817100381710038171003817Σ2minus Σ210038171003817100381710038171003817HS

times100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1

Σ2100381710038171003817100381710038171003817

10038171003817100381710038171003817119895

10038171003817100381710038171003817

= 119874119901(1

119904radic119899)

(C12)

For the third term the following holds

10038171003817100381710038171198792 minus 11987910038171003817100381710038172

HS =

119889Γ

sum

119895=1

120591119895

100381710038171003817100381710038171003817((Σ2+ 119904119868)minus1

Σ minus Σminus1) 119906119895

100381710038171003817100381710038171003817

2

(C13)

and for each 119895 = 1 119889Γ

100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1

Σ119906119895minus Σminus1119906119895

100381710038171003817100381710038171003817

le100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1

Σ2119895minus 119895

100381710038171003817100381710038171003817

=

1003817100381710038171003817100381710038171003817100381710038171003817

infin

sum

119894=1

(1199082

119895

119904 + 1199082119895

minus 1)⟨119895 119890119894⟩ 119890119894

1003817100381710038171003817100381710038171003817100381710038171003817

= (

infin

sum

119894=1

1199042

(119904 + 1199082119894)2⟨119895 119890119894⟩2

)

12

le119904

119908119873

(

119873

sum

119894=1

⟨119895 119890119894⟩2

)

12

+ (

infin

sum

119894=119873+1

⟨119895 119890119894⟩2

)

12

=119904

1199082119873

10038171003817100381710038171003817Π119873(119895)10038171003817100381710038171003817+10038171003817100381710038171003817Πperp

119873(119895)10038171003817100381710038171003817

(C14)

Combining these terms results in (33)

Since Πperp119873(119895) rarr 0 as119873 rarr infin consequently we have

10038171003817100381710038171003817119904minus 11987910038171003817100381710038171003817HS= 119900119901 (1) (C15)

if 119904 rarr 0 and 119904radic119899 rarr infinIf all the edr directions 120573

119894depend only on a finite

number of eigenvectors of the covariance operator then thereexist some 119873 gt 1 such that Slowast = spanΣ119890

119894 119894 = 1 119873

This implies that

119895= Σminus1119906119895isin Σminus1(Slowast) sub span 119890

119894 119894 = 1 119873 (C16)

Therefore Πperp119873(119895) = 0 Let 119904 = 119874(119899minus14) the rate is119874(11989914)

Acknowledgments

The authors acknowledge the support of the National Sci-ence Foundation (DMS-0732276 and DMS-0732260) and theNational Institutes ofHealth (P50GM081883) Any opinionsfindings and conclusions or recommendations expressed inthis work are those of the authors and do not necessarilyreflect the views of the NSF or NIH

References

[1] K Li ldquoSliced inverse regression for dimension reduction (withdiscussion)rdquo Journal of the American Statistical Association vol86 no 414 pp 316ndash342 1991

[2] N Duan and K Li ldquoSlicing regression a link-free regressionmethodrdquoTheAnnals of Statistics vol 19 no 2 pp 505ndash530 1991

[3] R Cook and S Weisberg ldquoCommentrdquo Journal of the AmericanStatistical Association vol 86 no 414 pp 328ndash332 1991

[4] K C Li ldquoOn principal hessian directions for data visualizationand dimension reduction another application of steinrsquos lemmardquoJournal of the American Statistical Association vol 87 no 420pp 1025ndash1039 1992

[5] L Li R D Cook and C-L Tsai ldquoPartial inverse regressionrdquoBiometrika vol 94 no 3 pp 615ndash625 2007

[6] B Scholkopf A J Smola and K Muller ldquoKernel principalcomponent analysisrdquo in Proceedings of the Artificial NeuralNetworks (ICANN rsquo97) W Gerstner A Germond M Haslerand J-D Nicoud Eds vol 1327 of Lecture Notes in ComputerScience pp 583ndash588 Springer Lausanne Switzerland October1997

[7] B Scholkopf and A J Smola Learning with Kernels The MITPress Massachusetts Mass USA 2002

[8] G Baudat and F Anouar ldquoGeneralized discriminant analysisusing a kernel approachrdquo Neural Computation vol 12 no 10pp 2385ndash2404 2000

[9] F R Bach and M I Jordan ldquoKernel independent componentanalysisrdquo Journal of Machine Learning Research vol 3 no 1 pp1ndash48 2002

[10] H-M Wu ldquoKernel sliced inverse regression with applicationsto classificationrdquo Journal of Computational and Graphical Statis-tics vol 17 no 3 pp 590ndash610 2008

[11] Y-R Yeh S-Y Huang and Y-J Lee ldquoNonlinear dimensionreduction with kernel sliced inverse regressionrdquo IEEE Trans-actions on Knowledge and Data Engineering vol 21 no 11 pp1590ndash1603 2009

Abstract and Applied Analysis 11

[12] J Mercer ldquoFunctions of positive and negative type and theirconnection with the theory of integral equationsrdquo PhilosophicalTransactions of the Royal Society of London A vol 209 no 441ndash458 pp 415ndash446 1909

[13] H Konig Eigenvalue Distribution of Compact Operators vol16 of Operator Theory Advances and Applications CH BaselSwitzerland 1986

[14] L Ferre and A Yao ldquoFunctional sliced inverse regression anal-ysisrdquo Statistics vol 37 no 6 pp 475ndash488 2003

[15] P Hall and K-C Li ldquoOn almost linearity of low-dimensionalprojections from high-dimensional datardquo The Annals of Statis-tics vol 21 no 2 pp 867ndash889 1993

[16] G He H-G Muller and J-L Wang ldquoFunctional canonicalanalysis for square integrable stochastic processesrdquo Journal ofMultivariate Analysis vol 85 no 1 pp 54ndash77 2003

[17] L Ferre and A Yao ldquoSmoothed functional inverse regressionrdquoStatistica Sinica vol 15 no 3 pp 665ndash683 2005

[18] W Zhong P Zeng P Ma J S Liu and Y Zhu ldquoRSIR regular-ized sliced inverse regression for motif discoveryrdquo Bioinformat-ics vol 21 no 22 pp 4169ndash4175 2005

[19] L Ferre and N Villa ldquoMultilayer perceptron with functionalinputs an inverse regression approachrdquo Scandinavian Journalof Statistics vol 33 no 4 pp 807ndash823 2006

[20] L Li andX Yin ldquoSliced inverse regressionwith regularizationsrdquoBiometrics vol 64 no 1 pp 124ndash131 2008

[21] C Bernard-Michel L Gardes and S Girard ldquoGaussian regular-ized sliced inverse regressionrdquo Statistics and Computing vol 19no 1 pp 85ndash98 2009

[22] T Hsing and R J Carroll ldquoAn asymptotic theory for slicedinverse regressionrdquo The Annals of Statistics vol 20 no 2 pp1040ndash1061 1992

[23] J Saracco ldquoAn asymptotic theory for sliced inverse regressionrdquoCommunications in Statistics vol 26 no 9 pp 2141ndash2171 1997

[24] L X Zhu and K W Ng ldquoAsymptotics of sliced inverse regres-sionrdquo Statistica Sinica vol 5 no 2 pp 727ndash736 1995

[25] T Kurita and T Taguchi ldquoA kernel-based fisher discriminantanalysis for face detectionrdquo IEICE Transactions on Informationand Systems vol 88 no 3 pp 628ndash635 2005

[26] K Fukumizu F R Bach and M I Jordan ldquoKernel dimensionreduction in regressionrdquo The Annals of Statistics vol 37 no 4pp 1871ndash1905 2009

[27] T Kato Perturbation Theory for Linear Operators SpringerBerlin Germany 1966

[28] F Chatelin Spectral Approximation of Linear Operators Aca-demic Press 1983

[29] G Blanchard O Bousquet and L Zwald ldquoStatistical propertiesof kernel principal component analysisrdquoMachine Learning vol66 no 2-3 pp 259ndash294 2007

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article Kernel Sliced Inverse Regression ...downloads.hindawi.com/journals/aaa/2013/540725.pdf · high-dimensional space [ ]. is conforms to the argument in [ ] that the

6 Abstract and Applied Analysis

[Figure 1 appears here. Panels: (a) mean square error versus log10(s), with curves for d = 1 and d = 2; (b) mean square error versus the number of e.d.r. directions for RKSIR, SIR, SAVE, and PHD; (c) sin(X1) + sin(X2) versus the KSIR variate for the first nonlinear e.d.r. direction; (d) 1 + sin(X3) versus the KSIR variate for the second nonlinear e.d.r. direction.]

Figure 1: Dimension reduction for model (35). (a) Mean square error for RKSIR with different regularization parameters. (b) Mean square error for various dimension reduction methods; the blue line represents the mean square error without using dimension reduction. (c)-(d) RKSIR variates versus the nonlinear factors.

compute the e.d.r. directions. In RKSIR we used an additive Gaussian kernel:
$$k(x, z) = \sum_{j=1}^{d} \exp\Big(-\frac{(x_j - z_j)^2}{2\sigma^2}\Big), \qquad (36)$$
where $\sigma = 2$. After projecting the training samples on the estimated e.d.r. directions, we train a Gaussian kernel regression model based on the new variates. Then the mean square error is computed on 1000 independent test samples. This experiment is repeated 100 times with all parameters set by cross-validation. The results are summarized in Figure 1.
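To make (36) concrete, here is a minimal NumPy sketch of the additive Gaussian kernel; it follows the formula above with $\sigma = 2$ but is only an illustration, not the authors' implementation.

```python
import numpy as np

def additive_gaussian_kernel(X, Z, sigma=2.0):
    """Additive Gaussian kernel (36): k(x, z) = sum_j exp(-(x_j - z_j)^2 / (2 sigma^2))."""
    n, d = X.shape
    K = np.zeros((n, Z.shape[0]))
    for j in range(d):
        # One Gaussian kernel per coordinate, summed over coordinates.
        diff = X[:, j:j + 1] - Z[:, j:j + 1].T
        K += np.exp(-diff ** 2 / (2.0 * sigma ** 2))
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
print(additive_gaussian_kernel(X, X).shape)  # (5, 5)
```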

Figure 1(a) displays the accuracy for RKSIR as a function of the regularization parameter, illustrating the importance of selecting regularization parameters: KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of various dimension reduction methods.

RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear e.d.r. directions. This observation agrees with the model in (35); indeed, Figures 1(c) and 1(d) show that the first two e.d.r. directions from RKSIR estimate the two nonlinear factors well.

4.2. Effect of Regularization. This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.


The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$ and a univariate response specified by
$$Y = X_1 + X_2^2 + \varepsilon, \qquad \varepsilon \sim N(0, 0.1^2), \qquad (37)$$
where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q \Delta Q^T$, with $Q$ a randomly chosen orthogonal matrix and $\Delta = \mathrm{diag}(1^{\theta}, 2^{\theta}, \ldots, 10^{\theta})$. We will see that increasing the parameter $\theta \in [0, \infty)$ increases the anisotropy of the data, which increases the difficulty of identifying the correct e.d.r. directions.
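A short sketch of how one might simulate from model (37) is given below; the construction of the random orthogonal matrix $Q$ via a QR factorization is our assumption, since the paper does not specify how $Q$ was drawn.

```python
import numpy as np

def draw_model_37(n=200, theta=1.0, p=10, seed=0):
    """Draw n observations from Y = X1 + X2^2 + eps, eps ~ N(0, 0.1^2),
    with X ~ N(0, Q diag(1^theta, ..., p^theta) Q^T)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(p, p)))     # assumed construction of a random orthogonal Q
    Sigma_X = Q @ np.diag(np.arange(1, p + 1, dtype=float) ** theta) @ Q.T
    X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
    Y = X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)
    return X, Y

X, Y = draw_model_37(theta=2.0)
print(X.shape, Y.shape)  # (200, 10) (200,)
```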

For this model it is known that SIR will miss the direction along the second variable $X_2$. So we focus on the comparison of KSIR and RKSIR in this example.

If we use a second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$, which corresponds to the feature space
$$\Phi(X) = \big(1,\ X_i,\ (X_i X_j)_{i \le j}\big), \quad i, j = 1, \ldots, 10, \qquad (38)$$
then $X_1 + X_2^2$ can be captured in one e.d.r. direction. Ideally the first KSIR variate $v = \langle \beta_1, \phi(X) \rangle$ should be equivalent to $X_1 + X_2^2$ modulo shift and scale:
$$v - \mathbb{E}v \propto X_1 + X_2^2 - \mathbb{E}(X_1 + X_2^2). \qquad (39)$$
So for this example, given estimates of the KSIR variates at the $n$ data points, $\{v_i\}_{i=1}^{n} = \{\langle \beta_1, \phi(x_i) \rangle\}_{i=1}^{n}$, the error of the first e.d.r. direction can be measured by the least squares fit of $v$ with respect to $X_1 + X_2^2$:
$$\text{error} = \min_{a, b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} \Big( v_i - \big( a (x_{i1} + x_{i2}^2) + b \big) \Big)^2. \qquad (40)$$

We drew 200 observations from the model specified in (37) and then applied the two dimension reduction methods, KSIR and RKSIR. The mean and standard errors of 100 repetitions of this procedure are reported in Figure 2. The result shows that KSIR becomes more and more unstable as $\theta$ increases and that regularization helps to reduce this instability.

4.3. Importance of Nonlinearity and Regularization in Real Data. When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For multiclass classification it is natural to use SIR and consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example we show that this idea of adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set: the classification of handwritten digits.

The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist/) contains 60,000 images of handwritten digits 0, 1, 2, ..., 9 as training data and 10,000 images as test data. Each image consists of $p = 28 \times 28 = 784$ gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.

[Figure 2 appears here: error rate versus the parameter $\theta$ for KSIR and RKSIR.]

Figure 2: Error in e.d.r. as a function of $\theta$.

We compared regularized SIR (RSIR) as in (26), KSIR, and RKSIR on this data to examine the effect of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 e.d.r. directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the e.d.r. directions and used a k-nearest neighbor (kNN) classifier with $k = 5$ to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR we used a Gaussian kernel with the bandwidth parameter set as the median pairwise distance between observations. The regularization parameter was set by cross-validation.
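To make the protocol concrete, the sketch below illustrates the median-distance bandwidth heuristic and the $k = 5$ nearest neighbor step; the arrays `train_variates` and `test_variates` are placeholders for the projections onto the estimated e.d.r. directions, so this is an outline of the evaluation rather than the authors' code.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.neighbors import KNeighborsClassifier

def median_bandwidth(X):
    """Median pairwise Euclidean distance, used as the Gaussian kernel bandwidth."""
    return np.median(pdist(X))

def knn_error(train_variates, train_labels, test_variates, test_labels, k=5):
    """Classification error of a kNN classifier on the projected variates."""
    clf = KNeighborsClassifier(n_neighbors=k).fit(train_variates, train_labels)
    return 1.0 - clf.score(test_variates, test_labels)

# Usage (train_variates/test_variates are the hypothetical projections onto 10 directions):
# sigma = median_bandwidth(X_train)
# err = knn_error(train_variates, y_train, test_variates, y_test, k=5)
```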

The mean and standard deviation of the classification error rates over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that the linear dimension reduction does not capture discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy further. This improvement is dramatic for the digits 2, 3, 5, and 8.

Table 1: Mean and standard deviations for error rates in classification of digits.

Digit     RKSIR              KSIR               RSIR               kNN
0         0.0273 (0.0089)    0.0472 (0.0191)    0.0487 (0.0128)    0.0291 (0.0071)
1         0.0150 (0.0049)    0.0177 (0.0051)    0.0292 (0.0113)    0.0052 (0.0012)
2         0.1039 (0.0207)    0.1475 (0.0497)    0.1921 (0.0238)    0.2008 (0.0186)
3         0.0845 (0.0208)    0.1279 (0.0494)    0.1723 (0.0283)    0.1092 (0.0130)
4         0.0784 (0.0240)    0.1044 (0.0461)    0.1327 (0.0327)    0.1617 (0.0213)
5         0.0877 (0.0209)    0.1327 (0.0540)    0.2146 (0.0294)    0.1419 (0.0193)
6         0.0472 (0.0108)    0.0804 (0.0383)    0.0816 (0.0172)    0.0446 (0.0081)
7         0.0887 (0.0169)    0.1119 (0.0357)    0.1354 (0.0172)    0.1140 (0.0125)
8         0.0981 (0.0259)    0.1490 (0.0699)    0.1981 (0.0286)    0.1140 (0.0156)
9         0.0774 (0.0251)    0.1095 (0.0398)    0.1533 (0.0212)    0.2006 (0.0153)
Average   0.0708 (0.0105)    0.1016 (0.0190)    0.1358 (0.0093)    0.1177 (0.0039)

5. Discussion

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework, and therefore the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.

RKHS has also been introduced into supervised dimension reduction in [26], where conditional covariance operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore their method estimates linear e.d.r. subspaces, while in [10, 11] and in this paper the RKHS is used to model the e.d.r. subspace, which leads to nonlinear dimension reduction.

There are several open issues in the regularized kernel SIR method, such as the selection of kernels, regularization parameters, and the number of dimensions. A direct assessment of the nonlinear e.d.r. directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well understood in linear dimension reduction, little is known for nonlinear dimension reduction. We leave them for future research.

There are some interesting connections between KSIR and functional SIR, which was developed by Ferre and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear e.d.r. directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS, or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR are different from those for functional SIR.

Appendices

A. Proof of Theorem 3

Under the assumption of Proposition 2, for each $Y = y$,
$$\mathbb{E}[\phi(X) \mid Y = y] \in \mathrm{span}\{\Sigma\beta_i,\ i = 1, \ldots, d\}. \qquad (A.1)$$
Since $\Gamma = \mathrm{cov}[\mathbb{E}(\phi(X) \mid Y)]$, the rank of $\Gamma$ (i.e., the dimension of the image of $\Gamma$) is at most $d$, which implies that $\Gamma$ is compact. Since it is also symmetric and positive semidefinite, there exist $d_\Gamma$ positive eigenvalues $\{\tau_i\}_{i=1}^{d_\Gamma}$ and eigenvectors $\{u_i\}_{i=1}^{d_\Gamma}$ such that $\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i$.

Recall that for any $f \in \mathcal{H}$,
$$\Gamma f = \mathbb{E}\,\langle \mathbb{E}[\phi(X) \mid Y], f \rangle\, \mathbb{E}[\phi(X) \mid Y] \qquad (A.2)$$
also belongs to
$$\mathrm{span}\{\Sigma\beta_i,\ i = 1, \ldots, d\} \subset \mathrm{range}(\Sigma) \qquad (A.3)$$
because of (A.1), so
$$u_i = \frac{1}{\tau_i}\, \Gamma u_i \in \mathrm{range}(\Sigma). \qquad (A.4)$$
This proves (i).

Since $\Gamma f \in \mathrm{range}(\Sigma)$ for each $f \in \mathcal{H}$, the operator $T = \Sigma^{-1}\Gamma$ is well defined on the whole space. Moreover,
$$T f = \Sigma^{-1}\Big( \sum_{i=1}^{d_\Gamma} \langle u_i, f \rangle\, u_i \Big) = \sum_{i=1}^{d_\Gamma} \langle u_i, f \rangle\, \Sigma^{-1}(u_i) = \Big( \sum_{i=1}^{d_\Gamma} \Sigma^{-1}(u_i) \otimes u_i \Big) f. \qquad (A.5)$$
This proves (ii).

B. Proof of Proposition 7

We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, where $d_K$ is infinite and a matrix form does not make sense.

Let $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$. It has the following SVD decomposition:
$$\Phi = [u_1 \cdots u_{d_K}]
\begin{bmatrix} D_{d \times d} & 0_{d \times (n-d)} \\ 0_{(d_K - d) \times d} & 0_{(d_K - d) \times (n-d)} \end{bmatrix}
\begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix}
= U D V^T, \qquad (B.1)$$
where $U = [u_1, \ldots, u_d]$, $V = [v_1, \ldots, v_d]$, and $D = D_{d \times d}$ is a diagonal matrix of dimension $d \le n$.


We need to show that the KSIR variates satisfy
$$v_j = c_j^T [k(x, x_1), \ldots, k(x, x_n)]^T = c_j^T \Phi^T \phi(x) = \langle \Phi c_j, \phi(x) \rangle = \langle \beta_j, \phi(x) \rangle. \qquad (B.2)$$
It suffices to prove that if $(\lambda, c)$ is a solution to (28), then $(\lambda, \beta)$ is a pair of eigenvalue and eigenvector of $(\Sigma^2 + \gamma I)^{-1}\Sigma\Gamma$, and vice versa, where $c$ and $\beta$ are related by
$$\beta = \Phi c, \qquad c = V D^{-1} U^T \beta. \qquad (B.3)$$
Noting the facts $\Sigma = \frac{1}{n}\Phi\Phi^T$, $\Gamma = \frac{1}{n}\Phi J \Phi^T$, and $K = \Phi^T\Phi = V D^2 V^T$, the argument may be made as follows:
$$
\begin{aligned}
\Sigma\Gamma\beta = \lambda(\Sigma^2 + \gamma I)\beta
&\Longleftrightarrow \Phi\Phi^T \Phi J \Phi^T \Phi c = \lambda\,(\Phi\Phi^T\Phi\Phi^T\Phi c + n^2\gamma\,\Phi c) \\
&\Longleftrightarrow \Phi K J K c = \lambda\,\Phi\,(K^2 + n^2\gamma I)\,c \\
&\;(\Longleftrightarrow V V^T K J K c = \lambda\, V V^T (K^2 + n^2\gamma I)\, c) \\
&\Longleftrightarrow K J K c = \lambda\,(K^2 + n^2\gamma I)\, c.
\end{aligned} \qquad (B.4)
$$
Note that the implication in the third step is needed only in the $\Rightarrow$ direction; it is obtained by multiplying both sides by $V D^{-1} U^T$ and using the fact $U^T U = I_d$. For the last step, since $V^T V = I_d$, we use the following facts:
$$V V^T K = V V^T V D^2 V^T = V D^2 V^T = K, \qquad V V^T c = V V^T V D^{-1} U^T \beta = V D^{-1} U^T \beta = c. \qquad (B.5)$$

In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. In the infinite dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^{n} v_i \phi(x_i)$ for $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$ such that $\Phi^T f = (\langle \phi(x_1), f \rangle_K, \ldots, \langle \phi(x_n), f \rangle_K)^T$ for $f \in \mathcal{H}_K$. The notions $U$ and $U^T$ are similarly defined.

The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\Sigma$ as a covariance operator. Since the rank of $\Sigma$ is less than $n$, it is compact and has the representation
$$\Sigma = \sum_{i=1}^{d_K} \sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d} \sigma_i\, u_i \otimes u_i, \qquad (B.6)$$
where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\mathrm{span}(u_1, \ldots, u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1, \ldots, u_d)$ should be considered as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i \in \mathbb{R}^d$. Denote $\Upsilon = (\tau_1, \ldots, \tau_n)^T \in \mathbb{R}^{n \times d}$. It is easy to check that $\Upsilon^T\Upsilon = \mathrm{diag}(n\sigma_1, \ldots, n\sigma_d)$. Let $D_{d \times d} = \mathrm{diag}(\sqrt{n\sigma_1}, \ldots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD for $\Phi$ as $\Phi = U D V^T$, which is well defined.
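Proposition 7 is what makes RKSIR computable: it reduces the operator eigenproblem to the finite generalized eigenproblem in (B.4). The sketch below solves that eigenproblem with standard linear algebra routines; the slice-averaging matrix $J$ and the centering of the Gram matrix $K$ are assumptions about details fixed earlier in the paper (around (28)), so this is an illustration rather than the authors' code. By (B.2), the $j$-th variate at a new point $x$ is then $\sum_i c_{ji}\, k(x, x_i)$.

```python
import numpy as np
from scipy.linalg import eigh

def rksir_coefficients(K, slices, gamma, d):
    """Solve K J K c = lambda (K^2 + n^2 gamma I) c, cf. (B.4), for the top d eigenvectors.

    K      : (n, n) Gram matrix (assumed centered as required by the method).
    slices : length-n array of slice labels.
    gamma  : regularization parameter.
    """
    n = K.shape[0]
    J = np.zeros((n, n))
    for h in np.unique(slices):
        idx = np.where(slices == h)[0]
        J[np.ix_(idx, idx)] = 1.0 / len(idx)          # assumed slice-averaging matrix
    A = K @ J @ K
    B = K @ K + n ** 2 * gamma * np.eye(n)
    vals, vecs = eigh(A, B)                            # symmetric generalized eigenproblem
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:d]], vals[order[:d]]

# Usage: C, lam = rksir_coefficients(K, slice_labels, gamma=1e-3, d=2)
```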

C. Proof of Consistency

C.1. Preliminaries. In order to prove Theorems 9 and 10, we use properties of Hilbert-Schmidt operators, covariance operators for Hilbert space valued random variables, and perturbation theory for linear operators. In this subsection we provide a brief introduction; for details see [27–29] and references therein.

Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if
$$\|L\|_{\mathrm{HS}}^2 = \sum_{i=1}^{p_{\mathcal{H}}} \|L e_i\|_{\mathcal{H}}^2 < \infty, \qquad (C.1)$$
where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{\mathrm{HS}}$. Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:
$$\|SL\|_{\mathrm{HS}} \le \|S\|\,\|L\|_{\mathrm{HS}}, \qquad \|LS\|_{\mathrm{HS}} \le \|L\|_{\mathrm{HS}}\,\|S\|, \qquad (C.2)$$
where $\|\cdot\|$ denotes the usual operator norm
$$\|L\|^2 = \sup_{f \in \mathcal{H}} \frac{\|Lf\|^2}{\|f\|^2}. \qquad (C.3)$$
Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^2 < \infty$. The covariance operator
$$\Sigma = \mathbb{E}[(Z - \mathbb{E}Z) \otimes (Z - \mathbb{E}Z)] \qquad (C.4)$$
is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.

A well-known result from perturbation theory for linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ at the same rate as the convergence of the operators.
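In a finite-dimensional space the Hilbert-Schmidt norm coincides with the Frobenius norm and the covariance operator (C.4) is the ordinary covariance matrix; the small NumPy check below, included only as an illustration, verifies the inequalities in (C.2) numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(6, 6))
S = rng.normal(size=(6, 6))

hs = lambda A: np.linalg.norm(A, "fro")   # Hilbert-Schmidt norm = Frobenius norm
op = lambda A: np.linalg.norm(A, 2)       # operator (spectral) norm

# (C.2): ||SL||_HS <= ||S|| ||L||_HS and ||LS||_HS <= ||L||_HS ||S||.
assert hs(S @ L) <= op(S) * hs(L) + 1e-12
assert hs(L @ S) <= hs(L) * op(S) + 1e-12

# (C.4): empirical covariance operator of H-valued samples Z_1, ..., Z_n.
Z = rng.normal(size=(100, 6))
Zc = Z - Z.mean(axis=0)
Sigma = Zc.T @ Zc / len(Z)
print(hs(Sigma))
```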

C.2. Proof of Theorem 10. We will use the following result from [14]:
$$\|\hat{\Sigma} - \Sigma\|_{\mathrm{HS}} = O_p\Big(\frac{1}{\sqrt{n}}\Big), \qquad \|\hat{\Gamma} - \Gamma\|_{\mathrm{HS}} = O_p\Big(\frac{1}{\sqrt{n}}\Big). \qquad (C.5)$$
To simplify the notation we denote $\hat{T}_s = (\hat{\Sigma}^2 + sI)^{-1}\hat{\Sigma}\hat{\Gamma}$. Also define
$$T_1 = (\hat{\Sigma}^2 + sI)^{-1}\Sigma\Gamma, \qquad T_2 = (\Sigma^2 + sI)^{-1}\Sigma\Gamma. \qquad (C.6)$$
Then
$$\|\hat{T}_s - T\|_{\mathrm{HS}} \le \|\hat{T}_s - T_1\|_{\mathrm{HS}} + \|T_1 - T_2\|_{\mathrm{HS}} + \|T_2 - T\|_{\mathrm{HS}}. \qquad (C.7)$$
For the first term, observe that
$$\|\hat{T}_s - T_1\|_{\mathrm{HS}} \le \big\|(\hat{\Sigma}^2 + sI)^{-1}\big\|\, \|\hat{\Sigma}\hat{\Gamma} - \Sigma\Gamma\|_{\mathrm{HS}} = O_p\Big(\frac{1}{s\sqrt{n}}\Big). \qquad (C.8)$$
For the second term, note that
$$T_1 = \sum_{j=1}^{d_\Gamma} \tau_j \big((\hat{\Sigma}^2 + sI)^{-1}\Sigma u_j\big) \otimes u_j, \qquad T_2 = \sum_{j=1}^{d_\Gamma} \tau_j \big((\Sigma^2 + sI)^{-1}\Sigma u_j\big) \otimes u_j. \qquad (C.9)$$
Therefore
$$\|T_1 - T_2\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma} \tau_j \big\|\big((\hat{\Sigma}^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\big)\Sigma u_j\big\|. \qquad (C.10)$$
Since $u_j \in \mathrm{range}(\Sigma)$, there exists $\omega_j$ such that $u_j = \Sigma\omega_j$. Then
$$\big((\hat{\Sigma}^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\big)\Sigma u_j = (\hat{\Sigma}^2 + sI)^{-1}\big(\Sigma^2 - \hat{\Sigma}^2\big)(\Sigma^2 + sI)^{-1}\Sigma^2 \omega_j, \qquad (C.11)$$
which implies
$$\|T_1 - T_2\| \le \sum_{j=1}^{d_\Gamma} \tau_j \big\|(\hat{\Sigma}^2 + sI)^{-1}\big\|\, \big\|\hat{\Sigma}^2 - \Sigma^2\big\|_{\mathrm{HS}}\, \big\|(\Sigma^2 + sI)^{-1}\Sigma^2\big\|\, \|\omega_j\| = O_p\Big(\frac{1}{s\sqrt{n}}\Big). \qquad (C.12)$$
For the third term, the following holds:
$$\|T_2 - T\|_{\mathrm{HS}}^2 = \sum_{j=1}^{d_\Gamma} \tau_j^2 \big\|\big((\Sigma^2 + sI)^{-1}\Sigma - \Sigma^{-1}\big) u_j\big\|^2, \qquad (C.13)$$
and for each $j = 1, \ldots, d_\Gamma$ (writing $\{e_i\}$ for the eigenvectors of $\Sigma$ with eigenvalues $\{w_i\}$),
$$
\begin{aligned}
\big\|(\Sigma^2 + sI)^{-1}\Sigma u_j - \Sigma^{-1} u_j\big\|
&\le \big\|(\Sigma^2 + sI)^{-1}\Sigma^2 \omega_j - \omega_j\big\| \\
&= \Big\| \sum_{i=1}^{\infty} \Big( \frac{w_i^2}{s + w_i^2} - 1 \Big) \langle \omega_j, e_i \rangle\, e_i \Big\| \\
&= \Big( \sum_{i=1}^{\infty} \frac{s^2}{(s + w_i^2)^2} \langle \omega_j, e_i \rangle^2 \Big)^{1/2} \\
&\le \frac{s}{w_N^2} \Big( \sum_{i=1}^{N} \langle \omega_j, e_i \rangle^2 \Big)^{1/2} + \Big( \sum_{i=N+1}^{\infty} \langle \omega_j, e_i \rangle^2 \Big)^{1/2} \\
&= \frac{s}{w_N^2}\, \|\Pi_N(\omega_j)\| + \|\Pi_N^{\perp}(\omega_j)\|.
\end{aligned} \qquad (C.14)
$$
Combining these terms results in (33).

Since $\|\Pi_N^{\perp}(\omega_j)\| \to 0$ as $N \to \infty$, we consequently have
$$\|\hat{T}_s - T\|_{\mathrm{HS}} = o_p(1) \qquad (C.15)$$
if $s \to 0$ and $s\sqrt{n} \to \infty$.

If all the e.d.r. directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N > 1$ such that $\mathcal{S}_* \subset \mathrm{span}\{\Sigma e_i,\ i = 1, \ldots, N\}$. This implies that
$$\omega_j = \Sigma^{-1} u_j \in \Sigma^{-1}(\mathcal{S}_*) \subset \mathrm{span}\{e_i,\ i = 1, \ldots, N\}. \qquad (C.16)$$
Therefore $\Pi_N^{\perp}(\omega_j) = 0$. Taking $s = O(n^{-1/4})$, the rate is $O(n^{-1/4})$.

Acknowledgments

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50 GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.

References

[1] K. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316–342, 1991.
[2] N. Duan and K. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505–530, 1991.
[3] R. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328–332, 1991.
[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025–1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615–625, 2007.
[6] B. Scholkopf, A. J. Smola, and K. Muller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583–588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Scholkopf and A. J. Smola, Learning with Kernels, The MIT Press, Massachusetts, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590–1603, 2009.
[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441–458, pp. 415–446, 1909.
[13] H. Konig, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Basel, Switzerland, 1986.
[14] L. Ferre and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475–488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867–889, 1993.
[16] G. He, H.-G. Muller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54–77, 2003.
[17] L. Ferre and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665–683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169–4175, 2005.
[19] L. Ferre and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807–823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124–131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85–98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040–1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141–2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727–736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628–635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871–1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article Kernel Sliced Inverse Regression ...downloads.hindawi.com/journals/aaa/2013/540725.pdf · high-dimensional space [ ]. is conforms to the argument in [ ] that the

Abstract and Applied Analysis 7

The regression model has ten predictor variables 119883 =

(1198831 119883

10) and a univariate response specified by

119884 = 1198831+ 1198832

2+ 120576 120576 sim 119873 (0 01

2) (37)

where 119883 sim 119873(0 Σ119883) and Σ

119883= 119876Δ119876 with 119876 a randomly

chosen orthogonal matrix and Δ = diag(1120579 2120579 10120579) Wewill see that increasing the parameter 120579 isin [0infin) increasesthe anisotropy of the data that increases the difficulty ofidentifying the correct edr directions

For thismodel it is known that SIRwillmiss the directionalong the second variable119883

2 So we focus on the comparison

of KSIR and RKSIR in this exampleIf we use a second-order polynomial kernel 119896(119909 119911) =

(1 + 119909119879119911)2 that corresponds to the feature space

Φ (119883) = 1119883119894 (119883119894119883119895)119894le119895 119894 119895 = 1 10 (38)

then 1198831+ 1198832

2can be captured in one edr direction Ideally

the first KSIR variate V = ⟨1205731 120601(119883)⟩ should be equivalent to

1198831+ 1198832

2modulo shift and scale

V minus EV prop 1198831+ 1198832

2minus E (119883

1+ 1198832

2) (39)

So for this example given estimates of KSIR variates at the 119899data points V

119894119899

119894=1= ⟨1205731 120601(119909119894)⟩119899

119894=1 the error of the first edr

direction can bemeasured by the least squares fitting of Vwithrespect to (119883

1+ 1198832

2)

error = min119886119887isinR

1

119899

119899

sum

119894=1

(V119894minus (119886 (119909

1198941+ 1199092

1198942) + 119887))

2

(40)

We drew 200 observations from the model specifiedin (37) and then we applied the two dimension reductionmethods KSIR and RKSIR The mean and standard errorsof 100 repetitions of this procedure are reported in Figure 2The result shows that KSIR becomes more andmore unstableas 120579 increases and the regularization helps to reduce thisinstability

43 Importance of Nonlinearity and Regularization in RealData When SIR is applied to classification problems it isequivalent to a Fisher discriminant analysis For the case ofmulticlass classification it is natural to use SIR and considereach class as a slice Kernel forms of Fisher discriminantanalysis (KFDA) [8] have been used to construct nonlineardiscriminant surfaces and the regularization has improvedperformance of KFDA [25] In this example we show thatthis idea of adding a nonlinearity and a regularization termimproves predictive accuracy in a realmulticlass classificationdata set the classification of handwritten digits

The MNIST data set (Y LeCun httpyannlecuncomexdbmnist) contains 60 000 images of handwritten digits0 1 2 9 as training data and 10 000 images as test dataEach image consists of 119901 = 28 times 28 = 784 gray-scalepixel intensities It is commonly believed that there is clearnonlinear structure in this 784-dimensional space

minus05 0 05 1 15 2 250

100

200

300

400

500

600

700

120579 parameter

Erro

r rat

e

KSIRRKSIR

Figure 2 Error in edr as a function of 120579

We compared regularized SIR (RSIR) as in (26) KSIRand RKSIR on this data to examine the effect of regulariza-tion and nonlinearity Each draw of the training set consistedof 100 observations of each digit We then computed thetop 10 edr directions using these 1000 observations and10 slices one for each digit We projected the 10 000 testobservations onto the edr directions and used a k-nearestneighbor (kNN) classifier with 119896 = 5 to classify the test dataThe accuracy of the kNN classifier without dimensionreduction was used as a baseline For KSIR and RKSIR weused a Gaussian kernel with the bandwith parameter setas the median pairwise distance between observations Theregularization parameter was set by cross-validation

The mean and standard deviation of the classificationaccuracy over 100 iterations of this procedure are reportedin Table 1 The first interesting observation is that the lin-ear dimension reduction does not capture discriminativeinformation as the classification accuracywithout dimensionreduction is better Nonlinearity does increase classifica-tion accuracy and coupling regularization with nonlinearityincreases accuracymoreThis improvement is dramatic for 23 5 and 8

5 Discussion

The interest in manifold learning and nonlinear dimensionreduction in both statistics and machine learning has led toa variety of statistical models and algorithms However mostof these methods are developed in the unsupervised learningframework Therefore the estimated dimensions may notbe optimal for the regression models Our work incorpo-rates nonlinearity and regularization to inverse regressionapproaches and results in a robust response driven nonlineardimension reduction method

RKHS has also been introduced into supervised dimen-sion reduction in [26] where the conditional covariance

8 Abstract and Applied Analysis

Table 1 Mean and standard deviations for error rates in classification of digits

Digit RKSIR KSIR RSIR kNN0 00273 (00089) 00472 (00191) 00487 (00128) 00291 (00071)1 00150 (00049) 00177 (00051) 00292 (00113) 00052 (00012)2 01039 (00207) 01475 (00497) 01921 (00238) 02008 (00186)3 00845 (00208) 01279 (00494) 01723 (00283) 01092 (00130)4 00784 (00240) 01044 (00461) 01327 (00327) 01617 (00213)5 00877 (00209) 01327 (00540) 02146 (00294) 01419 (00193)6 00472 (00108) 00804 (00383) 00816 (00172) 00446 (00081)7 00887 (00169) 01119 (00357) 01354 (00172) 01140 (00125)8 00981 (00259) 01490 (00699) 01981 (00286) 01140 (00156)9 00774 (00251) 01095 (00398) 01533 (00212) 02006 (00153)Average 00708 (00105) 01016 (00190) 01358 (00093) 01177 (00039)

operators on such kernel spaces are used to characterize theconditional independence between linear projections and theresponse variable Therefore their method estimates linearedr subspaces while in [10 11] and in this paper RKHS isused to model the edr subspace which leads to nonlineardimension reduction

There are several open issues in regularized kernel SIRmethod such as the selection of kernels regularizationparameters and number of dimensions A direct assessmentof the nonlinear edr directions is expected to reduce thecomputational burden in procedures based on cross valida-tion While these are well established in linear dimensionreduction however little is known for nonlinear dimensionreduction We would like to leave them for future research

There are some interesting connections between KSIRand functional SIR which are developed by Ferre and hiscoauthors in a series of papers [14 17 19] In functional SIRthe observable data are functions and the goal is to find linearedr directions for functional data analysis In KSIR theobservable data are typically not functions but mapped into afunction space in order to characterize nonlinear structuresThis suggests that computations involved in functional SIRcan be simplified by a parametrization with respect to anRKHS or using a linear kernel in the parametrized functionspace On the other hand from a theoretical point of viewKSIR can be viewed as a special case of functional SIRalthough our current theoretical results on KSIR are differentfrom the ones for functional SIR

Appendices

A Proof of Theorem 3

Under the assumption of Proposition 2 for each 119884 = 119910

E [120601 (119883) | 119884 = 119910] isin span Σ120573119894 119894 = 1 119889 (A1)

Since Γ = cov[E(120601(119883) | 119884)] the rank of Γ (ie thedimension of the image of Γ) is less than 119889 which impliesthat Γ is compact With the fact that it is symmetric andsemipositive there exist 119889

Γpositive eigenvalues 120591

119894119889Γ

119894=1and

eigenvectors 119906119894119889Γ

119894=1 such that Γ = sum119889Γ

119894=1120591119894119906119894otimes 119906119894

Recall that for any 119891 isinH

Γ119891 = E ⟨E [120601 (119883) | 119884] 119891⟩E [120601 (119883) | 119884] (A2)

also belongs to

span Σ120573119894 119894 = 1 119889 sub range (Σ) (A3)

because of (A1) so

119906119894=1

120591119894

Γ119906119894isin range (Σ) (A4)

This proves (i)Since for each 119891 isin H Γ119891 isin range (Σminus1) the operator

119879 = Σminus1Γ is well defined over the whole space Moreover

119879119891 = Σminus1(

119889Γ

sum

119894=1

⟨119906119894 119891⟩ 119906119894)

=

119889Γ

sum

119894=1

⟨119906119894 119891⟩ Σ

minus1(119906119894) = (

119889Γ

sum

119894=1

Σminus1(119906119894) otimes 119906119894)119891

(A5)

This proves (ii)

B Proof of Proposition 7

We first prove the proposition for matrices to simplify thennotation we then extend the result to the operators where119889119870is infinite and a matrix form does not make senseLet Φ = [120601(119909

1) 120601(119909

119899)] It has the following SVD

decomposition

Φ = 119880119863119881119879

= [1199061sdot sdot sdot 119906119889119870] [

119863119889times119889

0119889times(119899minus119889)

0(119889119870minus119889)times119889

0(119889119870minus119889)times(119899minus119889)

][[

[

V1198791

V119879119899

]]

]

= 119880119863119881119879

(B1)

where 119880 = [1199061 119906

119889] 119881 = [V

1 V

119889] and 119863 = 119863

119889times119889is a

diagonal matrix of dimension 119889 le 119899

Abstract and Applied Analysis 9

We need to show the KSIR variates

V119895= 119888119879

119895[119896 (119909 119909

1) 119896 (119909 119909

119899)]

= 119888119879

119895Φ119879Φ (119909) = ⟨Φ119888119895 120601 (119909)⟩ = ⟨120573119895 120601 (119909)⟩

(B2)

It suffices to prove that if (120582 119888) is a solution to (28) then (120582 120573)is also a pair of eigenvalue and eigenvector of (Σ2 + 120574119868)minus1ΣΓand vice versa where 119888 and 120573 are related by

120573 = Φ119888 119888 = 119881119863119880119879

120573 (B3)

Noting that facts Σ = (1119899)ΦΦ119879 Γ = (1119899)Φ119869Φ119879 and 119870 =Φ119879Φ = 119881119863

2

119881119879 the argument may be made as follows

ΣΓ120573 = 120582 (Σ2+ 120574119868) 120573

lArrrArr ΦΦ119879Φ119869Φ119879Φ119888 = 120582 (ΦΦ

119879ΦΦ119879Φ119888 + 119899

2120574Φ119888)

lArrrArr Φ119870119869119870119888 = 120582Φ (1198702+ 1198992120574) 119888

(lArrrArr 119881119881119879

119870119869119870119888 = 120582119881119881119879

(1198702+ 1198992120574119868) 119888)

lArrrArr 119870119869119870119888 = 120582 (1198702+ 1198992120574119868) 119888

(B4)

Note that the implication in the third step is necessary only intherArr direction which is obtained by multiplying both sides119881119863minus1

119880119879 and using the facts119880119879119880 = 119868

119889 For the last step since

119881119879

119881 = 119868119889 we use the following facts

119881119881119879

119870 = 119881119881119879

1198811198632

119881119879

= 1198811198632

119881119879

= 119870

119881119881119879

119888 = 119881119881119879

119881119863minus1

119880119879

120573 = 119881119863minus1

119880119879

120573 = 119888

(B5)

In order for this result to hold rigorously when the RKHSis infinite dimensional we need to formally define Φ Φ119879and the SVD of Φ when 119889

119870is infinite For the infinite

dimensional case Φ is an operator from R119899 to H119870defined

by ΦV = sum119899119894=1

V119894120601(119909119894) for V = (V

1 V

119899)119879isin R119899 and Φ119879

is its adjoint an operator from H119870to R119899 such that Φ119879119891 =

(⟨120601(1199091) 119891⟩119870 ⟨120601(119909

1) 119891⟩119870)119879 for 119891 isin H

119870 The notions 119880

and 119880119879 are similarly definedThe above formulation ofΦ andΦ119879 coincides the defini-

tion of Σ as a covariance operator Since the rank of Σ is lessthan 119899 it is compact and has the following representation

Σ =

119889119870

sum

119894=1

119894119906119894otimes 119906119894=

119889

sum

119894=1

120590119894119906119894otimes 119906119894 (B6)

where 119889 le 119899 is the rank and 1205901ge 1205902ge sdot sdot sdot ge 120590

119889gt 120590119889+1=

sdot sdot sdot = 0 This implies that each 120601(119909119894) lies in span(119906

1 119906

119889)

and hence we can write 120601(119909119894) = 119880120591

119894 where 119880 = (119906

1 119906

119889)

should be considered as an operator fromR119889 toH119870and 120591119894isin

R119889 Denote Υ = (1205911 120591

119899)119879isin R119899times119889 It is easy to check that

Υ119879Υ = diag(119899120590

1 119899120590

119889) Let119863

119889times119889= diag(radic1198991205901 radic119899120590119889)

and119881 = Υ119863minus1Then we obtain the SVD forΦ asΦ = 119880119863119881119879

which is well defined

C Proof of Consistency

C1 Preliminaries In order to prove Theorems 9 and 10we use the properties of the Hilbert-Schmidt operatorscovariance operators for the Hilbert space valued randomvariables and the perturbation theory for linear operators Inthis subsection we provide a brief introduction to them Fordetails see [27ndash29] and references therein

Given a separable Hilbert space H of dimension 119901Ha linear operator 119871 on H is said to belong to the Hilbert-Schmidt class if

1198712

HS =

119901H

sum

119894=1

100381710038171003817100381711987111989011989410038171003817100381710038172

Hlt infin (C1)

where 119890119894 is an orthonormal basisTheHilbert-Schmidt class

forms a new Hilbert space with norm sdot HSGiven a bounded operator 119878 onH the operators 119878119871 and

119871119878 both belong to theHilbert-Schmidt class and the followingholds

119878119871HS le 119878 119871HS 119871119878HS le 119871HS 119878 (C2)

where sdot denotes the default operator norm

1198712= sup119891isinH

100381710038171003817100381711987111989110038171003817100381710038172

100381710038171003817100381711989110038171003817100381710038172 (C3)

Let 119885 be a random vector taking values in H satisfyingE1198852lt infin The covariance operator

Σ = E [(119885 minus E119885) otimes (119885 minus E119885)] (C4)

is self-adjoint positive compact and belongs to the Hilbert-Schmidt class

A well-known result from perturbation theory for linearoperators states that if a set of linear operators 119879

119899converges

to 119879 in the Hilbert-Schmidt norm and the eigenvalues of 119879are nondegenerate then the eigenvalues and eigenvectors of119879119899converge to those of 119879 with same rate or convergence as

the convergence of the operators

C2 Proof of Theorem 10 We will use the following resultfrom [14]

10038171003817100381710038171003817Σ minus Σ

10038171003817100381710038171003817HS = 119874119901 (1

radic119899)

10038171003817100381710038171003817Γ minus Γ

10038171003817100381710038171003817HS = 119874119901 (1

radic119899)

(C5)

To simplify the notion we denote 119904= (Σ2+ 119904119868)minus1

ΣΓ Alsodefine

1198791= (Σ2+ 119904119868)minus1

ΣΓ 1198792= (Σ2+ 119904119868)minus1

ΣΓ (C6)

Then10038171003817100381710038171003817119904minus 11987910038171003817100381710038171003817HS

le10038171003817100381710038171003817119904minus 1198791

10038171003817100381710038171003817HS +10038171003817100381710038171198791 minus 1198792

1003817100381710038171003817HS +10038171003817100381710038171198792 minus 119879

1003817100381710038171003817HS

(C7)

10 Abstract and Applied Analysis

For the first term observe that10038171003817100381710038171003817119904minus 1198791

10038171003817100381710038171003817HSle100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1100381710038171003817100381710038171003817

10038171003817100381710038171003817ΣΓ minus ΣΓ

10038171003817100381710038171003817HS= 119874119901(1

119904radic119899)

(C8)For the second term note that

1198791=

119889Γ

sum

119895=1

120591119895((Σ2+ 119904119868)minus1

Σ119906119895) otimes 119906119895

1198792=

119889Γ

sum

119895=1

120591119895((Σ2+ 119904119868)minus1

119906119895) otimes 119906119895

(C9)

Therefore

10038171003817100381710038171198791 minus 11987921003817100381710038171003817HS =

119889Γ

sum

119895=1

120591119895

100381710038171003817100381710038171003817((Σ2+ 119904119868)minus1

minus (Σ2+ 119904119868)minus1

) Σ119906119895

100381710038171003817100381710038171003817

(C10)Since 119906

119895isin range(Σ) there exists

119895such that 119906

119895= Σ119895 Then

((Σ2+ 119904119868)minus1

minus (Σ2+ 119904119868)minus1

) Σ119906119895

= (Σ2+ 119904119868)minus1

(Σ2minus Σ2) (Σ2+ 119904119868)minus1

Σ2119895

(C11)

which implies

10038171003817100381710038171198791 minus 11987921003817100381710038171003817 le

119889Γ

sum

119895=1

120591119895

100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1100381710038171003817100381710038171003817

10038171003817100381710038171003817Σ2minus Σ210038171003817100381710038171003817HS

times100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1

Σ2100381710038171003817100381710038171003817

10038171003817100381710038171003817119895

10038171003817100381710038171003817

= 119874119901(1

119904radic119899)

(C12)

For the third term the following holds

10038171003817100381710038171198792 minus 11987910038171003817100381710038172

HS =

119889Γ

sum

119895=1

120591119895

100381710038171003817100381710038171003817((Σ2+ 119904119868)minus1

Σ minus Σminus1) 119906119895

100381710038171003817100381710038171003817

2

(C13)

and for each 119895 = 1 119889Γ

100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1

Σ119906119895minus Σminus1119906119895

100381710038171003817100381710038171003817

le100381710038171003817100381710038171003817(Σ2+ 119904119868)minus1

Σ2119895minus 119895

100381710038171003817100381710038171003817

=

1003817100381710038171003817100381710038171003817100381710038171003817

infin

sum

119894=1

(1199082

119895

119904 + 1199082119895

minus 1)⟨119895 119890119894⟩ 119890119894

1003817100381710038171003817100381710038171003817100381710038171003817

= (

infin

sum

119894=1

1199042

(119904 + 1199082119894)2⟨119895 119890119894⟩2

)

12

le119904

119908119873

(

119873

sum

119894=1

⟨119895 119890119894⟩2

)

12

+ (

infin

sum

119894=119873+1

⟨119895 119890119894⟩2

)

12

=119904

1199082119873

10038171003817100381710038171003817Π119873(119895)10038171003817100381710038171003817+10038171003817100381710038171003817Πperp

119873(119895)10038171003817100381710038171003817

(C14)

Combining these terms results in (33)

Since Πperp119873(119895) rarr 0 as119873 rarr infin consequently we have

10038171003817100381710038171003817119904minus 11987910038171003817100381710038171003817HS= 119900119901 (1) (C15)

if 119904 rarr 0 and 119904radic119899 rarr infinIf all the edr directions 120573

119894depend only on a finite

number of eigenvectors of the covariance operator then thereexist some 119873 gt 1 such that Slowast = spanΣ119890

119894 119894 = 1 119873

This implies that

119895= Σminus1119906119895isin Σminus1(Slowast) sub span 119890

119894 119894 = 1 119873 (C16)

Therefore Πperp119873(119895) = 0 Let 119904 = 119874(119899minus14) the rate is119874(11989914)

Acknowledgments

The authors acknowledge the support of the National Sci-ence Foundation (DMS-0732276 and DMS-0732260) and theNational Institutes ofHealth (P50GM081883) Any opinionsfindings and conclusions or recommendations expressed inthis work are those of the authors and do not necessarilyreflect the views of the NSF or NIH

References

[1] K Li ldquoSliced inverse regression for dimension reduction (withdiscussion)rdquo Journal of the American Statistical Association vol86 no 414 pp 316ndash342 1991

[2] N Duan and K Li ldquoSlicing regression a link-free regressionmethodrdquoTheAnnals of Statistics vol 19 no 2 pp 505ndash530 1991

[3] R Cook and S Weisberg ldquoCommentrdquo Journal of the AmericanStatistical Association vol 86 no 414 pp 328ndash332 1991

[4] K C Li ldquoOn principal hessian directions for data visualizationand dimension reduction another application of steinrsquos lemmardquoJournal of the American Statistical Association vol 87 no 420pp 1025ndash1039 1992

[5] L Li R D Cook and C-L Tsai ldquoPartial inverse regressionrdquoBiometrika vol 94 no 3 pp 615ndash625 2007

[6] B Scholkopf A J Smola and K Muller ldquoKernel principalcomponent analysisrdquo in Proceedings of the Artificial NeuralNetworks (ICANN rsquo97) W Gerstner A Germond M Haslerand J-D Nicoud Eds vol 1327 of Lecture Notes in ComputerScience pp 583ndash588 Springer Lausanne Switzerland October1997

[7] B Scholkopf and A J Smola Learning with Kernels The MITPress Massachusetts Mass USA 2002

[8] G Baudat and F Anouar ldquoGeneralized discriminant analysisusing a kernel approachrdquo Neural Computation vol 12 no 10pp 2385ndash2404 2000

[9] F R Bach and M I Jordan ldquoKernel independent componentanalysisrdquo Journal of Machine Learning Research vol 3 no 1 pp1ndash48 2002

[10] H-M Wu ldquoKernel sliced inverse regression with applicationsto classificationrdquo Journal of Computational and Graphical Statis-tics vol 17 no 3 pp 590ndash610 2008

[11] Y-R Yeh S-Y Huang and Y-J Lee ldquoNonlinear dimensionreduction with kernel sliced inverse regressionrdquo IEEE Trans-actions on Knowledge and Data Engineering vol 21 no 11 pp1590ndash1603 2009

Abstract and Applied Analysis 11

[12] J Mercer ldquoFunctions of positive and negative type and theirconnection with the theory of integral equationsrdquo PhilosophicalTransactions of the Royal Society of London A vol 209 no 441ndash458 pp 415ndash446 1909

[13] H Konig Eigenvalue Distribution of Compact Operators vol16 of Operator Theory Advances and Applications CH BaselSwitzerland 1986

[14] L Ferre and A Yao ldquoFunctional sliced inverse regression anal-ysisrdquo Statistics vol 37 no 6 pp 475ndash488 2003

[15] P Hall and K-C Li ldquoOn almost linearity of low-dimensionalprojections from high-dimensional datardquo The Annals of Statis-tics vol 21 no 2 pp 867ndash889 1993

[16] G He H-G Muller and J-L Wang ldquoFunctional canonicalanalysis for square integrable stochastic processesrdquo Journal ofMultivariate Analysis vol 85 no 1 pp 54ndash77 2003

[17] L Ferre and A Yao ldquoSmoothed functional inverse regressionrdquoStatistica Sinica vol 15 no 3 pp 665ndash683 2005

[18] W Zhong P Zeng P Ma J S Liu and Y Zhu ldquoRSIR regular-ized sliced inverse regression for motif discoveryrdquo Bioinformat-ics vol 21 no 22 pp 4169ndash4175 2005

[19] L Ferre and N Villa ldquoMultilayer perceptron with functionalinputs an inverse regression approachrdquo Scandinavian Journalof Statistics vol 33 no 4 pp 807ndash823 2006

[20] L Li andX Yin ldquoSliced inverse regressionwith regularizationsrdquoBiometrics vol 64 no 1 pp 124ndash131 2008

[21] C Bernard-Michel L Gardes and S Girard ldquoGaussian regular-ized sliced inverse regressionrdquo Statistics and Computing vol 19no 1 pp 85ndash98 2009

[22] T Hsing and R J Carroll ldquoAn asymptotic theory for slicedinverse regressionrdquo The Annals of Statistics vol 20 no 2 pp1040ndash1061 1992

[23] J Saracco ldquoAn asymptotic theory for sliced inverse regressionrdquoCommunications in Statistics vol 26 no 9 pp 2141ndash2171 1997

[24] L X Zhu and K W Ng ldquoAsymptotics of sliced inverse regres-sionrdquo Statistica Sinica vol 5 no 2 pp 727ndash736 1995

[25] T Kurita and T Taguchi ldquoA kernel-based fisher discriminantanalysis for face detectionrdquo IEICE Transactions on Informationand Systems vol 88 no 3 pp 628ndash635 2005

[26] K Fukumizu F R Bach and M I Jordan ldquoKernel dimensionreduction in regressionrdquo The Annals of Statistics vol 37 no 4pp 1871ndash1905 2009

[27] T Kato Perturbation Theory for Linear Operators SpringerBerlin Germany 1966

[28] F Chatelin Spectral Approximation of Linear Operators Aca-demic Press 1983

[29] G Blanchard O Bousquet and L Zwald ldquoStatistical propertiesof kernel principal component analysisrdquoMachine Learning vol66 no 2-3 pp 259ndash294 2007

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article Kernel Sliced Inverse Regression ...downloads.hindawi.com/journals/aaa/2013/540725.pdf · high-dimensional space [ ]. is conforms to the argument in [ ] that the

8 Abstract and Applied Analysis

Table 1 Mean and standard deviations for error rates in classification of digits

Digit RKSIR KSIR RSIR kNN0 00273 (00089) 00472 (00191) 00487 (00128) 00291 (00071)1 00150 (00049) 00177 (00051) 00292 (00113) 00052 (00012)2 01039 (00207) 01475 (00497) 01921 (00238) 02008 (00186)3 00845 (00208) 01279 (00494) 01723 (00283) 01092 (00130)4 00784 (00240) 01044 (00461) 01327 (00327) 01617 (00213)5 00877 (00209) 01327 (00540) 02146 (00294) 01419 (00193)6 00472 (00108) 00804 (00383) 00816 (00172) 00446 (00081)7 00887 (00169) 01119 (00357) 01354 (00172) 01140 (00125)8 00981 (00259) 01490 (00699) 01981 (00286) 01140 (00156)9 00774 (00251) 01095 (00398) 01533 (00212) 02006 (00153)Average 00708 (00105) 01016 (00190) 01358 (00093) 01177 (00039)

operators on such kernel spaces are used to characterize theconditional independence between linear projections and theresponse variable Therefore their method estimates linearedr subspaces while in [10 11] and in this paper RKHS isused to model the edr subspace which leads to nonlineardimension reduction

There are several open issues in regularized kernel SIRmethod such as the selection of kernels regularizationparameters and number of dimensions A direct assessmentof the nonlinear edr directions is expected to reduce thecomputational burden in procedures based on cross valida-tion While these are well established in linear dimensionreduction however little is known for nonlinear dimensionreduction We would like to leave them for future research

There are some interesting connections between KSIRand functional SIR which are developed by Ferre and hiscoauthors in a series of papers [14 17 19] In functional SIRthe observable data are functions and the goal is to find linearedr directions for functional data analysis In KSIR theobservable data are typically not functions but mapped into afunction space in order to characterize nonlinear structuresThis suggests that computations involved in functional SIRcan be simplified by a parametrization with respect to anRKHS or using a linear kernel in the parametrized functionspace On the other hand from a theoretical point of viewKSIR can be viewed as a special case of functional SIRalthough our current theoretical results on KSIR are differentfrom the ones for functional SIR

Appendices

A Proof of Theorem 3

Under the assumption of Proposition 2 for each 119884 = 119910

E [120601 (119883) | 119884 = 119910] isin span Σ120573119894 119894 = 1 119889 (A1)

Since Γ = cov[E(120601(119883) | 119884)] the rank of Γ (ie thedimension of the image of Γ) is less than 119889 which impliesthat Γ is compact With the fact that it is symmetric andsemipositive there exist 119889

Γpositive eigenvalues 120591

119894119889Γ

119894=1and

eigenvectors 119906119894119889Γ

119894=1 such that Γ = sum119889Γ

119894=1120591119894119906119894otimes 119906119894

Recall that for any 119891 isinH

Γ119891 = E ⟨E [120601 (119883) | 119884] 119891⟩E [120601 (119883) | 119884] (A2)

also belongs to

span Σ120573119894 119894 = 1 119889 sub range (Σ) (A3)

because of (A1) so

119906119894=1

120591119894

Γ119906119894isin range (Σ) (A4)

This proves (i).

Since for each $f\in\mathcal{H}$ we have $\Gamma f\in\mathrm{range}(\Sigma)$, the operator $T = \Sigma^{-1}\Gamma$ is well defined over the whole space. Moreover,
\[
Tf = \Sigma^{-1}\Bigl(\sum_{i=1}^{d_\Gamma}\tau_i\langle u_i,f\rangle\, u_i\Bigr)
   = \sum_{i=1}^{d_\Gamma}\tau_i\langle u_i,f\rangle\,\Sigma^{-1}(u_i)
   = \Bigl(\sum_{i=1}^{d_\Gamma}\tau_i\,\Sigma^{-1}(u_i)\otimes u_i\Bigr)f. \tag{A.5}
\]
This proves (ii).
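As a sanity check on the finite-rank representation in (A.5), the following sketch verifies numerically that $\Sigma^{-1}\Gamma$ coincides with the finite-rank sum $\sum_i \tau_i\,\Sigma^{-1}(u_i)\otimes u_i$ built from the positive eigenpairs of $\Gamma$. It is illustrative only: the dimension, the random matrices, and the construction of $\Gamma$ are invented for the example, and a full-rank matrix stands in for the covariance operator $\Sigma$, so claim (i) is automatic in this toy setting.

```python
# Toy finite-dimensional check of Theorem 3(ii); all quantities are invented.
import numpy as np

rng = np.random.default_rng(0)
p, d = 8, 3                                   # ambient dimension, number of e.d.r. directions

A = rng.standard_normal((p, p))
Sigma = A @ A.T + 0.1 * np.eye(p)             # full-rank stand-in for the covariance operator
B = rng.standard_normal((p, d))
M = Sigma @ B                                 # columns span {Sigma beta_i}, cf. (A.1)
C = M @ rng.standard_normal((d, d))
Gamma = C @ C.T                               # symmetric, PSD, rank <= d

tau, U = np.linalg.eigh(Gamma)                # eigenvalues ascending, orthonormal eigenvectors
keep = tau > 1e-10                            # keep the d_Gamma positive eigenpairs
tau, U = tau[keep], U[:, keep]

# (A.5): Sigma^{-1} Gamma equals the finite-rank sum of tau_i Sigma^{-1}(u_i) (x) u_i
T_direct = np.linalg.solve(Sigma, Gamma)
T_sum = sum(tau[i] * np.outer(np.linalg.solve(Sigma, U[:, i]), U[:, i])
            for i in range(U.shape[1]))
print(np.allclose(T_direct, T_sum))           # True up to round-off
print(np.linalg.matrix_rank(T_direct) <= d)   # the operator is finite rank, bounded by d
```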

B. Proof of Proposition 7

We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, where $d_K$ is infinite and a matrix form does not make sense.

Let $\Phi = [\phi(x_1),\dots,\phi(x_n)]$. It has the following SVD decomposition:
\[
\Phi = [\,u_1\ \cdots\ u_{d_K}\,]
\begin{bmatrix} D_{d\times d} & 0_{d\times(n-d)} \\ 0_{(d_K-d)\times d} & 0_{(d_K-d)\times(n-d)} \end{bmatrix}
\begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix}
= UDV^T, \tag{B.1}
\]
where $U = [u_1,\dots,u_d]$, $V = [v_1,\dots,v_d]$, and $D = D_{d\times d}$ is a diagonal matrix of dimension $d\le n$.


We need to show that the KSIR variates satisfy
\[
v_j = c_j^T\,[k(x,x_1),\dots,k(x,x_n)]^T
    = c_j^T\,\Phi^T\phi(x)
    = \langle \Phi c_j,\phi(x)\rangle
    = \langle \beta_j,\phi(x)\rangle. \tag{B.2}
\]
It suffices to prove that if $(\lambda, c)$ is a solution to (28), then $(\lambda, \beta)$ is also an eigenvalue-eigenvector pair of $(\Sigma^2+\gamma I)^{-1}\Sigma\Gamma$, and vice versa, where $c$ and $\beta$ are related by
\[
\beta = \Phi c, \qquad c = VD^{-1}U^T\beta. \tag{B.3}
\]

Noting the facts $\Sigma = \frac{1}{n}\Phi\Phi^T$, $\Gamma = \frac{1}{n}\Phi J\Phi^T$, and $K = \Phi^T\Phi = VD^2V^T$, the argument may be made as follows:
\[
\begin{aligned}
\Sigma\Gamma\beta = \lambda\bigl(\Sigma^2+\gamma I\bigr)\beta
&\Longleftrightarrow \Phi\Phi^T\Phi J\Phi^T\Phi c = \lambda\bigl(\Phi\Phi^T\Phi\Phi^T\Phi c + n^2\gamma\,\Phi c\bigr)\\
&\Longleftrightarrow \Phi KJKc = \lambda\,\Phi\bigl(K^2+n^2\gamma I\bigr)c\\
&\quad\bigl(\Longleftrightarrow VV^TKJKc = \lambda\,VV^T\bigl(K^2+n^2\gamma I\bigr)c\bigr)\\
&\Longleftrightarrow KJKc = \lambda\bigl(K^2+n^2\gamma I\bigr)c.
\end{aligned} \tag{B.4}
\]
Note that the implication in the third step is needed only in the $\Rightarrow$ direction, and it is obtained by multiplying both sides by $VD^{-1}U^T$ and using the fact $U^TU = I_d$. For the last step, since $V^TV = I_d$, we use the following facts:
\[
\begin{aligned}
VV^TK &= VV^TVD^2V^T = VD^2V^T = K,\\
VV^Tc &= VV^TVD^{-1}U^T\beta = VD^{-1}U^T\beta = c.
\end{aligned} \tag{B.5}
\]
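The chain of equivalences in (B.4) is what makes the method computable: the regularized eigenproblem in the feature space can be solved entirely through the kernel matrix. The sketch below uses an explicit finite-dimensional feature map and a generic symmetric positive semidefinite matrix in place of the slicing matrix $J$ (the sample size, feature dimension, and ridge parameter are arbitrary choices for illustration); it checks that the nonzero spectra of the two eigenproblems agree and that $\beta = \Phi c$ carries a kernel-space eigenvector to a feature-space eigenvector.

```python
# Numerical check of the equivalence in (B.4); sizes and random inputs are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, d_K, gamma = 30, 5, 1e-2                    # samples, feature dimension, ridge parameter

Phi = rng.standard_normal((d_K, n))            # columns are phi(x_i)
K = Phi.T @ Phi                                # kernel (Gram) matrix, n x n
A = rng.standard_normal((n, n))
J = A @ A.T / n                                # stand-in for the slicing matrix (symmetric PSD)

Sigma = Phi @ Phi.T / n                        # empirical covariance in feature space
Gamma = Phi @ J @ Phi.T / n                    # empirical between-slice operator

# Feature-space operator (Sigma^2 + gamma I)^{-1} Sigma Gamma
T_feat = np.linalg.solve(Sigma @ Sigma + gamma * np.eye(d_K), Sigma @ Gamma)
# Kernel-space operator (K^2 + n^2 gamma I)^{-1} K J K, cf. (B.4)
T_kern = np.linalg.solve(K @ K + n**2 * gamma * np.eye(n), K @ J @ K)

ev_feat = np.sort(np.linalg.eigvals(T_feat).real)[::-1]
ev_kern = np.sort(np.linalg.eigvals(T_kern).real)[::-1]
print(np.allclose(ev_feat, ev_kern[:d_K], atol=1e-8))      # nonzero spectra agree

# beta = Phi c maps a kernel-space eigenvector to a feature-space one
lam, C = np.linalg.eig(T_kern)
j = np.argmax(lam.real)
beta = Phi @ C[:, j].real
print(np.allclose(T_feat @ beta, lam[j].real * beta, atol=1e-8))
```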

In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. For the infinite-dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^n v_i\phi(x_i)$ for $v = (v_1,\dots,v_n)^T\in\mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$ such that $\Phi^T f = (\langle\phi(x_1),f\rangle_K,\dots,\langle\phi(x_n),f\rangle_K)^T$ for $f\in\mathcal{H}_K$. The notions $U$ and $U^T$ are defined similarly.

The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\Sigma$ as a covariance operator. Since the rank of $\Sigma$ is less than $n$, it is compact and has the following representation:
\[
\Sigma = \sum_{i=1}^{d_K}\sigma_i\, u_i\otimes u_i = \sum_{i=1}^{d}\sigma_i\, u_i\otimes u_i, \tag{B.6}
\]
where $d\le n$ is the rank and $\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_d>\sigma_{d+1}=\cdots=0$. This implies that each $\phi(x_i)$ lies in $\mathrm{span}(u_1,\dots,u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1,\dots,u_d)$ should be considered as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i\in\mathbb{R}^d$. Denote $\Upsilon = (\tau_1,\dots,\tau_n)^T\in\mathbb{R}^{n\times d}$. It is easy to check that $\Upsilon^T\Upsilon = \mathrm{diag}(n\sigma_1,\dots,n\sigma_d)$. Let $D_{d\times d} = \mathrm{diag}(\sqrt{n\sigma_1},\dots,\sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD for $\Phi$ as $\Phi = UDV^T$, which is well defined.

C. Proof of Consistency

C.1. Preliminaries. In order to prove Theorems 9 and 10, we use properties of Hilbert-Schmidt operators, of covariance operators for Hilbert-space-valued random variables, and of the perturbation theory for linear operators. In this subsection we provide a brief introduction to them; for details see [27-29] and the references therein.

Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if
\[
\|L\|_{\mathrm{HS}}^2 = \sum_{i=1}^{p_{\mathcal{H}}}\|Le_i\|_{\mathcal{H}}^2 < \infty, \tag{C.1}
\]
where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{\mathrm{HS}}$. Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:
\[
\|SL\|_{\mathrm{HS}} \le \|S\|\,\|L\|_{\mathrm{HS}}, \qquad \|LS\|_{\mathrm{HS}} \le \|L\|_{\mathrm{HS}}\,\|S\|, \tag{C.2}
\]
where $\|\cdot\|$ denotes the default operator norm
\[
\|L\|^2 = \sup_{f\in\mathcal{H}}\frac{\|Lf\|^2}{\|f\|^2}. \tag{C.3}
\]

Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^2<\infty$. The covariance operator
\[
\Sigma = \mathbb{E}\bigl[(Z-\mathbb{E}Z)\otimes(Z-\mathbb{E}Z)\bigr] \tag{C.4}
\]
is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.

A well-known result from the perturbation theory for linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ at the same rate as the convergence of the operators.
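In finite dimensions these objects have familiar matrix counterparts: the Hilbert-Schmidt norm is the Frobenius norm and the operator norm in (C.3) is the largest singular value. The following sketch (random matrices and an arbitrary dimension, chosen only for illustration) checks the definition (C.1), the inequalities (C.2), and the basic properties of an empirical version of the covariance operator (C.4).

```python
# Finite-dimensional illustration of (C.1)-(C.4); matrices stand in for operators.
import numpy as np

rng = np.random.default_rng(2)
p = 6
L = rng.standard_normal((p, p))
S = rng.standard_normal((p, p))

# Definition (C.1): sum of ||L e_i||^2 over an orthonormal basis, i.e. the Frobenius norm
hs = lambda M: np.sqrt(sum(np.linalg.norm(M @ e) ** 2 for e in np.eye(p)))
print(np.isclose(hs(L), np.linalg.norm(L, 'fro')))

op = lambda M: np.linalg.norm(M, 2)            # operator norm (C.3) = largest singular value
print(hs(S @ L) <= op(S) * hs(L) + 1e-12)      # first inequality in (C.2)
print(hs(L @ S) <= hs(L) * op(S) + 1e-12)      # second inequality in (C.2)

# Empirical version of the covariance operator (C.4): symmetric and positive semidefinite
Z = rng.standard_normal((1000, p)) @ rng.standard_normal((p, p))
Cov = np.cov(Z, rowvar=False, bias=True)
print(np.allclose(Cov, Cov.T), np.all(np.linalg.eigvalsh(Cov) >= -1e-10))
```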

C.2. Proof of Theorem 10. We will use the following result from [14]:
\[
\bigl\|\widehat\Sigma-\Sigma\bigr\|_{\mathrm{HS}} = O_p\Bigl(\frac{1}{\sqrt{n}}\Bigr), \qquad
\bigl\|\widehat\Gamma-\Gamma\bigr\|_{\mathrm{HS}} = O_p\Bigl(\frac{1}{\sqrt{n}}\Bigr). \tag{C.5}
\]
To simplify the notation, we denote $\widehat T_s = (\widehat\Sigma^2+sI)^{-1}\widehat\Sigma\widehat\Gamma$. Also define
\[
T_1 = (\widehat\Sigma^2+sI)^{-1}\Sigma\Gamma, \qquad T_2 = (\Sigma^2+sI)^{-1}\Sigma\Gamma. \tag{C.6}
\]
Then
\[
\bigl\|\widehat T_s - T\bigr\|_{\mathrm{HS}} \le \bigl\|\widehat T_s - T_1\bigr\|_{\mathrm{HS}} + \bigl\|T_1 - T_2\bigr\|_{\mathrm{HS}} + \bigl\|T_2 - T\bigr\|_{\mathrm{HS}}. \tag{C.7}
\]


For the first term, observe that
\[
\bigl\|\widehat T_s - T_1\bigr\|_{\mathrm{HS}}
\le \bigl\|(\widehat\Sigma^2+sI)^{-1}\bigr\|\,\bigl\|\widehat\Sigma\widehat\Gamma-\Sigma\Gamma\bigr\|_{\mathrm{HS}}
= O_p\Bigl(\frac{1}{s\sqrt{n}}\Bigr). \tag{C.8}
\]
For the second term, note that
\[
T_1 = \sum_{j=1}^{d_\Gamma}\tau_j\bigl((\widehat\Sigma^2+sI)^{-1}\Sigma u_j\bigr)\otimes u_j, \qquad
T_2 = \sum_{j=1}^{d_\Gamma}\tau_j\bigl((\Sigma^2+sI)^{-1}\Sigma u_j\bigr)\otimes u_j. \tag{C.9}
\]

Therefore,
\[
\bigl\|T_1 - T_2\bigr\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma}\tau_j\Bigl\|\bigl((\widehat\Sigma^2+sI)^{-1}-(\Sigma^2+sI)^{-1}\bigr)\Sigma u_j\Bigr\|. \tag{C.10}
\]
Since $u_j\in\mathrm{range}(\Sigma)$, there exists $\eta_j$ such that $u_j = \Sigma\eta_j$. Then
\[
\bigl((\widehat\Sigma^2+sI)^{-1}-(\Sigma^2+sI)^{-1}\bigr)\Sigma u_j
= (\widehat\Sigma^2+sI)^{-1}\bigl(\Sigma^2-\widehat\Sigma^2\bigr)(\Sigma^2+sI)^{-1}\Sigma^2\eta_j, \tag{C.11}
\]
which implies
\[
\bigl\|T_1 - T_2\bigr\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma}\tau_j\,\bigl\|(\widehat\Sigma^2+sI)^{-1}\bigr\|\,\bigl\|\Sigma^2-\widehat\Sigma^2\bigr\|_{\mathrm{HS}}\,\bigl\|(\Sigma^2+sI)^{-1}\Sigma^2\bigr\|\,\bigl\|\eta_j\bigr\|
= O_p\Bigl(\frac{1}{s\sqrt{n}}\Bigr). \tag{C.12}
\]

For the third term, the following holds:
\[
\bigl\|T_2 - T\bigr\|_{\mathrm{HS}}^2 = \sum_{j=1}^{d_\Gamma}\tau_j^2\Bigl\|\bigl((\Sigma^2+sI)^{-1}\Sigma-\Sigma^{-1}\bigr)u_j\Bigr\|^2, \tag{C.13}
\]

and, for each $j = 1,\dots,d_\Gamma$,
\[
\begin{aligned}
\Bigl\|(\Sigma^2+sI)^{-1}\Sigma u_j - \Sigma^{-1}u_j\Bigr\|
&\le \Bigl\|(\Sigma^2+sI)^{-1}\Sigma^2\eta_j - \eta_j\Bigr\|
= \Biggl\|\sum_{i=1}^{\infty}\Bigl(\frac{w_i^2}{s+w_i^2}-1\Bigr)\langle\eta_j,e_i\rangle\, e_i\Biggr\|\\
&= \Biggl(\sum_{i=1}^{\infty}\frac{s^2}{(s+w_i^2)^2}\langle\eta_j,e_i\rangle^2\Biggr)^{1/2}
\le \frac{s}{w_N^2}\Biggl(\sum_{i=1}^{N}\langle\eta_j,e_i\rangle^2\Biggr)^{1/2}
+ \Biggl(\sum_{i=N+1}^{\infty}\langle\eta_j,e_i\rangle^2\Biggr)^{1/2}\\
&= \frac{s}{w_N^2}\bigl\|\Pi_N(\eta_j)\bigr\| + \bigl\|\Pi_N^{\perp}(\eta_j)\bigr\|,
\end{aligned} \tag{C.14}
\]
where the last inequality follows by splitting the sum at an arbitrary $N$ and using $s/(s+w_i^2)\le s/w_N^2$ for $i\le N$ and $s/(s+w_i^2)\le 1$ for $i>N$. Combining these terms results in (33).

Since $\|\Pi_N^{\perp}(\eta_j)\|\to 0$ as $N\to\infty$, we consequently have
\[
\bigl\|\widehat T_s - T\bigr\|_{\mathrm{HS}} = o_p(1) \tag{C.15}
\]
if $s\to 0$ and $s\sqrt{n}\to\infty$.

If all the e.d.r. directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N>1$ such that $\mathcal{S}^{*} = \mathrm{span}\{\Sigma e_i,\ i=1,\dots,N\}$. This implies that
\[
\eta_j = \Sigma^{-1}u_j \in \Sigma^{-1}(\mathcal{S}^{*}) \subset \mathrm{span}\{e_i,\ i=1,\dots,N\}. \tag{C.16}
\]
Therefore $\Pi_N^{\perp}(\eta_j) = 0$. Letting $s = O(n^{-1/4})$, the rate is $O_p(n^{-1/4})$.
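The root-$n$ rate in (C.5), which drives the bounds above, is easy to see in simulation. The sketch below is purely illustrative (Gaussian data, an arbitrary dimension, and a small Monte Carlo design chosen for the example): $\sqrt{n}\,\|\widehat\Sigma-\Sigma\|_{\mathrm{HS}}$ stays roughly constant as $n$ grows, consistent with the $O_p(1/\sqrt{n})$ rate.

```python
# Illustrative Monte Carlo for the root-n rate of the empirical covariance in HS norm.
import numpy as np

rng = np.random.default_rng(3)
p = 10
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p                               # population covariance (arbitrary choice)

for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(200):                          # Monte Carlo repetitions
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        Sigma_hat = np.cov(X, rowvar=False, bias=True)
        errs.append(np.linalg.norm(Sigma_hat - Sigma, 'fro'))   # HS = Frobenius norm
    # sqrt(n) * error stays roughly constant across n, consistent with O_p(1/sqrt(n))
    print(n, np.sqrt(n) * np.mean(errs))
```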

Acknowledgments

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50 GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or the NIH.

References

[1] K. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316–342, 1991.
[2] N. Duan and K. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505–530, 1991.
[3] R. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328–332, 1991.
[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025–1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615–625, 2007.
[6] B. Scholkopf, A. J. Smola, and K. Muller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583–588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Scholkopf and A. J. Smola, Learning with Kernels, The MIT Press, Massachusetts, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590–1603, 2009.
[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441–458, pp. 415–446, 1909.
[13] H. Konig, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Basel, Switzerland, 1986.
[14] L. Ferre and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475–488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867–889, 1993.
[16] G. He, H.-G. Muller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54–77, 2003.
[17] L. Ferre and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665–683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169–4175, 2005.
[19] L. Ferre and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807–823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124–131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85–98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040–1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141–2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727–736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628–635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871–1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.
