On splines approximation for sliced average variance estimation


Journal of Statistical Planning and Inference 139 (2009) 1493 -- 1505



Zhou Yu (a), Li-Ping Zhu (a), Li-Xing Zhu (a,b,*)

(a) East China Normal University, Shanghai, China
(b) Hong Kong Baptist University, Hong Kong, China

ARTICLE INFO

Article history: Received 27 April 2006; received in revised form 18 July 2008; accepted 25 July 2008; available online 20 August 2008.

Keywords: Asymptotic normality; B-spline; Bayes information criterion; Dimension reduction; Sliced average variance estimation; Structural dimension

ABSTRACT

To avoid the inconsistency and slow convergence rate of the slicing estimator of sliced average variance estimation (SAVE), particularly in the continuous response case, we suggest a B-spline approximation that makes the estimator √n consistent while keeping the easy implementation that slicing estimation enjoys. Compared with the kernel estimation that has been used in the literature, the B-spline approximation is more accurate and easier to implement. To estimate the structural dimension of the central dimension reduction space, a modified Bayes information criterion is suggested that makes the leading term and the penalty term comparable in magnitude; this modification helps to enhance the efficacy of the estimation. The methodologies and theoretical results are illustrated through an application to the horse mussel data and through simulation comparisons with existing methods.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Consider the regression of a one-dimensional response Y on a p-dimensional predictor vector X = (X_1, ..., X_p)^T, where T denotes the transpose operator. When p becomes large, the well-known curse of dimensionality causes data sparseness, and the estimation of the regression function suffers from it badly. Recent developments to circumvent this issue focus on semiparametric models, such as multi-index models in which Y is related to X through K linear combinations of X, say B^T X, where B is a p × K matrix. This is equivalent to saying that Y is independent of X when B^T X is given. When K is much smaller than p, this model achieves the goal of dimension reduction, and the column space of B is called a dimension reduction subspace. As B is not unique, Cook (1998) defined the central dimension reduction (CDR) subspace as the intersection of all dimension reduction subspaces satisfying this conditional independence, denoted by S_{Y|X}. Cook (1996, 1998) showed that the CDR subspace exists under some mild conditions.

Denote the positive definite covariance matrix of X by Σ. When the standardized version of X, Z = Σ^{-1/2}(X − E(X)), is used, the CDR subspace satisfies S_{Y|Z} = Σ^{1/2} S_{Y|X} (see Cook, 1998, Chapters 10, 11). That is, the CDR subspace based on the standardized variable Z can be easily transformed back to that based on the original predictor vector X. Therefore, we shall use the standardized variable Z throughout this paper.

Sliced inverse regression (SIR) (Li, 1991) and sliced average variance estimation (SAVE) (Cook and Weisberg, 1991) are two promising tools that can be used to recover the CDR subspace.

The first author was supported by the National Social Science Foundation of China (08CTJ001), and the second author was supported by a scholarship under the State Scholarship Fund ([2007]3020). The third author was supported by an RGC grant from the Research Grants Council of Hong Kong, Hong Kong, China. The authors thank the editor, the associate editor and the referees for their constructive comments and suggestions, which led to a significant improvement in the presentation of the early draft.

* Corresponding author at: Hong Kong Baptist University, Hong Kong, China.
E-mail address: [email protected] (L.-X. Zhu).

doi:10.1016/j.jspi.2008.07.017


SIR estimates the CDR subspace through the eigenvectors that are associated with the non-zero eigenvalues of the candidate matrix Cov[E(Z|Y)], while SAVE uses the candidate matrix E[I_p − Cov(Z|Y)]^2, where I_p is the p × p identity matrix. When SIR is used, we need the following linearity condition:

E(Z | P_{S_{Y|Z}} Z) = P_{S_{Y|Z}} Z,   (1.1)

and for SAVE, we need in addition the constant variance condition:

Cov(Z | P_{S_{Y|Z}} Z) = I_p − P_{S_{Y|Z}},   (1.2)

where P_{(·)} stands for the projection operator based on the standard inner product (Li, 1991; Cook, 1998).

How to estimate the candidate matrices of SIR and of SAVE from the observations is naturally of importance. There are two proposals in the literature: slicing estimation and kernel smoothing. Slicing estimation, proposed by Li (1991), has become one of the standard methods in this area. The idea is to divide the range of Y into several slices and then to estimate the SIR matrix through the average of the covariance matrices of Z within the slices. Zhu and Ng (1995) established the asymptotic normality of the slicing estimator of the SIR matrix when the number of slices ranges from √n to n/2. The slicing algorithm is easy to implement in practice and can also be applied to estimate the SAVE matrix (Cook and Weisberg, 1991); an illustrative sketch of this slicing estimator is given below. However, further study demonstrates that the asymptotic behavior of the slicing estimator of SAVE is very different from that of SIR. Unlike SIR, which is insensitive to the number of slices, Cook (2000) pointed out that for SAVE the number of slices plays the role of a smoothing parameter, so SAVE may be affected by this choice. The empirical study of Zhu et al. (2007) also indicated that the slicing estimator of SAVE is sensitive to the selection of the number of slices. A rather surprising finding was obtained by Li and Zhu (2007), who provided a systematic study of the convergence of the slicing estimator of the SAVE matrix: a rigorous proof shows that when Y is continuous the estimator cannot be √n consistent, and when each slice contains a fixed number of data points not depending on n it is even inconsistent. This observation motivates us to consider alternatives to slicing estimation. Clearly, any sophisticated nonparametric smoother can be an alternative. Zhu and Zhu (2007) suggested kernel estimation, following the idea of Zhu and Fang (1996) for SIR. Although the kernel estimator achieves good theoretical properties, its implementation is much more difficult than that of the slicing estimator. To keep computational efficacy and √n consistency simultaneously, in the first part of this paper we suggest a B-spline approximation, owing to its least squares and parametrization nature. The asymptotic normality can be achieved for a wide range of numbers of knots. Compared with the kernel method, the B-spline approximation gains higher efficiency, especially when the sample size is relatively small.
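As a point of reference for the slicing estimator discussed above, the following is a minimal sketch (not the authors' code) of slicing estimation of the SAVE candidate matrix E[I_p − Cov(Z|Y)]^2, assuming standardized predictors Z and slices formed from equal-count quantile bins of the response; the function name and the default number of slices are illustrative choices only.

```python
import numpy as np

def save_slicing(Z, y, n_slices=10):
    """Illustrative slicing estimator of E[(I_p - Cov(Z|Y))^2]: partition the
    range of Y into slices with roughly equal counts, estimate Cov(Z|Y) within
    each slice, and average (I_p - Cov_s)^2 with slice-proportion weights."""
    n, p = Z.shape
    # slice boundaries at empirical quantiles of y
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    labels = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_slices - 1)
    M = np.zeros((p, p))
    for s in range(n_slices):
        idx = labels == s
        if idx.sum() < 2:            # skip degenerate slices
            continue
        C = np.cov(Z[idx], rowvar=False)      # within-slice covariance of Z
        A = np.eye(p) - C
        M += idx.mean() * (A @ A)             # weight by slice proportion
    return M
```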

To determine the structural dimension of the CDR subspace, the sequential test proposed by Li (1991) has received much attention. However, this test depends heavily on the asymptotic normality of the related estimator and on its covariance matrix. In Section 2 we will see that, although asymptotic normality can be achieved by the B-spline approximation, the covariance matrix has a very complicated structure, which is always the case when estimating the SAVE matrix; the sequential test procedure thus suffers from a complicated plug-in estimation. Zhu et al. (2006) proposed a BIC-type criterion to estimate the structural dimension when SIR is applied. For SAVE, Zhu and Zhu (2007) suggested the Bayes information criterion (BIC) to consistently estimate the dimension when kernel estimation is employed. Our empirical studies suggest that, when the sample size is relatively small, the estimate of the structural dimension under this criterion tends to have more variation than we expect. We address this problem in the second part of the paper: we modify the BIC suggested in Zhu and Zhu (2007) by rescaling the leading term and the penalty term to make them comparable in magnitude. The empirical study shows that, after such an adjustment, the modified BIC-type criterion achieves higher accuracy. We also prove the consistency of the resulting estimator of the structural dimension.

The rest of the paper is organized as follows. The asymptotic results for the B-spline approximation are presented in Section 2. In Section 3, we discuss the estimation of the structural dimension. Section 4 reports a simulation study demonstrating the efficiency of the B-spline approximation and the modified BIC-type criterion. We use the horse mussel data to show that the convergence of the B-spline approximation holds for a wide range of numbers of knots, and that the decision on the structural dimension by our proposed BIC-type criterion is insensitive to the choice of the number of knots. All technical details are postponed to the Appendix.

2. Asymptotic behavior of the splines approximation

Let Z and its independent copies z_j be

Z = (Z_1, ..., Z_p)^T,  z_j = (z_{1j}, ..., z_{pj})^T,  j = 1, ..., n,

and define A^2 = AA for any square symmetric matrix A. Then the population version of the SAVE matrix is

Λ = E(I_p − Cov(Z|Y))^2 = I_p − 2E(Cov(Z|Y)) + E(Cov(Z|Y))^2.

For notational simplicity, let R_{kl}(y) = E(Z_k Z_l | Y = y) and r_k(y) = E(Z_k | Y = y), and let δ_{kl} = 1 if k = l and δ_{kl} = 0 otherwise, for 1 ≤ k, l ≤ p. Then the kl-th element Λ_{kl} of Λ can be written as

Λ_{kl} = δ_{kl} − 2E(R_{kl}(Y) − r_k(Y)r_l(Y)) + E( ∑_{h=1}^{p} ( R_{kh}(Y)R_{hl}(Y) − R_{kh}(Y)r_h(Y)r_l(Y) − r_k(Y)r_h(Y)R_{hl}(Y) + r_k(Y)r_l(Y)r_h^2(Y) ) ).


Now we introduce the B-spline approximation to estimate the involved conditional expectations. A spline is a piecewise polynomial that is smoothly connected at its knots. Specifically, for any fixed integer m > 1, let S(m, t) denote the set of spline functions with knots t = {0 = t_0 < t_1 < ... < t_{k_0+1} = 1}; then for m ≥ 2, S(m, t) = {s ∈ C^{m−2}[0, 1] : s(y) is a polynomial of degree (m − 1) on each subinterval [t_i, t_{i+1}]}. The common choices of m are two for linear splines, three for quadratic splines and four for cubic splines. Here k_0 is referred to as the number of internal knots. It is convenient to express elements of S(m, t) in terms of B-splines. For any fixed m and t, let

N_{i,m}(y) = (t_i − t_{i−m}) [t_{i−m}, ..., t_i] (t − y)_+^{m−1},  i = 1, ..., J = k_0 + m,

where [t_{i−m}, ..., t_i] g denotes the m-th order divided difference of the function g and t_i = t_{min(max(i,0), k_0+1)} for any i = 1 − m, ..., J. Then {N_{i,m}(·)}_{i=1}^{J} form a basis for S(m, t); see Schumaker (1981, p. 124). That is, for any s(y) ∈ S(m, t), there exists an α such that s(y) = α^T N_m(y), where N_m(y) = (N_{1,m}(y), ..., N_{J,m}(y))^T. For notational convenience, N_m(·) will be abbreviated as N(·) in the sequel.

Consider the spline approximation of a conditional expectation, say r_j(y), based on the sample. The estimator of order m for r_j(y) is defined to be the least squares minimizer r̂_j(y) ∈ S(m, t) satisfying

∑_{i=1}^{n} (z_{ji} − r̂_j(y_i))^2 = min_{s_j ∈ S(m,t)} ∑_{i=1}^{n} (z_{ji} − s_j(y_i))^2.
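As a computational aside (not part of the original derivation), the basis N(y) can be evaluated with the Cox–de Boor recursion, which agrees with the divided-difference definition above under the clamped knot convention. The sketch below assumes the response has been rescaled to [0, 1]; the function name and interface are hypothetical.

```python
import numpy as np

def bspline_design(y, interior_knots, m):
    """Evaluate the order-m B-spline basis N(y) = (N_{1,m}(y), ..., N_{J,m}(y)),
    J = k0 + m, on [0, 1] via the Cox-de Boor recursion, with the boundary
    knots 0 and 1 repeated m times (clamped convention)."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    interior_knots = np.asarray(interior_knots, dtype=float)
    k0 = len(interior_knots)
    t = np.concatenate([np.zeros(m), interior_knots, np.ones(m)])
    # order-1 basis: indicators of the knot intervals [t_i, t_{i+1})
    B = ((t[:-1] <= y[:, None]) & (y[:, None] < t[1:])).astype(float)
    # close the last nonempty interval so that y = 1 is covered
    B[y >= t[-1], :] = 0.0
    B[y >= t[-1], m + k0 - 1] = 1.0
    # Cox-de Boor recursion: order-k basis from the order-(k-1) basis
    for k in range(2, m + 1):
        Bk = np.zeros((len(y), B.shape[1] - 1))
        for i in range(B.shape[1] - 1):
            d1, d2 = t[i + k - 1] - t[i], t[i + k] - t[i + 1]
            left = (y - t[i]) / d1 * B[:, i] if d1 > 0 else 0.0
            right = (t[i + k] - y) / d2 * B[:, i + 1] if d2 > 0 else 0.0
            Bk[:, i] = left + right
        B = Bk
    return B          # n x (k0 + m) design matrix
```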

Write G_{J,n} = (1/n) ∑_{j=1}^{n} N(y_j) N^T(y_j) and its expectation G(q) = E(N(Y) N^T(Y)). The B-spline approximation of r_k(y) is given by r̂_k(y) = N^T(y) (n G_{J,n})^{-1} ∑_{l=1}^{n} N(y_l) z_{kl}. In a similar fashion, R_{kl}(y) is estimated by R̂_{kl}(y) = N^T(y) (n G_{J,n})^{-1} ∑_{i=1}^{n} N(y_i) z_{ki} z_{li}. Therefore, replacing the unknowns by their estimators, the kl-th element Λ̂_{kl} of the B-spline approximation Λ̂ can be written as

Λ̂_{kl} = δ_{kl} − (2/n) ∑_{j=1}^{n} (R̂_{kl}(y_j) − r̂_k(y_j) r̂_l(y_j)) + (1/n) ∑_{j=1}^{n} ∑_{h=1}^{p} ( R̂_{kh}(y_j) R̂_{hl}(y_j) − R̂_{kh}(y_j) r̂_h(y_j) r̂_l(y_j) − r̂_k(y_j) r̂_h(y_j) R̂_{hl}(y_j) + r̂_k(y_j) r̂_l(y_j) r̂_h^2(y_j) ).   (2.1)
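The following is a minimal sketch of the resulting estimator, assuming the basis routine sketched above and a response rescaled to [0, 1]: the fitted values r̂_k(y_j) and R̂_{kl}(y_j) are plugged into the sample average of the squared matrices I_p minus the estimated Cov(Z|y_j), which coincides with (2.1) elementwise. The function name is hypothetical, and N^T N is assumed nonsingular.

```python
import numpy as np

def save_spline(Z, y, interior_knots, m=3):
    """Sketch of the B-spline SAVE estimator of (2.1): fit r_k(y) = E(Z_k|Y=y)
    and R_kl(y) = E(Z_k Z_l|Y=y) by least squares on the spline basis, then
    average (I_p - Cov_hat(Z|y_j))^2 over the sample."""
    n, p = Z.shape
    N = bspline_design(y, interior_knots, m)      # n x J design matrix
    NtN = N.T @ N                                 # = n G_{J,n}

    def fit(W):
        # spline least squares fit N (N^T N)^{-1} N^T W; assumes NtN nonsingular
        return N @ np.linalg.solve(NtN, N.T @ W)

    r_hat = fit(Z)                                # n x p, fitted r_k(y_j)
    prods = (Z[:, :, None] * Z[:, None, :]).reshape(n, p * p)
    R_hat = fit(prods).reshape(n, p, p)           # fitted R_kl(y_j)
    Lam = np.zeros((p, p))
    for j in range(n):
        A = np.eye(p) - (R_hat[j] - np.outer(r_hat[j], r_hat[j]))
        Lam += A @ A
    return Lam / n
```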

We adopt matrix vectorization before presenting the main theorem. For a symmetric p × p matrix C = (c_{kl})_{p×p}, let Vech(C) = (c_{11}, ..., c_{p1}, c_{22}, ..., c_{p2}, c_{33}, ..., c_{pp}) be the corresponding p(p + 1)/2-dimensional vector. Then C and Vech(C) are in one-to-one correspondence.
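For concreteness, the ordering used by Vech (stacking the lower-triangular part column by column) can be written as the small helper below; the name is illustrative.

```python
import numpy as np

def vech(C):
    """Stack the lower-triangular part of a symmetric p x p matrix column by
    column: (c11, ..., cp1, c22, ..., cp2, ..., cpp), a p(p+1)/2 vector."""
    p = C.shape[0]
    return np.concatenate([C[k:, k] for k in range(p)])
```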

We are now in a position to introduce the theoretical results. Define the kl-th element of the matrix H(Z, Y) as

H_{kl}(Z, Y) = −2 ( Z_k Z_l + R_{kl}(Y) − Z_k r_l(Y) − Z_l r_k(Y) − r_k(Y) r_l(Y) − 2E(R_{kl}(Y)) + 3E(r_k(Y) r_l(Y)) )
  + ∑_{h=1}^{p} ( Z_h Z_l R_{kh}(Y) + Z_k Z_h R_{hl}(Y) + R_{kh}(Y) R_{hl}(Y) − 3E(R_{kh}(Y) R_{hl}(Y))
  − Z_k Z_h r_h(Y) r_l(Y) − Z_h r_l(Y) R_{kh}(Y) − Z_l r_h(Y) R_{kh}(Y) − R_{kh}(Y) r_l(Y) r_h(Y)
  − Z_l Z_h r_h(Y) r_k(Y) − Z_h r_k(Y) R_{lh}(Y) − Z_k r_h(Y) R_{lh}(Y) − R_{lh}(Y) r_k(Y) r_h(Y)
  + r_h^2(Y) r_k(Y) r_l(Y) + Z_l r_h^2(Y) r_k(Y) + Z_k r_h^2(Y) r_l(Y) + 2 Z_h r_h(Y) r_l(Y) r_k(Y)
  + 4E(R_{kh}(Y) r_l(Y) r_h(Y)) + 4E(R_{lh}(Y) r_k(Y) r_h(Y)) − 5E(r_h^2(Y) r_k(Y) r_l(Y)) ).   (2.2)

The asymptotic normality is stated in the following theorem.

Theorem 1. In addition to (1.1) and (1.2), assume that conditions (1)–(4) in Appendix A.1 hold. Then as n → ∞, we have

√n (Λ̂ − Λ) → H(Z, Y) in distribution,   (2.3)

where Vech(H(Z, Y)) is distributed as N(0, Cov(Vech(H(Z, Y)))).

From Theorem 1, we can derive the asymptotic normality of the eigenvalues and their corresponding normalized eigenvectors by using perturbation theory. The following result is parallel to that obtained for SIR by Zhu and Fang (1996) and Zhu and Ng (1995).

Let λ_1(A) ≥ λ_2(A) ≥ ... ≥ λ_p(A) ≥ 0 and b_i(A) = (b_{1i}(A), ..., b_{pi}(A))^T, i = 1, ..., p, denote, respectively, the eigenvalues and the corresponding normalized eigenvectors of a p × p matrix A.


Theorem 2. In addition to the conditions of Theorem 1, assume that the nonzero λ_l(Λ)'s are distinct. Then for each nonzero eigenvalue λ_i(Λ) and the corresponding normalized eigenvector b_i(Λ), we have

√n (λ_i(Λ̂) − λ_i(Λ)) = √n b_i(Λ)^T (Λ̂ − Λ) b_i(Λ) + o_p(√n ‖Λ̂ − Λ‖)
                     = b_i(Λ)^T H b_i(Λ) + o_p(√n ‖Λ̂ − Λ‖),   (2.4)

where H is given in Theorem 1, and as n → ∞,

√n (b_i(Λ̂) − b_i(Λ)) = √n ∑_{l=1, l≠i}^{p} b_l(Λ) b_l(Λ)^T (Λ̂ − Λ) b_i(Λ) / (λ_i(Λ) − λ_l(Λ)) + o_p(√n ‖Λ̂ − Λ‖)
                      = ∑_{l=1, l≠i}^{p} b_l(Λ) b_l(Λ)^T H b_i(Λ) / (λ_i(Λ) − λ_l(Λ)) + o_p(√n ‖Λ̂ − Λ‖),   (2.5)

where ‖Λ̂ − Λ‖ = ∑_{1≤i,j≤p} |a_{ij}| with a_{ij} denoting the (i, j)-th element of Λ̂ − Λ.

3. Estimation of the structural dimension

When we use SAVE, sequential tests for estimating the structural dimension may be inappropriate for the following reasons: asymptotic normality does not hold when slicing estimation is used, and the plug-in estimation of the limiting covariance matrix of the estimator is difficult to implement when kernel or spline approximation is applied. Zhu et al. (2006) built a bridge between the classical model selection criterion BIC and SIR to overcome this problem. Their modified BIC-type criterion can be extended to handle the SAVE case, since the consistency of their structural dimension estimator only requires convergence of the eigenvalues of the associated matrix.

Let Ω = Λ + I_p and Ω̂ = Λ̂ + I_p. Recall the definition of λ_i(A). Clearly, λ_i(Ω) = λ_i(Λ) + 1. Determining the structural dimension now becomes estimating K, the number of eigenvalues of Ω that are greater than 1. Let q̂ denote the number of λ_i(Ω̂) greater than 1. Zhu and Zhu (2007) defined the criterion

V(k) = (n/2) ∑_{i=1+min(q̂,k)}^{p} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂)) + (p − k) ln(n).   (3.1)

The estimator of the structural dimension K is then defined as the maximizer K̂ of V(k) over k ∈ {0, 1, ..., p − 1}. Since the term p ln(n) does not affect the maximization, it can simply be ignored. Based on our empirical studies, this criterion tends to give an incorrect estimate of the structural dimension with a relatively large probability when the sample size is small. This might be because the penalty term in (3.1) is sometimes problematic: the eigenvalues and the dimension p may not be comparable in magnitude. We therefore modify the criterion by rescaling both the leading term and the penalty term, and define G(·) as follows:

G(k) = n [ ∑_{i=1}^{k} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂)) ] / [ ∑_{i=1}^{p} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂)) ] − 2 ln(n) k^2 / p.   (3.2)

Note that we divide ∑_{i=1}^{k} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂)) by ∑_{i=1}^{p} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂)) so that the leading term does not exceed one. We also multiply 2k ln(n) by k/p, which does not exceed one either. The estimator of K is defined as the maximizer K̂ of G(k) over k ∈ {1, ..., p}, that is,

G(K̂) = max_{1≤k≤p} G(k).   (3.3)
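The criterion is straightforward to evaluate once Λ̂ is available. The following is a minimal sketch under our reading of (3.2), with an illustrative function name; it returns both the maximizer K̂ and the G(k) values.

```python
import numpy as np

def estimate_dimension(Lam, n):
    """Modified BIC-type criterion (3.2): with Omega_hat = Lambda_hat + I_p and
    eigenvalues l_1 >= ... >= l_p, maximize
      G(k) = n * sum_{i<=k}(ln l_i + 1 - l_i) / sum_{i<=p}(ln l_i + 1 - l_i)
             - 2 ln(n) k^2 / p
    over k = 1, ..., p.  Assumes Lambda_hat is not identically zero."""
    p = Lam.shape[0]
    lam = np.sort(np.linalg.eigvalsh(Lam + np.eye(p)))[::-1]   # eigenvalues of Omega_hat
    terms = np.log(lam) + 1.0 - lam                            # <= 0, = 0 iff l_i = 1
    total = terms.sum()
    G = np.array([n * terms[:k].sum() / total - 2.0 * np.log(n) * k**2 / p
                  for k in range(1, p + 1)])
    return int(np.argmax(G)) + 1, G
```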

The estimator through this modified BIC type criterion inherits the convergence property, as stated in the following theorem.

Theorem 3. Under the conditions of Theorem 1, K̂ converges to K in probability.

Not surprisingly, rescaling of the leading term and the penalty term results in better estimating accuracy, as we will see in thefollowing section.

4. Illustrative examples

In this section, we demonstrate our methodologies and theoretical results through the horse mussel dataset and a simulation study. As in other B-spline approximation settings, the first concern is the choice of the number of knots and their locations. Generalized cross-validation (GCV) is a frequently used method. However, GCV cannot be applied directly here, because we need a large number of knots to obtain an undersmoothing estimator of the regression curve; see Stute and Zhu (2005).


Table 1. The empirical R^2 with n = 400.

                            Model 1   Model 2   Model 3   Model 4
SAVE B-spline (σ^2 = 1.0)   0.9886    0.9870    0.9758    0.9645
SAVE Kernel   (σ^2 = 1.0)   0.9837    0.9824    0.9716    0.9498
SAVE B-spline (σ^2 = 1.2)   0.9882    0.9942    0.9819    0.9650
SAVE Kernel   (σ^2 = 1.2)   0.9865    0.9863    0.9840    0.9332
SAVE B-spline (σ^2 = 1.4)   0.9702    0.9689    0.9640    0.9522
SAVE Kernel   (σ^2 = 1.4)   0.9697    0.9665    0.9760    0.9308
SAVE B-spline (σ^2 = 1.6)   0.9704    0.9687    0.9598    0.9444
SAVE Kernel   (σ^2 = 1.6)   0.9645    0.9546    0.9557    0.9201
SAVE B-spline (σ^2 = 1.8)   0.9760    0.9669    0.9648    0.9305
SAVE Kernel   (σ^2 = 1.8)   0.9602    0.9424    0.9449    0.9106

Table 2. The empirical R^2 with n = 200.

                            Model 1   Model 2   Model 3   Model 4
SAVE B-spline (σ^2 = 1.0)   0.9759    0.9780    0.9743    0.9530
SAVE Kernel   (σ^2 = 1.0)   0.9724    0.9750    0.9565    0.9313
SAVE B-spline (σ^2 = 1.2)   0.9614    0.9643    0.9644    0.9497
SAVE Kernel   (σ^2 = 1.2)   0.9401    0.9511    0.9434    0.9106
SAVE B-spline (σ^2 = 1.4)   0.9663    0.9623    0.9522    0.9391
SAVE Kernel   (σ^2 = 1.4)   0.9280    0.9348    0.9223    0.9002
SAVE B-spline (σ^2 = 1.6)   0.9541    0.9481    0.9546    0.9349
SAVE Kernel   (σ^2 = 1.6)   0.8908    0.9046    0.9058    0.8958
SAVE B-spline (σ^2 = 1.8)   0.9479    0.9388    0.9451    0.9212
SAVE Kernel   (σ^2 = 1.8)   0.8647    0.8891    0.8866    0.8534

We therefore suggest a semi-data-driven algorithm that is easy to implement. Write k_{0n} for the number of knots selected by GCV. We then choose k_n to be the integer closest to k_{0n} × n^{2/15}, and take t_i to be the i/k_n-th quantile of the observed values of the response, so that the knots are uniformly scattered over its range.
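A minimal sketch of this rule, assuming k_{0n} has already been obtained from GCV and reading "the i/k_n-th quantile" as the interior levels i/k_n, i = 1, ..., k_n − 1; the function name is illustrative.

```python
import numpy as np

def choose_knots(y, k0n):
    """Semi-data-driven knot rule sketched in the text: inflate the GCV-selected
    number of knots k0n to k_n, the integer closest to k0n * n^(2/15)
    (undersmoothing), and place the interior knots at response quantiles."""
    n = len(y)
    kn = max(2, int(round(k0n * n ** (2.0 / 15.0))))
    levels = np.arange(1, kn) / kn            # i/k_n for i = 1, ..., k_n - 1
    return np.quantile(y, levels), kn
```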

4.1. Simulation study

In this subsection we compare the performance of the B-spline approximation with the kernel method. The efficiency of our proposed BIC-type criterion for determining the structural dimension is also shown here. The trace correlation coefficient R^2 = trace(P_B P_B̂)/K proposed by Ferre (1998) is adopted to measure the distance between the estimated CDR subspace span(B̂) and the true CDR subspace span(B); see Ferre (1998) for more details. The data are generated from the following four models with n = 200 and 400:

Model 1: y = (β^T x)^2 + ε;
Model 2: y = (β^T x)^4 + ε;
Model 3: y = (β_1^T x)^4 + (β_2^T x)^2 + ε;
Model 4: y = (β_1^T x)^2 + (β_2^T x)^2 × ε.

In these models, the predictor x and the error ε are independent and follow, respectively, the normal distributions N(0, I_10) and N(0, σ^2), where σ^2 = 1, 1.2, 1.4, 1.6, and 1.8. In the simulations, β = (1, 1, 0, 0, ..., 0)^T for Models 1 and 2, and β_1 = (1, 0, 0, 0, ..., 0)^T, β_2 = (0, 1, 0, 0, ..., 0)^T for Models 3 and 4.
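To make the workflow concrete, the following hypothetical end-to-end example generates data from Model 1, standardizes the predictors, applies the spline SAVE and knot-rule sketches given earlier, and computes the trace correlation R^2 = trace(P_B P_B̂)/K. It is an illustration built on those sketches, not the authors' simulation code; in particular, k_{0n} is fixed rather than selected by GCV.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 400, 10, 1
beta = np.zeros((p, K)); beta[0, 0] = beta[1, 0] = 1.0
X = rng.standard_normal((n, p))
y = (X @ beta[:, 0]) ** 2 + rng.standard_normal(n)        # Model 1, sigma^2 = 1

# standardize the predictors: Z = Sigma^{-1/2}(X - E X)
Sigma = np.cov(X, rowvar=False)
w, V = np.linalg.eigh(Sigma)
Sig_inv_half = V @ np.diag(w ** -0.5) @ V.T
Z = (X - X.mean(axis=0)) @ Sig_inv_half

# rescale the response to [0, 1] and pick quantile knots (k0n fixed at 3 here)
u = (y - y.min()) / (y.max() - y.min())
knots, kn = choose_knots(u, k0n=3)

Lam = save_spline(Z, u, knots, m=3)                       # quadratic splines
K_hat, _ = estimate_dimension(Lam, n)

# top-K eigenvectors of Lambda_hat, mapped back to the X scale
evals, evecs = np.linalg.eigh(Lam)
B_hat = Sig_inv_half @ evecs[:, np.argsort(evals)[::-1][:K]]

def proj(B):
    # projection matrix onto span(B)
    return B @ np.linalg.solve(B.T @ B, B.T)

R2 = np.trace(proj(beta) @ proj(B_hat)) / K
print(K_hat, R2)
```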

Zhu and Zhu (2007) showed the superiority of kernel estimation over slicing estimation with n = 400 and σ^2 = 1. To make a more comprehensive comparison, Tables 1 and 2 report the median of R^2 over 100 data replications, evaluating the performance of the B-spline approximation and of the kernel estimation for different n and σ^2. For the kernel estimation we choose the resulting bandwidth h_final suggested in Zhu and Zhu (2007), and for the B-spline approximation we use k_n as the working number of knots. As can be seen from Table 1, when the sample size n is large both methods work well, which verifies their consistency for large samples. Table 2 indicates that the B-spline approximation outperforms the kernel method for a moderate sample size of 200: the kernel smoother deteriorates quickly as σ^2 increases, while the B-spline approximation keeps much higher accuracy. We also tried different numbers of knots, from 2 to 6, for sample sizes 200 and 400, and found that the values of R^2 did not change much, so these results are not reported here. This indicates that, for the B-spline approximation of SAVE, the choice of the number of knots is not crucial.


Table 3. The frequency of the decisions of dimension with n = 400 when criterion G(·) is used.

          D = 1   D = 2   D = 3   D = 4   D = 5   D = 6   D = 7   D = 8   D = 9   D = 10
Model 1   1       0       0       0       0       0       0       0       0       0
Model 2   1       0       0       0       0       0       0       0       0       0
Model 3   0       1       0       0       0       0       0       0       0       0
Model 4   0.020   0.980   0       0       0       0       0       0       0       0

D stands for dimension.

Table 4. The frequency of the decisions of dimension with n = 400 when criterion V(·) is used.

          D = 0   D = 1   D = 2   D = 3   D = 4   D = 5   D = 6   D = 7   D = 8   D = 9
Model 1   0       0.970   0.030   0       0       0       0       0       0       0
Model 2   0       0.985   0.015   0       0       0       0       0       0       0
Model 3   0       0.080   0.915   0.005   0       0       0       0       0       0
Model 4   0       0.050   0.950   0       0       0       0       0       0       0

D stands for dimension.

Table 5. The frequency of the decisions of dimension with n = 200 when criterion G(·) is used.

          D = 1   D = 2   D = 3   D = 4   D = 5   D = 6   D = 7   D = 8   D = 9   D = 10
Model 1   1       0       0       0       0       0       0       0       0       0
Model 2   0.980   0.010   0.010   0       0       0       0       0       0       0
Model 3   0.080   0.920   0       0       0       0       0       0       0       0
Model 4   0.060   0.940   0       0       0       0       0       0       0       0

D stands for dimension.

Table 6. The frequency of the decisions of dimension with n = 200 when criterion V(·) is used.

          D = 0   D = 1   D = 2   D = 3   D = 4   D = 5   D = 6   D = 7   D = 8   D = 9
Model 1   0       0.915   0.085   0       0       0       0       0       0       0
Model 2   0       0.930   0.070   0       0       0       0       0       0       0
Model 3   0       0.080   0.920   0       0       0       0       0       0       0
Model 4   0       0.105   0.895   0       0       0       0       0       0       0

D stands for dimension.

Now let us investigate the efficiency of the two criteria for determining the structural dimension: V(·) proposed in Zhu and Zhu (2007) and G(·) proposed in this paper. Two hundred Monte Carlo samples are generated for n = 200 and 400 with σ^2 = 1.

From Tables 3 and 4 we can see that both criteria perform well when the sample size is large. When the sample size is moderate, Tables 5 and 6 indicate that the criterion V(·) tends to misspecify the true structural dimension with a much higher probability. This empirical study implies that the adjustment of V(·), namely rescaling the leading term and the penalty term, does help to enhance the efficacy.

4.2. Horse mussel data

This dataset consists of a sample of 201 measurements collected in December 1984 from five sites in the Marlborough Sounds, located off the north-east coast of New Zealand's South Island. The response variable is the muscle mass M, the edible portion of the mussel, in grams. The quantitative predictors are the shell width W in millimeters, the shell height H in millimeters, the shell length L in millimeters and the shell mass S in grams. The observations are assumed to be independent and identically distributed; see also the description in Cook (1998). Bura and Cook (2001) pointed out that the transformed variables ln(W) and ln(S) should be used in place of W and S to satisfy the linearity condition (1.1) and the constant variance condition (1.2). Our analysis is also based on this transformed dataset, and the quadratic B-spline approximation is applied. First, the G(k) values of (3.2) are reported in Fig. 1 to determine the structural dimension, where the knots are selected as suggested at the beginning of this section. According to Fig. 1, a proper estimate of the structural dimension is K̂ = 1 when we adopt the BIC-type criterion. This complies with Bura and Cook's (2001) finding. However, the sequential test gives different results when SIR is applied: it estimates the structural dimension to be 1 or 2, depending on the number of slices (Bura and Cook, 2001). We also tried several numbers of knots, k_n = 2, 3, 4, 5, and all led to the same conclusion. For this dataset, our decision on the structural dimension is insensitive to the choice of knots.


Fig. 1. The G(k) values versus k for the horse mussel data.

Fig. 2. The scatter plots of M versus the projected directions β_0^T X and β̂^T X.

To further study the relationship between the response M and the estimated direction, we report in Fig. 2 the scatter plots of M against the linear combinations β̂^T X and β_0^T X, respectively, where X = (ln(W), H, L, ln(S))^T, the direction β̂ = (−0.5250, −0.0360, −0.0171, 0.6272)^T was identified by the B-spline approximation with five knots, and β_0 = (0.028, −0.029, −0.593, 0.804)^T was suggested by Bura and Cook (2001). Similar trends can be seen in the two scatter plots, with only a scale difference on the horizontal axis. The correlation coefficient between β_0^T X and β̂^T X is 0.9763, so the scale difference is not crucial for further nonparametric statistical modeling.

Appendix A.

A.1. Assumptions

The following four conditions are required for Theorems 1–3.

(1) R_{kl}(y) = E(Z_k Z_l | Y = y) ∈ C^m[0, 1] and r_i(y) = E(Z_i | Y = y) ∈ C^m[0, 1], for 1 ≤ k, l ≤ p;
(2) E‖Z_k‖^4 < ∞, for all 1 ≤ k ≤ p;


(3) The B-spline basis N(·) satisfies max_{1≤i≤k_0} |h_{i+1} − h_i| = o(k_0^{-1}) and h / min_{1≤i≤k_0} h_i ≤ M, where h_i = t_i − t_{i−1}, h = max_{1≤i≤k_0} h_i, and M > 0 is a predetermined constant;
(4) As n → ∞, h ∼ n^{−c_1} with a positive number c_1 satisfying 1/(2m) < c_1 < 1/2, where the notation ∼ means that the two quantities have the same convergence order.

Remark A.1. Condition (1) concerns the smoothness of the inverse regression curve E(Z|Y = y) and is similar to the conditions used in Lemma 4 of Bura (2003). Condition (2) is necessary for the asymptotic normality of Λ̂. Condition (3) is commonly used in spline approximation; it ensures that M^{-1} < k_0 h < M, which is necessary for numerical computations. Condition (4) restricts the range of the number of knots for asymptotic normality. Although this range is fairly wide, undersmoothing is needed, because the optimal number of knots, of order O(n^{1/(2m+1)}), is not within this range. This phenomenon is essentially the same as for the kernel estimation in Zhu and Fang (1996).

A.2. Some lemmas

Since the proof of Theorem 1 is rather long, we split the main steps into several lemmas. The following lemmas show that the elements of Λ̂ can be written as U-statistics and then approximated by sums of i.i.d. random variables.

Lemma A.1. Suppose conditions (1)–(4) are satisfied. Then

(1/√n) ∑_{j=1}^{n} (T̂_1(y_j) − T_1(y_j)) (T̂_2(y_j) − T_2(y_j)) = o_p(1),

where each of T_1(·) and T_2(·) can be r_k(·) or R_{kl}(·) for any pair 1 ≤ k, l ≤ p.

Lemma A.2. Suppose conditions (1)–(4) are satisfied. Then

(1/√n) ∑_{j=1}^{n} (T(y_j) r̂_k(y_j) − T(y_j) r_k(y_j)) = (1/√n) ∑_{j=1}^{n} (z_{kj} T(y_j) − E(T(Y) r_k(Y))) + o_p(1),

where T(·) can be r_l(·), r_l(·) r_h^2(·), r_k(·) r_l(·) r_h(·) or r_h(·) R_{hl}(·).

Lemma A.3. Suppose conditions (1)–(4) are satisfied. Then

(1/√n) ∑_{j=1}^{n} (T(y_j) R̂_{kh}(y_j) − T(y_j) R_{kh}(y_j)) = (1/√n) ∑_{j=1}^{n} (z_{kj} z_{hj} T(y_j) − E(T(Y) R_{kh}(Y))) + o_p(1),

where T(·) can be 1, R_{hl}(·) or r_k(·) r_h(·).

The proofs of these lemmas are postponed to Appendix A.4.

A.3. Proof of theorems

Proof of Theorem 1. We only need to deal with the kl-th element Λ̂_{kl} of Λ̂. We divide the proof into five steps; in each step, the main task is to approximate the relevant term by a sum of i.i.d. random variables. Without confusion, we write E(·|Y = y) = E(·|y) and Ê(·|Y = y) = Ê(·|y) throughout the proof. The proof is carried out through the asymptotic linear representations of the U-statistics in the lemmas.

Step 1: By Lemma A.3, we have

(1/√n) ∑_{j=1}^{n} (R̂_{kl}(y_j) − E(R_{kl}(Y))) = (1/√n) ∑_{j=1}^{n} (z_{kj} z_{lj} + R_{kl}(y_j) − 2E(R_{kl}(Y))) + o_p(1).   (A.1)

Step 2: Using Lemma A.1, we obtain

(1/√n) ∑_{j=1}^{n} r̂_k(y_j) r̂_l(y_j) = (1/√n) ∑_{j=1}^{n} (r̂_k(y_j) − r_k(y_j) + r_k(y_j)) (r̂_l(y_j) − r_l(y_j) + r_l(y_j))
  = (1/√n) ∑_{j=1}^{n} r_l(y_j) (r̂_k(y_j) − r_k(y_j)) + (1/√n) ∑_{j=1}^{n} r_k(y_j) (r̂_l(y_j) − r_l(y_j)) + (1/√n) ∑_{j=1}^{n} r_k(y_j) r_l(y_j) + o_p(1).


By Lemma A.2, we have

(1/√n) ∑_{j=1}^{n} (r̂_k(y_j) r̂_l(y_j) − E(r_k(Y) r_l(Y))) = (1/√n) ∑_{j=1}^{n} (z_{kj} r_l(y_j) + z_{lj} r_k(y_j) + r_k(y_j) r_l(y_j) − 3E(r_k(Y) r_l(Y))) + o_p(1).   (A.2)

Step 3: By Lemma A.1, we have

(1/√n) ∑_{j=1}^{n} R̂_{kh}(y_j) R̂_{hl}(y_j) = (1/√n) ∑_{j=1}^{n} (R̂_{kh}(y_j) − R_{kh}(y_j)) R_{hl}(y_j) + (1/√n) ∑_{j=1}^{n} (R̂_{hl}(y_j) − R_{hl}(y_j)) R_{kh}(y_j) + (1/√n) ∑_{j=1}^{n} R_{kh}(y_j) R_{hl}(y_j) + o_p(1).

Therefore, Lemma A.3 yields

(1/√n) ∑_{j=1}^{n} (R̂_{kh}(y_j) R̂_{hl}(y_j) − E(R_{kh}(Y) R_{hl}(Y))) = (1/√n) ∑_{j=1}^{n} (z_{hj} z_{lj} R_{kh}(y_j) + z_{kj} z_{hj} R_{hl}(y_j) + R_{kh}(y_j) R_{hl}(y_j) − 3E(R_{kh}(Y) R_{hl}(Y))) + o_p(1).   (A.3)

Step 4: Use Lemma A.1 again to obtain

(1/√n) ∑_{j=1}^{n} r̂_h^2(y_j) r̂_k(y_j) r̂_l(y_j) = (1/√n) ∑_{j=1}^{n} ( r_h^2(y_j) r_l(y_j) (r̂_k(y_j) − r_k(y_j)) + 2 r_k(y_j) r_l(y_j) r_h(y_j) (r̂_h(y_j) − r_h(y_j)) + r_h^2(y_j) r_k(y_j) (r̂_l(y_j) − r_l(y_j)) + r_h^2(y_j) r_k(y_j) r_l(y_j) ) + o_p(1).

By Lemmas A.2 and A.3, we have

(1/√n) ∑_{j=1}^{n} (r̂_h^2(y_j) r̂_k(y_j) r̂_l(y_j) − E(r_h^2(Y) r_k(Y) r_l(Y))) = (1/√n) ∑_{j=1}^{n} ( z_{lj} r_h^2(y_j) r_k(y_j) + z_{kj} r_h^2(y_j) r_l(y_j) + 2 z_{hj} r_h(y_j) r_l(y_j) r_k(y_j) + r_h^2(y_j) r_k(y_j) r_l(y_j) − 5E(r_h^2(Y) r_k(Y) r_l(Y)) ) + o_p(1).   (A.4)

Step 5: Invoking Lemma A.1 again, we derive

(1/√n) ∑_{j=1}^{n} R̂_{hl}(y_j) r̂_k(y_j) r̂_h(y_j) = (1/√n) ∑_{j=1}^{n} ( (R̂_{hl}(y_j) − R_{hl}(y_j)) r_k(y_j) r_h(y_j) + (r̂_h(y_j) − r_h(y_j)) R_{hl}(y_j) r_k(y_j) + (r̂_k(y_j) − r_k(y_j)) r_h(y_j) R_{hl}(y_j) + r_k(y_j) r_h(y_j) R_{hl}(y_j) ) + o_p(1).

Therefore, by Lemmas A.2 and A.3, we achieve

(1/√n) ∑_{j=1}^{n} (R̂_{hl}(y_j) r̂_k(y_j) r̂_h(y_j) − E(R_{hl}(Y) r_k(Y) r_h(Y))) = (1/√n) ∑_{j=1}^{n} ( z_{hj} z_{lj} r_k(y_j) r_h(y_j) + z_{hj} r_k(y_j) R_{hl}(y_j) + z_{kj} r_h(y_j) R_{hl}(y_j) + R_{hl}(y_j) r_k(y_j) r_h(y_j) − 4E(R_{hl}(Y) r_k(Y) r_h(Y)) ) + o_p(1).   (A.5)

Similarly, we have

(1/√n) ∑_{j=1}^{n} (R̂_{kh}(y_j) r̂_h(y_j) r̂_l(y_j) − E(R_{kh}(Y) r_h(Y) r_l(Y))) = (1/√n) ∑_{j=1}^{n} ( z_{kj} z_{hj} r_h(y_j) r_l(y_j) + z_{hj} r_l(y_j) R_{kh}(y_j) + z_{lj} r_h(y_j) R_{kh}(y_j) + R_{kh}(y_j) r_h(y_j) r_l(y_j) − 4E(R_{kh}(Y) r_h(Y) r_l(Y)) ) + o_p(1).   (A.6)

Finally, combining the results (A.1)–(A.6), we see that Λ̂_{kl} can be written asymptotically as a sum of i.i.d. random variables. Hence, the Central Limit Theorem yields the desired result, with the asymptotic variance determined by (2.2). □


Proof of Theorem 2. The proof is almost identical to that of Theorem 2 of Zhu and Fang (1996), and hence the details are omitted. □

Proof of Theorem 3. Let K be the true dimension of the CDR subspace. As can be seen from Theorem 2, λ_k(Ω̂) − λ_k(Ω) = O_p(1/√n) when K ≥ k. Therefore, if K > k,

G(K) − G(k) = ( n [∑_{i=1}^{K} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] / [∑_{i=1}^{p} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] − 2 ln(n) K^2 / p ) − ( n [∑_{i=1}^{k} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] / [∑_{i=1}^{p} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] − 2 ln(n) k^2 / p )
  = n [∑_{i=k+1}^{K} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] / [∑_{i=1}^{p} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] + 2 ln(n) (k^2 − K^2) / p
  = n [∑_{i=k+1}^{K} (λ_i^2(Λ̂) + o(λ_i^2(Λ̂)))] / [∑_{i=1}^{p} (λ_i^2(Λ̂) + o(λ_i^2(Λ̂)))] + 2 ln(n) (k^2 − K^2) / p
  → n [∑_{i=k+1}^{K} λ_i^2(Λ)] / [∑_{i=1}^{p} λ_i^2(Λ)] + 2 ln(n) (k^2 − K^2) / p > 0.   (A.7)

If K < k,

G(K) − G(k) = ( n [∑_{i=1}^{K} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] / [∑_{i=1}^{p} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] − 2 ln(n) K^2 / p ) − ( n [∑_{i=1}^{k} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] / [∑_{i=1}^{p} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] − 2 ln(n) k^2 / p )
  = −n [∑_{i=K+1}^{k} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] / [∑_{i=1}^{p} (ln λ_i(Ω̂) + 1 − λ_i(Ω̂))] + 2 ln(n) (k^2 − K^2) / p
  = −n [∑_{i=K+1}^{k} (λ_i^2(Λ̂) + o(λ_i^2(Λ̂)))] / [∑_{i=1}^{p} (λ_i^2(Λ̂) + o(λ_i^2(Λ̂)))] + 2 ln(n) (k^2 − K^2) / p
  → 2 ln(n) (k^2 − K^2) / p > 0.   (A.8)

It follows from (A.7) and (A.8) that K̂ → K in probability. □

A.4. Proof of lemmas

Proof of Lemma A.1. We only need to prove the lemma when T_1(·) and T_2(·) are the same, because the case where T_1(·) and T_2(·) are different then follows easily from the Cauchy–Schwarz inequality. We prove the case T_1(·) = T_2(·) = r_k(·) only, because the proof for the other cases is essentially the same. Note that

(1/√n) ∑_{j=1}^{n} (r̂_k(y_j) − r_k(y_j))^2 = (1/√n) ∑_{j=1}^{n} r̂_k^2(y_j) − 2 (1/√n) ∑_{j=1}^{n} r̂_k(y_j) r_k(y_j) + (1/√n) ∑_{j=1}^{n} r_k^2(y_j).

First, we compute the expectation of the first term. To this end, we write it as a U-statistic:

(1/√n) ∑_{j=1}^{n} r̂_k^2(y_j) = (1/(√n n^2)) ∑_{j=1}^{n} ∑_{i_0=1}^{n} ∑_{j_0=1}^{n} N^T(y_j) G_{J,n}^{-1} N(y_{i_0}) z_{k i_0} N^T(y_j) G_{J,n}^{-1} N(y_{j_0}) z_{k j_0}
  = (1/(√n n^2)) ∑_{j=1}^{n} ∑_{i_0=1}^{n} ∑_{j_0=1}^{n} N^T(y_j) G^{-1}(q) N(y_{i_0}) z_{k i_0} N^T(y_j) G^{-1}(q) N(y_{j_0}) z_{k j_0} + o_p(1)
  = (√n / C_n^3) ∑_{i_0 < j < j_0} u_1(z_{k i_0}, y_{i_0}, z_{kj}, y_j, z_{k j_0}, y_{j_0}) + o_p(1) =: √n U_1 + o_p(1).


We then compute the expectation of U_1 as follows:

E(U_1) = E( N^T(Y_2) G^{-1}(q) N(Y_1) Z_{k1} N^T(Y_2) G^{-1}(q) N(Y_3) Z_{k3} )
       = E( Z_{k1} N^T(Y_1) G^{-1}(q) N(Y_2) N^T(Y_2) G^{-1}(q) N(Y_3) Z_{k3} )
       = E( r_k(Y_1) N^T(Y_1) ) E( G^{-1}(q) N(Y_3) r_k(Y_3) )
       = E( r_k^2(Y_1) ) + O(h^m).

Similarly, we have

(1/n) ∑_{j=1}^{n} E( r̂_k(y_j) r_k(y_j) ) = E( r_k^2(Y) ) + O(h^m).

The conclusion of Lemma A.1 then follows. Since the proofs of Lemmas A.2 and A.3 are similar, we only prove Lemma A.2 here. □

Proof of Lemma A.2. Write (1/√n) ∑_{j=1}^{n} T(y_j) r̂_k(y_j) as a U-statistic:

(1/√n) ∑_{j=1}^{n} T(y_j) r̂_k(y_j) = (1/√n) ∑_{j=1}^{n} ∑_{i=1}^{n} T(y_j) N^T(y_j) (n G_{J,n})^{-1} N(y_i) z_{ki}
  = (√n / C_n^2) ∑_{i<j} [ T(y_j) N^T(y_j) G^{-1}(q) N(y_i) z_{ki} + T(y_i) N^T(y_i) G^{-1}(q) N(y_j) z_{kj} ] / 2 + o_p(1)
  =: (√n / C_n^2) ∑_{i<j} u_n(z_{ki}, y_i, z_{kj}, y_j) + o_p(1) = √n U_n + o_p(1).   (A.9)

Write Δ = G_{J,n} − G(q). Lemma 6.4 of Zhou et al. (1998) yields ‖Δ‖ = o(h); hence the second equality above holds by invoking the relationship G_{J,n}^{-1} = G^{-1}(q) − G^{-1}(q) Δ (I + G^{-1}(q) Δ)^{-1} G^{-1}(q). The projection of the U-statistic in (A.9) is

Û_n = ∑_{j=1}^{n} E(U_n | z_{kj}, y_j) − (n − 1) E( u_n(Z_{k1}, Y_1, Z_{k2}, Y_2) ),

where u_n(·) is the kernel of the U-statistic U_n. We now show that Û_n is asymptotically equal to a sum of i.i.d. random variables. First,

E(U_n) = E( u_n(Z_{k1}, Y_1; Z_{k2}, Y_2) ) = E( T(Y_1) N^T(Y_1) G^{-1}(q) N(Y_2) Z_{k2} )
       = E( T(Y_1) N^T(Y_1) ) E( G^{-1}(q) N(Y_2) r_k(Y_2) )
       = E( T(Y) r_k(Y) ) + O(h^m),

where the last equality holds by invoking Theorem 2.1 of Zhou et al. (1998). Note that

E( u_n(z_{k1}, y_1; Z_{k2}, Y_2) | z_{k1}, y_1 ) = [ T(y_1) N^T(y_1) E(G^{-1}(q) N(Y_2) Z_{k2}) + E(T(Y_2) N^T(Y_2)) G^{-1}(q) N(y_1) z_{k1} ] / 2
  = (1/2) ( T(y_1) r_k(y_1) + z_{k1} T(y_1) ) + O(h^m).

Thus, the centered conditional expectation is

u_{n1}(z_{k1}, y_1) = E( u_n(z_{k1}, y_1; Z_{k2}, Y_2) | z_{k1}, y_1 ) − E( u_n(Z_{k1}, Y_1; Z_{k2}, Y_2) )
  = (1/2) ( T(y_1) r_k(y_1) + z_{k1} T(y_1) ) − E( T(Y) r_k(Y) ) + O(h^m).


Following the classical theory of U-statistics (see Serfling, 1980), we obtain

Û_n − E( u_n(Z_{k1}, Y_1, Z_{k2}, Y_2) ) = (2/n) ∑_{j=1}^{n} u_{n1}(z_{kj}, y_j)
  = (2/n) ∑_{j=1}^{n} [ (1/2) ( T(y_j) r_k(y_j) + z_{kj} T(y_j) ) − E( T(Y) r_k(Y) ) ] + O(h^m).   (A.10)

Now we verify that U_n can be approximated by its projection Û_n at the rate 1/(√n h), that is,

√n (U_n − Û_n) = O_p(1/(√n h)).   (A.11)

Similar to the proof of Lemma 5.3 of Zhu and Zhu (2007), we only need to show that E( u_n(Z_{k1}, Y_1, Z_{k2}, Y_2) )^2 = O(1/h^2), where u_n(·) is defined in (A.9). Clearly,

E( u_n(Z_{k1}, Y_1, Z_{k2}, Y_2) )^2 ≤ (1/2) [ E( N^T(Y_1) G^{-1}(q) N(Y_2) Z_{k2} T(Y_1) )^2 + E( N^T(Y_2) G^{-1}(q) N(Y_1) Z_{k1} T(Y_2) )^2 ].

It is easy to see that

E( N^T(Y_1) G^{-1}(q) N(Y_2) Z_{k2} T(Y_1) )^2 = E( T(Y_1)^2 (N^T(Y_1) G^{-1}(q) N(Y_2))^2 E(Z_{k2}^2 | Y_2) )
  = E( T(Y_1)^2 N^T(Y_1) G^{-1}(q) N(Y_2) N^T(Y_2) G^{-1}(q) N(Y_1) E(Z_{k2}^2 | Y_2) )
  = trace( E( T(Y_1)^2 G^{-1}(q) N(Y_1) N^T(Y_1) ) E( E(Z_{k2}^2 | Y_2) G^{-1}(q) N(Y_2) N^T(Y_2) ) )
  ≤ trace( E‖Z_1‖^4 E( N(Y_2) N^T(Y_2) G^{-1}(q) )^2 ) = O(1/h^2)

and

E( N^T(Y_2) G^{-1}(q) N(Y_1) Z_{k1} T(Y_2) )^2 = O(1/h^2),

where trace(A) denotes the sum of the diagonal elements of a matrix A. Therefore, (A.11) holds, which means that √n U_n and √n Û_n are asymptotically equivalent.

From the above results, together with (A.10) and (A.11), we have

(1/√n) ∑_{j=1}^{n} T(y_j) r̂_k(y_j) = √n U_n + o_p(1) = √n Û_n + o_p(1)
  = (1/√n) ∑_{j=1}^{n} ( T(y_j) r_k(y_j) + E(T(Y) N^T(Y)) G^{-1}(q) N(y_j) z_{kj} ) − √n E( T(Y) r_k(Y) ) + o_p(1).

Then

(1/√n) ∑_{j=1}^{n} ( T(y_j) r̂_k(y_j) − T(y_j) r_k(y_j) ) = (1/√n) ∑_{j=1}^{n} ( z_{kj} E(T(Y) N^T(Y)) G^{-1}(q) N(y_j) − E(T(Y) r_k(Y)) ) + o_p(1)
  = (1/√n) ∑_{j=1}^{n} ( z_{kj} T(y_j) − E(T(Y) r_k(Y)) ) + o_p(1).

The proof is finished. □

References

Bura, E., 2003. Using linear smoothers to assess the structural dimension of regressions. Statist. Sinica 13, 143–162.
Bura, E., Cook, R.D., 2001. Extending sliced inverse regression: the weighted chi-square test. J. Amer. Statist. Assoc. 96, 996–1003.
Cook, R.D., 1996. Graphics for regressions with a binary response. J. Amer. Statist. Assoc. 91, 983–992.
Cook, R.D., 1998. Regression Graphics: Ideas for Studying Regressions through Graphics. Wiley, New York.
Cook, R.D., 2000. SAVE: a method for dimension reduction and graphics in regression. Comm. Statist. Theory Methods 29, 2109–2121.
Cook, R.D., Weisberg, S., 1991. Discussion of "Sliced inverse regression for dimension reduction". J. Amer. Statist. Assoc. 86, 316–342.
Ferre, L., 1998. Determining the dimension in sliced inverse regression and related methods. J. Amer. Statist. Assoc. 93, 132–140.


Li, K.C., 1991. Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86, 316–342.
Li, Y.X., Zhu, L.X., 2007. Asymptotics for sliced average variance estimation. Ann. Statist. 35, 41–69.
Schumaker, L.L., 1981. Spline Functions. Wiley, New York.
Serfling, R., 1980. Approximation Theorems of Mathematical Statistics. Wiley, New York.
Stute, W., Zhu, L.X., 2005. Nonparametric checks for single-index models. Ann. Statist. 33, 1048–1083.
Zhou, S., Shen, X., Wolfe, D.A., 1998. Local asymptotics for regression splines and confidence regions. Ann. Statist. 26, 1760–1782.
Zhu, L.P., Zhu, L.X., 2007. On kernel method for sliced average variance estimation. J. Multivariate Anal. 98, 970–991.
Zhu, L.X., Fang, K.T., 1996. Asymptotics for kernel estimate of sliced inverse regression. Ann. Statist. 24, 1053–1068.
Zhu, L.X., Miao, B.Q., Peng, H., 2006. On sliced inverse regression with high-dimensional covariates. J. Amer. Statist. Assoc. 101, 630–643.
Zhu, L.X., Ng, K.W., 1995. Asymptotics of sliced inverse regression. Statist. Sinica 5, 727–736.
Zhu, L.X., Ohtaki, M., Li, Y.X., 2007. On hybrid methods of inverse regression-based algorithms. Comput. Statist. Data Anal. 51, 2621–2635.