An Overview on the Shrinkage Properties of Partial Least Squares Regression


Nicole Krämer¹

¹ TU Berlin, Department of Computer Science and Electrical Engineering, Franklinstr. 28/29, D-10587 Berlin

Summary

The aim of this paper is twofold. In the first part, we recapitulate the main results regarding the shrinkage properties of Partial Least Squares (PLS) regression. In particular, we give an alternative proof of the shape of the PLS shrinkage factors. It is well known that some of the factors are > 1. We discuss in detail the effect of shrinkage factors on the Mean Squared Error of linear estimators and argue that we cannot extend the results to PLS directly, as PLS is nonlinear. In the second part, we investigate the effect of shrinkage factors empirically. In particular, we point out that experiments on simulated and real-world data show that bounding the absolute value of the PLS shrinkage factors by 1 seems to lead to a lower Mean Squared Error.

Keywords: linear regression, biased estimators, mean squared error


1 Introduction

In this paper, we want to give a detailed overview of the shrinkage properties of Partial Least Squares (PLS) regression. It is well known (Frank & Friedman 1993) that we can express the PLS estimator obtained after m steps in the following way:
\[
\beta_{\mathrm{PLS}}^{(m)} = \sum_i f^{(m)}(\lambda_i)\, z_i ,
\]
where z_i is the component of the Ordinary Least Squares (OLS) estimator along the ith principal component of the covariance matrix X^t X and λ_i is the corresponding eigenvalue. The quantities f^(m)(λ_i) are called shrinkage factors. We show that these factors are determined by a tridiagonal matrix (which depends on the input-output data (X, y)) and can be calculated in a recursive way. Combining the results of Butler & Denham (2000) and Phatak & de Hoog (2002), we give a simpler and clearer proof of the shape of the shrinkage factors of PLS and derive some of their properties. In particular, we reproduce the fact that some of the values f^(m)(λ_i) are greater than 1. This was first proved by Butler & Denham (2000).

We argue that these "peculiar shrinkage properties" (Butler & Denham 2000) do not necessarily imply that the Mean Squared Error (MSE) of the PLS estimator is worse compared to the MSE of the OLS estimator. In the case of deterministic shrinkage factors, i.e. factors that do not depend on the output y, any value |f^(m)(λ_i)| > 1 is of course undesirable. But in the case of PLS, the shrinkage factors are stochastic: they also depend on y. In particular, bounding the absolute value of the shrinkage factors by 1 might not automatically yield a lower MSE, in disagreement with what was conjectured in e.g. Frank & Friedman (1993).

Having issued this warning, we explore whether bounding the shrinkage factors leads to a lower MSE or not. It is very difficult to derive theoretical results, as the quantities of interest, β_PLS^(m) and f^(m)(λ_i) respectively, depend on y in a complicated, nonlinear way. As a substitute, we study the problem on several artificial data sets and one real-world example. It turns out that in most cases, the MSE of the bounded version of PLS is indeed smaller than that of PLS.


2 Preliminaries

We consider the multivariate linear regression model

y = Xβ + ε (1)

with

cov(y) = σ² · I_n .  (2)

Here, I_n is the identity matrix of dimension n. The number of variables is p, the number of examples is n. For simplicity, we assume that X and y are scaled to have zero mean, so we do not have to worry about intercepts. We have

\[
X \in \mathbb{R}^{n\times p}, \quad A := X^t X \sim \operatorname{cov}(X) \in \mathbb{R}^{p\times p}, \quad y \in \mathbb{R}^n, \quad b := X^t y \sim \operatorname{cov}(X,y) \in \mathbb{R}^p .
\]

We set p* = rk(A) = rk(X). The singular value decomposition of X is of the form

\[
X = V \Sigma U^t
\]
with
\[
V \in \mathbb{R}^{n\times p}, \quad \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb{R}^{p\times p}, \quad U \in \mathbb{R}^{p\times p} .
\]
The columns of U and V are mutually orthogonal, that is, we have U^t U = I_p and V^t V = I_p.

We set λ_i = σ_i² and Λ = Σ². The eigendecomposition of A is
\[
A = U \Lambda U^t = \sum_{i=1}^{p} \lambda_i\, u_i u_i^t .
\]
The eigenvalues λ_i of A (and of any other matrix) are ordered in the following way:
\[
\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0 .
\]

The Moore-Penrose inverse of a matrix M is denoted by M−.


The Ordinary Least Squares (OLS) estimator β_OLS is the solution of the optimization problem
\[
\operatorname*{argmin}_{\beta} \lVert y - X\beta \rVert .
\]
If there is no unique solution (which is in general the case for p > n), the OLS estimator is the solution with minimal norm. Set
\[
s = \Sigma V^t y .  \qquad (3)
\]

The OLS estimator is given by the formula

βOLS =(XtX

)−Xty = UΛ−ΣV ty = UΛ−s =

p∗∑

i=1

vtiy√λi

ui .

We de�ne

zi =vt

iy√λi

ui .

This implies
\[
\beta_{\mathrm{OLS}} = \sum_{i=1}^{p^*} z_i .
\]
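The following NumPy sketch illustrates this decomposition on simulated data; the toy data and all variable names are assumptions made for illustration, not part of the paper.

```python
# A minimal sketch of the decomposition beta_OLS = sum_i z_i with
# z_i = (v_i^t y / sqrt(lambda_i)) u_i, using the thin SVD X = V Sigma U^t.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                               # centred predictors
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)
y -= y.mean()

# numpy returns X = V diag(sigma) U^t; its "u" is the paper's V, its "vh" is U^t
V, sigma, Ut = np.linalg.svd(X, full_matrices=False)
U = Ut.T
lam = sigma ** 2                                  # eigenvalues of A = X^t X

z = (V.T @ y / np.sqrt(lam))[:, None] * U.T       # row i holds the component z_i
beta_ols = z.sum(axis=0)

# sanity check against the pseudoinverse formula (X^t X)^- X^t y
print(np.allclose(beta_ols, np.linalg.pinv(X.T @ X) @ X.T @ y))   # True
```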

Set
\[
K^{(m)} := \left(A^0 b, A b, \ldots, A^{m-1} b\right) \in \mathbb{R}^{p\times m} .
\]
The columns of K^(m) are called the Krylov sequence of A and b. The space spanned by the columns of K^(m) is called the Krylov space of A and b and is denoted by 𝒦^(m). Krylov spaces are closely related to the Lanczos algorithm (Lanczos 1950), a method for approximating eigenvalues of the matrix A. We exploit the relationship between PLS and Krylov spaces in the subsequent sections. An excellent overview of the connections of PLS to the Lanczos method (and the conjugate gradient algorithm) can be found in Phatak & de Hoog (2002). Set

\[
\mathcal{M} := \{\lambda_i \mid s_i \ne 0\} = \{\lambda_i \ne 0 \mid v_i^t y \ne 0\}
\]
(the vector s is defined in (3)) and
\[
m^* := |\mathcal{M}| .
\]

It follows easily that m* ≤ p* = rk(X). The inequality is strict if A has non-zero eigenvalues of multiplicity > 1 or if there is a principal component v_i that is not correlated with y, i.e. v_i^t y = 0. The quantity m* is also called the grade of b with respect to A. We state a standard result on the dimension of the Krylov spaces associated to A and b.


Proposition 1. We have
\[
\dim \mathcal{K}^{(m)} =
\begin{cases}
m & m \le m^* \\
m^* & m > m^* .
\end{cases}
\]
In particular,
\[
\dim \mathcal{K}^{(m^*)} = \dim \mathcal{K}^{(m^*+1)} = \ldots = \dim \mathcal{K}^{(p)} = m^* .  \qquad (4)
\]

Finally, let us introduce the following notation. For any set S of vectors, we denote by P_S the projection onto the space that is spanned by S. It follows that
\[
P_S = S \left(S^t S\right)^- S^t .
\]

3 Partial Least Squares

We only give a sketchy introduction to the PLS method. More details can be found e.g. in Höskuldsson (1988) or Rosipal & Krämer (2006). The main idea is to extract m orthogonal components from the predictor space X and fit the response y to these components. In this sense, PLS is similar to Principal Components Regression (PCR). The difference is that PCR extracts components that explain the variance in the predictor space, whereas PLS extracts components that have a high covariance with y. The quantity m is called the number of PLS steps or the number of PLS components.

We now formalize this idea. The first latent component t_1 is a linear combination t_1 = X w_1 of the predictor variables. The vector w_1 is usually called the weight vector. We want to find a component with maximal covariance with y, that is, we want to compute
\[
w_1 = \operatorname*{argmax}_{\lVert w\rVert = 1} \operatorname{cov}(Xw, y) = \operatorname*{argmax}_{\lVert w\rVert = 1} w^t b .  \qquad (5)
\]

Using Lagrangian multipliers, we conclude that the solution w_1 is, up to a factor, equal to X^t y = b. Subsequent components t_2, t_3, … are chosen such that they maximize (5) and that all components are mutually orthogonal. We ensure orthogonality by deflating the original predictor variables X. That is, we only consider the part of X that is orthogonal to all components t_j for j < i:
\[
X_i = X - P_{t_1,\ldots,t_{i-1}} X .
\]
We then replace X by X_i in (5). This version of PLS is called the NIPALS algorithm (Wold 1975).


Algorithm 2 (NIPALS algorithm). After setting X_1 = X, the weight vectors w_i and the components t_i of PLS are determined by iteratively computing
\[
\begin{aligned}
w_i &= X_i^t y  &&\text{(weight vector)}\\
t_i &= X_i w_i  &&\text{(component)}\\
X_{i+1} &= X_i - P_{t_i} X_i  &&\text{(deflation)}
\end{aligned}
\]
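A direct transcription of Algorithm 2 into NumPy is sketched below. It is an illustration under the assumption of centred X and y, not the authors' reference implementation; the fitted values computed at the end anticipate the projection formula for the final estimator given next.

```python
import numpy as np

def nipals_pls(X, y, m):
    """Sketch of Algorithm 2: return weight vectors, components and fitted values."""
    Xi = X.copy()
    W, T = [], []
    for _ in range(m):
        w = Xi.T @ y                                  # weight vector w_i = X_i^t y
        t = Xi @ w                                    # component     t_i = X_i w_i
        Xi = Xi - np.outer(t, t @ Xi) / (t @ t)       # deflation     X_{i+1} = X_i - P_{t_i} X_i
        W.append(w)
        T.append(t)
    W, T = np.column_stack(W), np.column_stack(T)
    # the components are mutually orthogonal, so P_{t_1,...,t_m} y splits into a sum
    y_hat = sum(t * (t @ y) / (t @ t) for t in T.T)
    return W, T, y_hat

# toy usage
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 6)); X -= X.mean(axis=0)
y = X[:, 0] + 0.1 * rng.standard_normal(30); y -= y.mean()
W, T, y_hat = nipals_pls(X, y, m=3)
print(np.round(T.T @ T, 6))                           # off-diagonal entries are ~ 0
```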

The final estimator ŷ for Xβ is
\[
\hat y = P_{t_1,\ldots,t_m}\, y = \sum_{j=1}^{m} P_{t_j}\, y .
\]

The last equality follows as the PLS components t_1, …, t_m are mutually orthogonal. We denote by W^(m) the matrix that consists of the weight vectors w_1, …, w_m defined in Algorithm 2:
\[
W^{(m)} = (w_1, \ldots, w_m) .  \qquad (6)
\]

The PLS components t_j and the weight vectors are linked in the following way (see e.g. Höskuldsson (1988)):
\[
(t_1, \ldots, t_m) = X W^{(m)} R^{(m)}
\]
with an invertible bidiagonal matrix R^(m). Plugging this into the projection formula for ŷ above, we obtain
\[
\hat y = X W^{(m)} \left( \left(W^{(m)}\right)^t X^t X\, W^{(m)} \right)^- \left(W^{(m)}\right)^t X^t y .
\]

It can be shown (Helland 1988) that the space spanned by the vectors w_i (i = 1, …, m) equals the Krylov space 𝒦^(m) that is defined in Section 2. More precisely, W^(m) is an orthonormal basis of 𝒦^(m) that is obtained by a Gram–Schmidt procedure. This implies:

Proposition 3 (Helland 1988). The PLS estimator obtained after m steps can be expressed in the following way:
\[
\beta_{\mathrm{PLS}}^{(m)} = K^{(m)} \left[ \left(K^{(m)}\right)^t A\, K^{(m)} \right]^- \left(K^{(m)}\right)^t b .  \qquad (7)
\]

An equivalent expression is that the PLS estimator β_PLS^(m) for β is the solution of the constrained minimization problem
\[
\operatorname*{argmin}_{\beta} \lVert y - X\beta\rVert \quad \text{subject to } \beta \in \mathcal{K}^{(m)} .
\]
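As a numerical illustration of Proposition 3, the sketch below evaluates equation (7) on toy data and compares it with the solution of the constrained least squares problem; the data and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 40, 8, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n); y -= y.mean()

A, b = X.T @ X, X.T @ y
# Krylov matrix K^(m) = (b, Ab, ..., A^{m-1} b)
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])

# equation (7): beta_PLS^(m) = K [K^t A K]^- K^t b
beta_pls = K @ np.linalg.pinv(K.T @ A @ K) @ (K.T @ b)

# the same estimator as the least squares solution restricted to the Krylov space:
# beta = K gamma with gamma minimizing ||y - X K gamma||
gamma, *_ = np.linalg.lstsq(X @ K, y, rcond=None)
print(np.max(np.abs(beta_pls - K @ gamma)))        # ~ 0 up to rounding
```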


Of course, the orthonormal basis W^(m) of the Krylov space only exists if dim 𝒦^(m) = m, which might not be true for all m ≤ p. The maximal number for which this holds is m* (see Proposition 1). Note however that it follows from (4) that
\[
\mathcal{K}^{(m^*-1)} \subset \mathcal{K}^{(m^*)} = \mathcal{K}^{(m^*+1)} = \ldots = \mathcal{K}^{(p)} ,
\]
and the solution of the optimization problem does not change anymore. Hence there is no loss of generality if we make the assumption that
\[
\dim \mathcal{K}^{(m)} = m .  \qquad (8)
\]

Remark 4. We have
\[
\beta_{\mathrm{PLS}}^{(m^*)} = \beta_{\mathrm{OLS}} .
\]

Proof. This result is well known and is usually proven using the fact that, after the maximal number of steps, the vectors t_1, …, t_{m*} span the same space as the columns of X. Here we present an algebraic proof that exploits the relationship between PLS and Krylov spaces. We show that the OLS estimator is an element of 𝒦^(m*), that is, β_OLS = π_OLS(A) b for a polynomial π_OLS of degree ≤ m* − 1. We define this polynomial via the m* equations
\[
\pi_{\mathrm{OLS}}(\lambda_i) = \frac{1}{\lambda_i} , \quad \lambda_i \in \mathcal{M} .
\]
In matrix notation, this equals
\[
\pi_{\mathrm{OLS}}(\Lambda)\, s = \Lambda^- s .  \qquad (9)
\]
Using the decompositions of A and X introduced in Section 2, we conclude that
\[
\pi_{\mathrm{OLS}}(A)\, b = U \pi_{\mathrm{OLS}}(\Lambda) \Sigma V^t y = U \pi_{\mathrm{OLS}}(\Lambda)\, s \overset{(9)}{=} U \Lambda^- s = \beta_{\mathrm{OLS}} .
\]

Set
\[
D^{(m)} = \left(W^{(m)}\right)^t A\, W^{(m)} \in \mathbb{R}^{m\times m} .
\]

Proposition 5. The matrix D^(m) is symmetric and positive semidefinite. Furthermore, D^(m) is tridiagonal: d_ij = 0 for |i − j| ≥ 2.

Proof. The first two statements are obvious. Let i ≤ j − 2. As w_i ∈ 𝒦^(i), the vector A w_i lies in the subspace 𝒦^(i+1). As j > i + 1, the vector w_j is orthogonal to 𝒦^(i+1), in other words
\[
d_{ji} = \langle w_j, A w_i \rangle = 0 .
\]
As D^(m) is symmetric, we also have d_ij = 0, which proves the assertion.
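A quick numerical check of Proposition 5 is sketched below (illustrative data; the QR factorization plays the role of the Gram–Schmidt orthonormalization of the Krylov sequence):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 40, 8, 4
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n); y -= y.mean()

A, b = X.T @ X, X.T @ y
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
W, _ = np.linalg.qr(K)                  # orthonormal basis W^(m) of the Krylov space
D = W.T @ A @ W                         # D^(m)

mask = np.abs(np.subtract.outer(np.arange(m), np.arange(m))) >= 2
print(np.max(np.abs(D[mask])))          # entries with |i-j| >= 2 are ~ 0
```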


4 Tridiagonal matrices

We see in Section 6 that the matrices D^(m) and their eigenvalues determine the shrinkage factors of the PLS estimator. To prove this, we now list some properties of D^(m).

Definition 6. A symmetric tridiagonal matrix D is called unreduced if all subdiagonal entries are non-zero, i.e. d_{i,i+1} ≠ 0 for all i.

Theorem 7 (Parlett 1998). All eigenvalues of an unreduced symmetric tridiagonal matrix are distinct.

Set
\[
D^{(m)} =
\begin{pmatrix}
a_1 & b_1 & 0 & \cdots & 0\\
b_1 & a_2 & b_2 & \cdots & 0\\
\vdots & \ddots & \ddots & \ddots & \vdots\\
0 & 0 & \cdots & a_{m-1} & b_{m-1}\\
0 & 0 & \cdots & b_{m-1} & a_m
\end{pmatrix} .
\]

Proposition 8. If dim 𝒦^(m) = m, the matrix D^(m) is unreduced. More precisely, b_i > 0 for all i ∈ {1, …, m − 1}.

Proof. Set p_i = A^{i−1} b and denote by w_1, …, w_m the basis (6) obtained by Gram–Schmidt. Its existence is guaranteed as we assume that dim 𝒦^(m) = m. We have to show that
\[
b_i = \langle w_i, A w_{i-1} \rangle > 0 .
\]
As the length of w_i does not change the sign of b_i, we can assume that the vectors w_i are not normalized to have length 1. By definition,
\[
w_i = p_i - \sum_{k=1}^{i-1} \frac{\langle p_i, w_k\rangle}{\langle w_k, w_k\rangle}\, w_k .  \qquad (10)
\]
As the vectors w_i are pairwise orthogonal, it follows that
\[
\langle w_i, p_i \rangle = \langle w_i, w_i \rangle > 0 .
\]


We conclude that
\[
b_i = \langle w_i, A w_{i-1} \rangle
\overset{(10)}{=} \Bigl\langle w_i,\; A \Bigl( p_{i-1} - \sum_{k=1}^{i-2} \frac{\langle p_{i-1}, w_k\rangle}{\langle w_k, w_k\rangle}\, w_k \Bigr) \Bigr\rangle
= \langle w_i, p_i \rangle - \sum_{k=1}^{i-2} \frac{\langle p_{i-1}, w_k\rangle}{\langle w_k, w_k\rangle} \langle w_i, A w_k \rangle ,
\]
using A p_{i−1} = p_i. For k ≤ i − 2 the vector A w_k lies in 𝒦^(k+1) ⊆ 𝒦^(i−1), so ⟨w_i, A w_k⟩ = 0 (see the proof of Proposition 5). Hence
\[
b_i = \langle w_i, p_i \rangle = \langle w_i, w_i \rangle > 0 .
\]

Note that the matrix D^(m−1) is obtained from D^(m) by deleting the last column and row of D^(m). It follows that we can give a recursive formula for the characteristic polynomials
\[
\chi^{(m)} := \chi_{D^{(m)}}
\]
of D^(m). We have
\[
\chi^{(m)}(\lambda) = (a_m - \lambda)\, \chi^{(m-1)}(\lambda) - b_{m-1}^2\, \chi^{(m-2)}(\lambda) .  \qquad (11)
\]

We want to deduce properties of the eigenvalues of D^(m) and A and explore their relationship. Denote the eigenvalues of D^(m) by
\[
\mu_1^{(m)} > \ldots > \mu_m^{(m)} \ge 0 .  \qquad (12)
\]

Remark 9. All eigenvalues of D^(m*) are eigenvalues of A.

Proof. First note that
\[
A|_{\mathcal{K}^{(m^*)}} : \mathcal{K}^{(m^*)} \longrightarrow \mathcal{K}^{(m^*+1)} = \mathcal{K}^{(m^*)} .
\]
As the columns of the matrix W^(m*) form an orthonormal basis of 𝒦^(m*),
\[
D^{(m^*)} = \left(W^{(m^*)}\right)^t A\, W^{(m^*)}
\]
is the matrix that represents A|_{𝒦^(m*)} with respect to this basis. As any eigenvalue of A|_{𝒦^(m*)} is obviously an eigenvalue of A, the proof is complete.


The following theorem is a special form of the Cauchy Interlace Theorem. In this version, we use a general result from Parlett (1998) and exploit the tridiagonal structure of D^(m).

Theorem 10. Each interval
\[
\left[ \mu_{m-j}^{(m)},\, \mu_{m-(j+1)}^{(m)} \right]
\]
(j = 0, …, m − 2) contains a different eigenvalue of D^(m+k) (k ≥ 1). In addition, there is a different eigenvalue of D^(m+k) outside the open interval (μ_m^(m), μ_1^(m)).

This theorem ensures in particular that there is a different eigenvalue of A in each interval [μ_k^(m), μ_{k−1}^(m)] (apply the theorem with m + k = m* and use Remark 9). Theorem 10 holds independently of assumption (8).

Proof. By definition, for k ≥ 1,
\[
D^{(m+k)} =
\begin{pmatrix}
D^{(m-1)} & \bullet^t & 0\\
\bullet & a_m & *\\
0 & * & *
\end{pmatrix} .
\]
Here • = (0, …, 0, b_{m−1}), so
\[
D^{(m)} =
\begin{pmatrix}
D^{(m-1)} & \bullet^t\\
\bullet & a_m
\end{pmatrix} .
\]
An application of Theorem 10.4.1 in Parlett (1998) gives the desired result.

Lemma 11. If D^(m) is unreduced, the eigenvalues of D^(m) and the eigenvalues of D^(m−1) are distinct.

Proof. Suppose the two matrices have a common eigenvalue λ. It follows from (11) and the fact that D^(m) is unreduced that λ is also an eigenvalue of D^(m−2). Repeating this argument, we deduce that λ = a_1 is an eigenvalue of both D^(1) = (a_1) and D^(2), a contradiction, as
\[
0 = \chi^{(2)}(a_1) = -b_1^2 < 0 .
\]

Remark 12. In general it is not true that D^(m) and a submatrix D^(k) have distinct eigenvalues. Consider the case where a_i = c for all i. Using equation (11), we conclude that c is an eigenvalue of every submatrix D^(m) with m odd.

Proposition 13. If dim 𝒦^(m) = m, we have det(D^(m−1)) ≠ 0.


Proof. The matrix D^(m) is positive semidefinite, hence all eigenvalues of D^(m) and of its submatrix D^(m−1) are ≥ 0. In other words, det(D^(m−1)) ≠ 0 if and only if its smallest eigenvalue μ_{m−1}^(m−1) is > 0. Using Theorem 10, we have
\[
\mu_{m-1}^{(m-1)} \ge \mu_m^{(m)} \ge 0 .
\]
As dim 𝒦^(m) = m, the matrix D^(m) is unreduced, which implies that D^(m) and D^(m−1) have no common eigenvalues (see Lemma 11). We can therefore replace the first ≥ by >, i.e. the smallest eigenvalue of D^(m−1) is > 0.

It is well known that the matrices D^(m) are closely related to the so-called Rayleigh–Ritz procedure, a method that is used to approximate eigenvalues. For details consult e.g. Parlett (1998).

5 What is shrinkage?

We presented two estimators for the regression parameter β, OLS and PLS, which also define estimators for Xβ via
\[
\hat y_{\bullet} = X \beta_{\bullet} .
\]

One possibility to evaluate the quality of an estimator is to determine its Mean Squared Error (MSE). In general, the MSE of an estimator θ̂ for a vector-valued parameter θ is defined as
\[
\mathrm{MSE}\left(\hat\theta\right)
= E\left[ \operatorname{trace}\left( \left(\hat\theta - \theta\right)\left(\hat\theta - \theta\right)^t \right) \right]
= E\left[ \left(\hat\theta - \theta\right)^t \left(\hat\theta - \theta\right) \right]
= \left( E[\hat\theta] - \theta \right)^t \left( E[\hat\theta] - \theta \right)
+ E\left[ \left(\hat\theta - E[\hat\theta]\right)^t \left(\hat\theta - E[\hat\theta]\right) \right] .
\]

This is the well-known bias-variance decomposition of the MSE. The first part is the squared bias and the second part is the variance term. We start by investigating the class of linear estimators, i.e. estimators that are of the form θ̂ = Sy for a matrix S that does not depend on y. It follows immediately from the regression model (1) and (2) that for a linear estimator,
\[
E\left[\hat\theta\right] = S X \beta , \qquad \operatorname{var}\left[\hat\theta\right] = \sigma^2 \operatorname{trace}\left(S S^t\right) .
\]

The OLS estimators are linear:
\[
\beta_{\mathrm{OLS}} = \left(X^t X\right)^- X^t y , \qquad
\hat y_{\mathrm{OLS}} = X \left(X^t X\right)^- X^t y .
\]


Note that the estimator ŷ_OLS is simply P_X y, the projection of y onto the space that is spanned by the columns of X. The estimator ŷ_OLS is unbiased, as
\[
E\left[\hat y_{\mathrm{OLS}}\right] = P_X X \beta = X\beta .
\]

The estimator β_OLS is only unbiased if β ∈ range(X^t X):
\[
E\left[\beta_{\mathrm{OLS}}\right] = E\left[\left(X^t X\right)^- X^t y\right] = \left(X^t X\right)^- X^t E[y] = \left(X^t X\right)^- X^t X \beta = \beta .
\]

Let us now have a closer look at the variance term. It follows directly from trace(P_X P_X^t) = rk(X) = p* that
\[
\operatorname{var}\left(\hat y_{\mathrm{OLS}}\right) = \sigma^2 p^* .
\]

For β_OLS we have
\[
\left(X^t X\right)^- X^t \left( \left(X^t X\right)^- X^t \right)^t = \left(X^t X\right)^- = U \Lambda^- U^t ,
\]
hence
\[
\operatorname{var}\left(\beta_{\mathrm{OLS}}\right) = \sigma^2 \sum_{i=1}^{p^*} \frac{1}{\lambda_i} .  \qquad (13)
\]

We conclude that the MSE of the estimator β_OLS depends on the eigenvalues λ_1, …, λ_{p*} of A = X^t X. Small eigenvalues of A correspond to directions in X that have a low variance. Equation (13) shows that if some eigenvalues are small, the variance of β_OLS is very high, which leads to a high MSE.

One possibility to (hopefully) decrease the MSE is to modify the OLS estimator by shrinking the directions of the OLS estimator that are responsible for a high variance. This of course introduces bias. We shrink the OLS estimator in the hope that the increase in bias is small compared to the decrease in variance.

In general, a shrinkage estimator for β is of the form
\[
\beta_{\mathrm{shr}} = \sum_{i=1}^{p^*} f(\lambda_i)\, z_i ,
\]

where f is some real-valued function. The values f(λi) are called shrinkagefactors. Examples are


• Principal Components Regression:
\[
f(\lambda_i) =
\begin{cases}
1 & \text{$i$th principal component included}\\
0 & \text{otherwise}
\end{cases}
\]
and

• Ridge Regression:
\[
f(\lambda_i) = \frac{\lambda_i}{\lambda_i + \lambda}
\]
with λ > 0 the Ridge parameter (a numerical check follows below).
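As a sketch (toy data, illustrative names), one can verify numerically that the Ridge estimator (X^t X + λI)^{-1} X^t y is exactly the OLS estimator with its components z_i rescaled by λ_i/(λ_i + λ):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, ridge = 50, 5, 2.0                              # "ridge" is the Ridge parameter lambda
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n); y -= y.mean()

V, sigma, Ut = np.linalg.svd(X, full_matrices=False)
U, lam = Ut.T, sigma ** 2
z = (V.T @ y / np.sqrt(lam))[:, None] * U.T           # OLS components z_i

beta_ridge = np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ y)
beta_shrunk = (lam / (lam + ridge)) @ z               # sum_i f(lambda_i) z_i
print(np.max(np.abs(beta_ridge - beta_shrunk)))       # ~ 0
```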

We illustrate in section 6 that PLS is a shrinkage estimator as well. It turnsout that the shrinkage behavior of PLS regression is rather complicated.

Let us investigate in which way the MSE of the estimator is influenced by the shrinkage factors. If the shrinkage estimators are linear, i.e. the shrinkage factors do not depend on y, this is an easy task. Let us first write the shrinkage estimator in matrix notation. We have
\[
\beta_{\mathrm{shr}} = S_{\mathrm{shr}}\, y = U \Sigma^- D_{\mathrm{shr}} V^t y .
\]
The diagonal matrix D_shr has entries f(λ_i). The shrinkage estimator for y is
\[
\hat y_{\mathrm{shr}} = X S_{\mathrm{shr}}\, y = V \Sigma \Sigma^- D_{\mathrm{shr}} V^t y .
\]

We calculate the variance of these estimators.

\[
\operatorname{trace}\left(S_{\mathrm{shr}} S_{\mathrm{shr}}^t\right)
= \operatorname{trace}\left(U \Sigma^- D_{\mathrm{shr}} \Sigma^- D_{\mathrm{shr}} U^t\right)
= \operatorname{trace}\left(\Sigma^- D_{\mathrm{shr}} \Sigma^- D_{\mathrm{shr}}\right)
= \sum_{i=1}^{p^*} \frac{\left(f(\lambda_i)\right)^2}{\lambda_i}
\]
and
\[
\operatorname{trace}\left(X S_{\mathrm{shr}} S_{\mathrm{shr}}^t X^t\right)
= \operatorname{trace}\left(V \Sigma\Sigma^- D_{\mathrm{shr}} \Sigma\Sigma^- D_{\mathrm{shr}} V^t\right)
= \operatorname{trace}\left(\Sigma\Sigma^- D_{\mathrm{shr}} \Sigma\Sigma^- D_{\mathrm{shr}}\right)
= \sum_{i=1}^{p^*} \left(f(\lambda_i)\right)^2 .
\]

Next, we calculate the bias of the two shrinkage estimators. We have
\[
E\left[S_{\mathrm{shr}}\, y\right] = S_{\mathrm{shr}} X \beta = U \Sigma D_{\mathrm{shr}} \Sigma^- U^t \beta .
\]


It follows that
\[
\operatorname{bias}^2\left(\beta_{\mathrm{shr}}\right)
= \left(E[S_{\mathrm{shr}} y] - \beta\right)^t \left(E[S_{\mathrm{shr}} y] - \beta\right)
= \left(U^t \beta\right)^t \left(\Sigma D_{\mathrm{shr}} \Sigma^- - I_p\right)^t \left(\Sigma D_{\mathrm{shr}} \Sigma^- - I_p\right) \left(U^t \beta\right)
= \sum_{i=1}^{p^*} \left(f(\lambda_i) - 1\right)^2 \left(u_i^t \beta\right)^2 .
\]
Replacing S_shr by X S_shr, it is easy to show that
\[
\operatorname{bias}^2\left(\hat y_{\mathrm{shr}}\right) = \sum_{i=1}^{p} \lambda_i \left(f(\lambda_i) - 1\right)^2 \left(u_i^t \beta\right)^2 .
\]

Proposition 14. For the shrinkage estimators β_shr and ŷ_shr defined above, we have
\[
\mathrm{MSE}\left(\beta_{\mathrm{shr}}\right)
= \sum_{i=1}^{p^*} \left(f(\lambda_i) - 1\right)^2 \left(u_i^t\beta\right)^2
+ \sigma^2 \sum_{i=1}^{p^*} \frac{\left(f(\lambda_i)\right)^2}{\lambda_i} ,
\]
\[
\mathrm{MSE}\left(\hat y_{\mathrm{shr}}\right)
= \sum_{i=1}^{p^*} \lambda_i \left(f(\lambda_i) - 1\right)^2 \left(u_i^t\beta\right)^2
+ \sigma^2 \sum_{i=1}^{p^*} \left(f(\lambda_i)\right)^2 .
\]

If the shrinkage factors are deterministic, i.e. they do not depend on y, any value f(λ_i) ≠ 1 increases the bias. Values |f(λ_i)| < 1 decrease the variance, whereas values |f(λ_i)| > 1 increase the variance. Hence an absolute value > 1 is always undesirable. The situation might be different for stochastic shrinkage factors. We discuss this in the following section.
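For deterministic factors, the formula of Proposition 14 can be checked directly by simulation. The sketch below does this for Ridge regression with a fixed parameter; the data-generating setup is an illustrative assumption, not the simulation design of Section 7.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, noise_sd, ridge = 50, 5, 0.5, 2.0
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
beta = rng.standard_normal(p)

V, sigma, Ut = np.linalg.svd(X, full_matrices=False)
U, lam = Ut.T, sigma ** 2
f = lam / (lam + ridge)                               # deterministic shrinkage factors

# closed-form MSE of beta_shr from Proposition 14
mse_formula = np.sum((f - 1) ** 2 * (U.T @ beta) ** 2) \
    + noise_sd ** 2 * np.sum(f ** 2 / lam)

# Monte Carlo estimate of E || beta_shr - beta ||^2 under model (1)-(2)
reps, err = 5000, 0.0
for _ in range(reps):
    y = X @ beta + noise_sd * rng.standard_normal(n)
    beta_shr = np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ y)
    err += np.sum((beta_shr - beta) ** 2)
print(mse_formula, err / reps)                        # the two values agree closely
```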

Note that there is a different notion of shrinkage, namely that the L₂-norm of an estimator is smaller than the L₂-norm of the OLS estimator. Why is this a desirable property? Let us again consider the case of linear estimators. Set θ̂_i = S_i y for i = 1, 2. We have
\[
\left\lVert \hat\theta_i \right\rVert_2^2 = y^t S_i^t S_i y .
\]
The property that for all y ∈ R^n
\[
\left\lVert \hat\theta_1 \right\rVert_2 \le \left\lVert \hat\theta_2 \right\rVert_2
\]
is equivalent to the condition that S_1^t S_1 − S_2^t S_2 is negative semidefinite. The trace of a negative semidefinite matrix is ≤ 0. Furthermore, trace(S_i^t S_i) = trace(S_i S_i^t), so we conclude that
\[
\operatorname{var}\left(\hat\theta_1\right) \le \operatorname{var}\left(\hat\theta_2\right) .
\]


It is known (de Jong 1995) that
\[
\left\lVert \beta_{\mathrm{PLS}}^{(1)} \right\rVert_2 \le \left\lVert \beta_{\mathrm{PLS}}^{(2)} \right\rVert_2 \le \ldots \le \left\lVert \beta_{\mathrm{PLS}}^{(m^*)} \right\rVert_2 = \left\lVert \beta_{\mathrm{OLS}} \right\rVert_2 .
\]

6 The shrinkage factors of PLS

In this section, we give a simpler and clearer proof of the shape of the shrinkage factors of PLS. Basically, we combine the results of Butler & Denham (2000) and Phatak & de Hoog (2002). In the rest of the section, we assume that m < m*, as the shrinkage factors for β_PLS^(m*) = β_OLS are trivial, i.e. f^(m*)(λ_i) = 1.

By definition of the PLS estimator, β_PLS^(m) ∈ 𝒦^(m). Hence there is a polynomial π of degree ≤ m − 1 with β_PLS^(m) = π(A) b. Recall that the eigenvalues of D^(m) are denoted by μ_i^(m). Set
\[
f^{(m)}(\lambda) := 1 - \prod_{i=1}^{m} \left(1 - \frac{\lambda}{\mu_i^{(m)}}\right)
= 1 - \frac{1}{\chi^{(m)}(0)}\, \chi^{(m)}(\lambda) .
\]
As f^(m)(0) = 0, there is a polynomial π^(m) of degree m − 1 such that
\[
f^{(m)}(\lambda) = \lambda \cdot \pi^{(m)}(\lambda) .  \qquad (14)
\]

Proposition 15 (Phatak & de Hoog 2002). Suppose that m < m*. We have
\[
\beta_{\mathrm{PLS}}^{(m)} = \pi^{(m)}(A) \cdot b .
\]

Proof (Phatak & de Hoog 2002). Using either equation (14) or the Cayley–Hamilton theorem (recall Proposition 13), it is easy to prove that
\[
\left(D^{(m)}\right)^{-1} = \pi^{(m)}\left(D^{(m)}\right) .
\]

We plug this into equation (7) and obtain
\[
\beta_{\mathrm{PLS}}^{(m)} = W^{(m)}\, \pi^{(m)}\!\left( \left(W^{(m)}\right)^t A\, W^{(m)} \right) \left(W^{(m)}\right)^t b .
\]
Recall that the columns of W^(m) form an orthonormal basis of 𝒦^(m). It follows that W^(m)(W^(m))^t is the operator that projects onto the space 𝒦^(m). In particular,
\[
W^{(m)} \left(W^{(m)}\right)^t A^j b = A^j b
\]


for j = 0, …, m − 1. This implies that
\[
\beta_{\mathrm{PLS}}^{(m)} = \pi^{(m)}(A)\, b .
\]

Using (14), we can immediately conclude the following corollary.

Corollary 16 (Phatak & de Hoog 2002). Suppose that dim 𝒦^(m) = m. If we denote by z_i the component of β_OLS along the ith eigenvector of A, then
\[
\beta_{\mathrm{PLS}}^{(m)} = \sum_{i=1}^{p^*} f^{(m)}(\lambda_i) \cdot z_i ,
\]
with f^(m)(λ) defined in (14).
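A numerical sketch of Corollary 16 follows (illustrative data; the PLS fit uses the Krylov formula (7)): the shrinkage factors are computed from the eigenvalues of D^(m) and recombined with the OLS components z_i.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, m = 40, 6, 2
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n); y -= y.mean()

A, b = X.T @ X, X.T @ y
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
W, _ = np.linalg.qr(K)                              # orthonormal basis of the Krylov space
mu = np.linalg.eigvalsh(W.T @ A @ W)                # eigenvalues of D^(m)

V, sigma, Ut = np.linalg.svd(X, full_matrices=False)
U, lam = Ut.T, sigma ** 2
z = (V.T @ y / np.sqrt(lam))[:, None] * U.T         # OLS components z_i

f = 1.0 - np.prod(1.0 - lam[:, None] / mu[None, :], axis=1)   # f^(m)(lambda_i)
beta_from_factors = f @ z

beta_pls = K @ np.linalg.pinv(K.T @ A @ K) @ (K.T @ b)        # equation (7)
print(np.round(f, 3))                               # typically some factors exceed 1
print(np.max(np.abs(beta_from_factors - beta_pls))) # ~ 0
```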

We now show that some of the shrinkage factors of PLS are ≠ 1.

Theorem 17 (Butler & Denham 2000). For each m ≤ m* − 1, we can decompose the interval [λ_p, λ_1] into m + 1 disjoint intervals¹
\[
I_1 \le I_2 \le \ldots \le I_{m+1}
\]
such that
\[
f^{(m)}(\lambda_i)
\begin{cases}
\le 1 & \lambda_i \in I_j \text{ and } j \text{ odd}\\
\ge 1 & \lambda_i \in I_j \text{ and } j \text{ even.}
\end{cases}
\]

Proof. Set g^(m)(λ) = 1 − f^(m)(λ). It follows from the definition of f^(m) that the zeros of g^(m)(λ) are μ_m^(m), …, μ_1^(m). As D^(m) is unreduced, all these zeros are distinct, and g^(m) changes sign at each of them. Set μ_0^(m) = λ_1 and μ_{m+1}^(m) = λ_p, and let I_1 ≤ … ≤ I_{m+1} be the intervals determined by these points, ordered from below. By definition, g^(m)(0) = 1 > 0, and 0 lies below the smallest zero μ_m^(m). Hence g^(m)(λ) is non-negative on the intervals I_j if j is odd and non-positive on the intervals I_j if j is even. It follows from Theorem 10 that all intervals I_j contain at least one eigenvalue λ_i of A.

In general it is not true that f^(m)(λ_i) ≠ 1 for all λ_i and m = 1, …, m*. Using the example in Remark 12 and the fact that f^(m)(λ_i) = 1 is equivalent to the condition that λ_i is an eigenvalue of D^(m), it is easy to construct a counterexample.

¹ We say that I_j ≤ I_k if sup I_j ≤ inf I_k.


Using some of the results of Section 4, we can however deduce that some factors are indeed ≠ 1. As all eigenvalues of D^(m*−1) and D^(m*) are distinct (Lemma 11), we see that f^(m*−1)(λ_i) ≠ 1 for all i. In particular,
\[
f^{(m^*-1)}(\lambda_1)
\begin{cases}
> 1 & m^* \text{ even}\\
< 1 & m^* \text{ odd.}
\end{cases}
\]
More generally, using Lemma 11, we conclude that f^(m−1)(λ_i) = 1 and f^(m)(λ_i) = 1 is not possible simultaneously. In practice, i.e. calculated on a data set, the shrinkage factors seem to be ≠ 1 all of the time. Furthermore,
\[
0 \le f^{(m)}(\lambda_p) < 1 .
\]
To prove this, we set g^(m)(λ) = 1 − f^(m)(λ). We have g^(m)(0) = 1. Furthermore, the smallest positive zero of g^(m)(λ) is μ_m^(m), and it follows from Theorem 10 and Lemma 11 that λ_p < μ_m^(m). Hence g^(m)(λ_p) ∈ ]0, 1].

Using Theorem 10, more precisely
\[
\lambda_p \le \mu_i^{(m)} \le \lambda_i ,
\]
it is possible to bound the terms
\[
1 - \frac{\lambda_i}{\mu_i^{(m)}} .
\]
From this we can derive bounds on the shrinkage factors. We do not pursue this further; readers who are interested in the bounds should consult Lingjaerde & Christophersen (2000). Instead, we have a closer look at the MSE of the PLS estimator.

In Section 5, we showed that a value |f^(m)(λ_i)| > 1 is not desirable, as both the bias and the variance of the estimator increase. Note however that in the case of PLS, the factors f^(m)(λ_i) are stochastic: they depend on y, in a nonlinear way. The variance of the PLS estimator along the ith principal component is
\[
\operatorname{var}\left( f^{(m)}(\lambda_i) \cdot \frac{v_i^t y}{\sqrt{\lambda_i}} \right)
\]
with both f^(m)(λ_i) and v_i^t y/√λ_i depending on y.


Among others, Frank & Friedman (1993) propose to truncate the shrinkage factors of the PLS estimator in the following way. Set
\[
\tilde f^{(m)}(\lambda_i) =
\begin{cases}
+1 & f^{(m)}(\lambda_i) > +1\\
-1 & f^{(m)}(\lambda_i) < -1\\
f^{(m)}(\lambda_i) & \text{otherwise}
\end{cases}
\]
and define a new estimator:
\[
\beta_{\mathrm{TRN}}^{(m)} := \sum_{i=1}^{p^*} \tilde f^{(m)}(\lambda_i)\, z_i .  \qquad (15)
\]
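A compact sketch of both estimators, combining the pieces used above (Krylov basis, eigenvalues of D^(m), OLS components), is given below; the function name and the assumption that X is centred, has full column rank and that m < m* are illustrative, and this is not the authors' code.

```python
import numpy as np

def pls_and_truncated(X, y, m):
    """Sketch: return (beta_PLS^(m), beta_TRN^(m)) for centred, full-column-rank X and m < m*."""
    A, b = X.T @ X, X.T @ y
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
    W, _ = np.linalg.qr(K)
    mu = np.linalg.eigvalsh(W.T @ A @ W)              # eigenvalues of D^(m)

    V, sigma, Ut = np.linalg.svd(X, full_matrices=False)
    U, lam = Ut.T, sigma ** 2
    z = (V.T @ y / np.sqrt(lam))[:, None] * U.T       # OLS components z_i

    f = 1.0 - np.prod(1.0 - lam[:, None] / mu[None, :], axis=1)
    f_trn = np.clip(f, -1.0, 1.0)                     # truncated factors of equation (15)
    return f @ z, f_trn @ z                           # beta_PLS^(m), beta_TRN^(m)
```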

If the shrinkage factors are deterministic numbers, this will improve the MSE (cf. Section 5). But in the case of stochastic shrinkage factors, the situation might be different. Let us suppose for a moment that f^(m)(λ_i) = √λ_i / (v_i^t y). It follows that
\[
0 = \operatorname{var}\left( f^{(m)}(\lambda_i) \cdot \frac{v_i^t y}{\sqrt{\lambda_i}} \right)
\le \operatorname{var}\left( \tilde f^{(m)}(\lambda_i) \cdot \frac{v_i^t y}{\sqrt{\lambda_i}} \right) ,
\]
so it is not clear whether the truncated estimator TRN leads to a lower MSE, which is conjectured e.g. in Frank & Friedman (1993).

The assumption that f^(m)(λ_i) = √λ_i / (v_i^t y) is of course purely hypothetical. It is not clear whether the shrinkage factors behave this way. It is hard if not infeasible to derive statistical properties of the PLS estimator or its shrinkage factors, as they depend on y in a complicated, nonlinear way. As an alternative, we compare the two different estimators on different data sets.

7 Experiments

In this section, we explore the difference between the methods PLS and TRN. We investigate several artificial datasets and one real-world example.

Simulation

We compare the MSE of the two methods, PLS and truncated PLS, on 27 different artificial data sets. We use a setting similar to the one in Frank & Friedman (1993). For each data set, the number of examples is n = 50. We consider three different numbers of predictor variables:
\[
p = 5, 40, 100 .
\]


The input data X is chosen according to a multivariate normal distribution with zero mean and covariance matrix C. We consider three different covariance matrices:
\[
C_1 = I_p , \qquad
(C_2)_{ij} = \frac{1}{|i - j| + 1} , \qquad
(C_3)_{ij} =
\begin{cases}
1 & i = j\\
0.7 & i \ne j .
\end{cases}
\]

The matrices C_1, C_2 and C_3 correspond to no, moderate and high collinearity respectively. The regression vector β is a randomly chosen vector β ∈ {0, 1}^p. In addition, we consider three different signal-to-noise ratios:
\[
\mathrm{stnr} = \sqrt{\frac{\operatorname{var}(X\beta)}{\sigma^2}} = 1, 3, 7 .
\]

This yields 3 · 3 · 3 = 27 different parameter settings. For each setting, we estimate the MSE of the two methods: for k = 1, …, K = 200, we generate y according to (1) and (2). We determine for each method and each m the respective estimate β̂_k and define
\[
\mathrm{MSE}\left(\beta\right) = \frac{1}{K} \sum_{k=1}^{K} \left(\hat\beta_k - \beta\right)^t \left(\hat\beta_k - \beta\right) .
\]

If there are more predictor variables than examples, this approach is not sensible, as the true regression vector β is not identifiable: different regression vectors β_1 ≠ β_2 can lead to Xβ_1 = Xβ_2. Hence for p = 100, we estimate the MSE of ŷ for the two methods. We display the estimated MSE of the method TRN as a fraction of the estimated MSE of the method PLS, i.e. for each m we display
\[
\mathrm{MSE\text{-}RATIO} = \frac{\mathrm{MSE}\left(\beta_{\mathrm{TRN}}^{(m)}\right)}{\mathrm{MSE}\left(\beta_{\mathrm{PLS}}^{(m)}\right)} .
\]
(As already mentioned, we display the MSE-RATIO for ŷ in the case p = 100.) The results are displayed in Figures 1, 2 and 3. In order to have a compact representation, we consider the averaged MSE-RATIOs for different parameter settings. E.g. we fix a degree of collinearity (say high collinearity) and display the averaged MSE-RATIO over the three different signal-to-noise ratios. The results for all 27 data sets are shown in the tables in the appendix.
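The following sketch reproduces the structure of this simulation for a single parameter setting (p = 5, covariance C_2, stnr = 3). It reuses the pls_and_truncated helper sketched after equation (15); the seed, the guard against a degenerate β and the exact numbers are assumptions for illustration only and are not meant to reproduce the reported values.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, K, stnr = 50, 5, 200, 3.0
C2 = 1.0 / (np.abs(np.subtract.outer(np.arange(p), np.arange(p))) + 1.0)
beta = rng.integers(0, 2, size=p).astype(float)      # beta in {0,1}^p
if not beta.any():
    beta[0] = 1.0                                    # guard against the all-zero draw

X = rng.multivariate_normal(np.zeros(p), C2, size=n)
X -= X.mean(axis=0)
signal = X @ beta
noise_sd = np.sqrt(signal.var() / stnr ** 2)         # stnr = sqrt(var(X beta) / sigma^2)

m_values = range(1, p)
sq_err = np.zeros((2, len(m_values)))                # rows: PLS, TRN
for _ in range(K):
    y = signal + noise_sd * rng.standard_normal(n)
    y -= y.mean()
    for j, m in enumerate(m_values):
        b_pls, b_trn = pls_and_truncated(X, y, m)    # helper sketched in Section 6
        sq_err[0, j] += np.sum((b_pls - beta) ** 2)
        sq_err[1, j] += np.sum((b_trn - beta) ** 2)
print(sq_err[1] / sq_err[0])                         # estimated MSE-RATIO per m
```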


Figure 1: MSE-RATIO for p = 5. The figures show the averaged MSE-RATIO for different parameter settings. Left: comparison for high (straight line), moderate (dotted line) and no (dashed line) collinearity. Right: comparison for stnr 1 (straight line), 3 (dotted line) and 7 (dashed line).

Figure 2: MSE-RATIO for p = 40.

There are several observations: The MSE of TRN is lower almost all of the time. The decrease of the MSE is particularly large if the number of components m is small, but > 1. For larger m, the difference decreases. This is not surprising, as for large m, the difference between the PLS estimator and the OLS estimator decreases. Hence we expect the difference between TRN and PLS to become smaller. The reduction of the MSE is particularly prominent in complex situations, i.e. in situations with collinearity in X or with a low signal-to-noise ratio.


Figure 3: MSE-RATIO for p = 100. In this case, we display the MSE-RATIO for ŷ instead of β. Only the first 20 components are displayed.

Another feature which cannot be deduced from Figures 1, 2 and 3, but from the tables in the appendix, is the fact that the optimal numbers of components
\[
m^{\mathrm{opt}}_{\mathrm{PLS}} = \operatorname*{argmin}_m \mathrm{MSE}\left(\beta_{\mathrm{PLS}}^{(m)}\right) , \qquad
m^{\mathrm{opt}}_{\mathrm{TRN}} = \operatorname*{argmin}_m \mathrm{MSE}\left(\beta_{\mathrm{TRN}}^{(m)}\right)
\]
are equal almost all of the time. This is also true if we consider the MSE of ŷ. We can benefit from this if we want to select an optimal model for truncated PLS. We return to this subject in Section 8.

Real world data

In this example, we consider the near infrared spectra (NIR) of n = 171 meat samples that are measured at p = 100 different wavelengths from 850 to 1050 nm. This data set is taken from the StatLib datasets archive and can be downloaded from http://lib.stat.cmu.edu/datasets/tecator. The task is to predict the fat content of a meat sample on the basis of its NIR spectrum. We choose this dataset as PLS is widely used in the chemometrics field. In this type of application, we usually observe a lot of predictor variables which are highly correlated. We estimate the MSE of the two methods PLS and truncated PLS by computing the 10-fold cross-validated error of the two estimators. The results are displayed in Figure 4. Again, TRN is better almost all of the time, although the difference is small. Note furthermore that the optimal numbers of components are almost identical for the two methods: we have m^opt_PLS = 15 and m^opt_TRN = 16.


Figure 4: 10-fold cross-validated test error for the Tecator data set. The straight line corresponds to PLS, the dashed line corresponds to truncated PLS.

8 Discussion

We saw in Section 7 that bounding the absolute value of the PLS shrinkage factors by one seems to improve the MSE of the estimator. So should we now discard PLS and always use TRN instead? There might be (at least) two objections. Firstly, it would be somewhat rash to rely on the results of a small-scale simulation study. Secondly, TRN is computationally more expensive than PLS: we need the full singular value decomposition of X, and in each step, we have to compute the PLS estimator and adjust its shrinkage factors "by hand". However, the experiments suggest that it can be worthwhile to compare PLS and truncated PLS. We pointed out in Section 7 that the two methods do not seem to differ much in terms of the optimal number of components. In order to reduce the computational costs of truncated PLS, we therefore suggest the following strategy. We first compute the PLS model on a training set and choose the optimal number of components with the help of a model selection criterion. In a second step, we truncate the shrinkage factors of the optimal model. We then use a validation set in order to quantify the difference between PLS and TRN and choose the method with the lower validation error.

References

Butler, N. & Denham, M. (2000), 'The Peculiar Shrinkage Properties of Partial Least Squares Regression', Journal of the Royal Statistical Society Series B 62(3), 585–593.

de Jong, S. (1995), 'PLS shrinks', Journal of Chemometrics 9, 323–326.

Frank, I. & Friedman, J. (1993), 'A Statistical View of some Chemometrics Regression Tools', Technometrics 35, 109–135.

Helland, I. (1988), 'On the Structure of Partial Least Squares Regression', Communications in Statistics, Simulation and Computation 17(2), 581–607.

Höskuldsson, A. (1988), 'PLS Regression Methods', Journal of Chemometrics 2, 211–228.

Lanczos, C. (1950), 'An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators', Journal of Research of the National Bureau of Standards 45, 225–280.

Lingjaerde, O. & Christophersen, N. (2000), 'Shrinkage Structures of Partial Least Squares', Scandinavian Journal of Statistics 27, 459–473.

Parlett, B. (1998), The Symmetric Eigenvalue Problem, Society for Industrial and Applied Mathematics.

Phatak, A. & de Hoog, F. (2002), 'Exploiting the Connection between PLS, Lanczos, and Conjugate Gradients: Alternative Proofs of some Properties of PLS', Journal of Chemometrics 16, 361–367.

Rosipal, R. & Krämer, N. (2006), 'Overview and Recent Advances in Partial Least Squares', in 'Subspace, Latent Structure and Feature Selection Techniques', Lecture Notes in Computer Science, Springer, 34–51.

Wold, H. (1975), 'Path Models with Latent Variables: The NIPALS Approach', in H. B. et al., ed., 'Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building', Academic Press, 307–357.


A Appendix: Results of the simulation study

We display the results of the simulation study that is described in Section 7. The following tables show the MSE-RATIO for β as well as for ŷ. In addition to the MSE ratio, we display the optimal number of components for each method. It is interesting to see that the two quantities are the same almost all of the time.

collinearity  no      no      no      med.    med.    med.    high    high    high
stnr          1       3       7       1       3       7       1       3       7
 1            0.833   0.861   0.676   0.958   1.000   0.993   1.000   0.999   1.000
 2            0.980   0.976   0.975   0.995   0.938   0.864   0.847   0.965   0.866
 3            1.000   0.993   1.001   0.969   0.960   0.993   0.954   0.980   0.967
 4            1.000   1.001   0.999   0.988   1.000   1.002   0.997   0.993   0.992
mopt PLS      2       5       2       2       4       3       1       2       5
mopt TRN      2       5       2       2       4       3       1       2       5

Table 1: MSE-RATIO of β for p = 5. The first two rows display the setting of the parameters. The rows entitled 1–4 display the MSE ratio for the respective number of components.

collinearity  no      no      no      med.    med.    med.    high    high    high
stnr          1       3       7       1       3       7       1       3       7
 1            0.775   0.780   0.570   0.919   1.000   0.970   1.004   0.995   0.999
 2            0.978   0.972   0.9697  0.994   0.882   0.786   0.828   0.951   0.823
 3            1.001   0.990   1.001   0.969   0.967   0.992   0.960   0.977   0.973
 4            1.000   1.001   0.999   0.990   1.000   1.001   0.997   0.996   0.993
mopt PLS      3       5       3       2       4       4       1       2       5
mopt TRN      2       5       2       2       4       4       1       2       3

Table 2: MSE-RATIO of ŷ for p = 5.


collinearity  no      no      no      med.    med.    med.    high    high    high
stnr          1       3       7       1       3       7       1       3       7
 1            0.929   0.963   0.972   0.98    0.998   0.989   1.000   1.000   1.000
 2            0.938   0.959   0.977   0.922   0.91    0.978   0.789   0.793   0.792
 3            0.907   0.952   0.981   0.875   0.91    0.945   0.849   0.843   0.849
 4            0.905   0.933   0.971   0.879   0.913   0.912   0.857   0.864   0.868
 5            0.901   0.942   0.954   0.879   0.924   0.898   0.870   0.883   0.879
 6            0.898   0.942   0.945   0.878   0.915   0.891   0.882   0.890   0.893
 7            0.892   0.926   0.949   0.887   0.906   0.891   0.891   0.895   0.898
 8            0.899   0.926   0.956   0.892   0.904   0.895   0.897   0.897   0.903
 9            0.908   0.933   0.955   0.897   0.910   0.895   0.903   0.902   0.904
10            0.913   0.938   0.951   0.900   0.916   0.898   0.902   0.899   0.901
11            0.913   0.937   0.947   0.902   0.919   0.907   0.906   0.901   0.902
12            0.917   0.931   0.944   0.909   0.919   0.917   0.908   0.904   0.906
13            0.924   0.932   0.946   0.919   0.92    0.925   0.913   0.907   0.914
14            0.933   0.939   0.946   0.927   0.917   0.936   0.921   0.911   0.922
15            0.94    0.945   0.95    0.933   0.916   0.936   0.928   0.916   0.931
16            0.949   0.945   0.951   0.935   0.918   0.941   0.938   0.922   0.936
17            0.956   0.945   0.954   0.939   0.922   0.945   0.944   0.926   0.936
18            0.961   0.944   0.959   0.943   0.931   0.95    0.946   0.930   0.935
19            0.968   0.946   0.964   0.946   0.934   0.958   0.953   0.939   0.936
20            0.973   0.951   0.973   0.949   0.935   0.962   0.961   0.947   0.939
21            0.977   0.958   0.977   0.954   0.936   0.966   0.968   0.955   0.943
22            0.98    0.965   0.981   0.961   0.94    0.973   0.972   0.962   0.948
23            0.984   0.97    0.984   0.968   0.945   0.98    0.976   0.967   0.950
24            0.987   0.976   0.988   0.975   0.948   0.983   0.98    0.970   0.953
25            0.989   0.98    0.99    0.978   0.953   0.987   0.981   0.973   0.959
26            0.992   0.985   0.993   0.982   0.959   0.991   0.984   0.977   0.966
27            0.994   0.989   0.996   0.986   0.966   0.992   0.987   0.981   0.975
28            0.995   0.991   0.997   0.988   0.973   0.994   0.99    0.985   0.984
29            0.996   0.993   0.998   0.99    0.978   0.995   0.993   0.988   0.988
30            0.997   0.994   0.999   0.992   0.982   0.996   0.995   0.991   0.99
31            0.998   0.995   0.999   0.994   0.985   0.997   0.996   0.992   0.993
32            0.998   0.996   0.999   0.996   0.99    0.998   0.996   0.994   0.995
33            0.999   0.997   1.000   0.996   0.991   0.999   0.997   0.995   0.996
34            0.999   0.998   1.000   0.997   0.993   0.999   0.998   0.996   0.997
35            0.999   0.999   1.000   0.999   0.994   0.999   0.998   0.997   0.998
36            1.000   1.000   1.000   0.999   0.996   0.999   0.999   0.998   0.998
37            1.000   1.000   1.000   0.999   0.998   1.000   0.999   0.998   0.999
38            1.000   1.000   1.000   1.000   0.999   1.000   0.999   0.999   0.999
39            1.000   1.000   1.000   1.000   0.999   1.000   1.000   0.999   1.000
mopt PLS      1       3       5       1       2       2       1       1       1
mopt TRN      1       3       5       1       2       2       1       1       1

Table 3: MSE-RATIO of β for p = 40.


collinearity  no      no      no      med.    med.    med.    high    high    high
stnr          1       3       7       1       3       7       1       3       7
 1            0.781   0.797   0.791   0.877   0.983   0.924   1.013   1.004   1.001
 2            0.870   0.857   0.853   0.785   0.702   0.868   0.673   0.684   0.680
 3            0.853   0.899   0.914   0.776   0.818   0.853   0.778   0.772   0.778
 4            0.874   0.891   0.896   0.818   0.836   0.838   0.81    0.818   0.822
 5            0.889   0.92    0.893   0.839   0.891   0.846   0.835   0.855   0.856
 6            0.898   0.942   0.921   0.844   0.884   0.859   0.862   0.881   0.881
 7            0.897   0.938   0.929   0.876   0.902   0.88    0.886   0.898   0.898
 8            0.923   0.941   0.943   0.886   0.898   0.896   0.9     0.906   0.914
 9            0.924   0.944   0.960   0.904   0.916   0.901   0.915   0.92    0.917
10            0.935   0.958   0.961   0.913   0.93    0.915   0.915   0.914   0.921
11            0.943   0.967   0.959   0.922   0.937   0.916   0.924   0.92    0.927
12            0.954   0.967   0.958   0.929   0.942   0.938   0.932   0.926   0.931
13            0.959   0.967   0.965   0.941   0.95    0.942   0.939   0.933   0.937
14            0.961   0.961   0.966   0.948   0.949   0.954   0.947   0.942   0.942
15            0.97    0.969   0.977   0.954   0.953   0.96    0.953   0.948   0.949
16            0.975   0.971   0.976   0.964   0.962   0.967   0.961   0.954   0.957
17            0.979   0.976   0.983   0.968   0.962   0.974   0.967   0.957   0.957
18            0.982   0.981   0.985   0.972   0.966   0.979   0.968   0.960   0.966
19            0.986   0.985   0.988   0.976   0.969   0.980   0.974   0.965   0.970
20            0.989   0.987   0.991   0.977   0.970   0.983   0.979   0.972   0.974
21            0.991   0.99    0.992   0.980   0.973   0.985   0.984   0.977   0.978
22            0.993   0.99    0.994   0.984   0.979   0.988   0.988   0.982   0.981
23            0.995   0.992   0.996   0.987   0.98    0.991   0.990   0.986   0.983
24            0.996   0.993   0.997   0.989   0.982   0.993   0.992   0.987   0.984
25            0.996   0.995   0.997   0.99    0.983   0.994   0.993   0.989   0.985
26            0.997   0.996   0.998   0.992   0.986   0.996   0.994   0.991   0.987
27            0.998   0.997   0.999   0.994   0.990   0.997   0.995   0.993   0.989
28            0.999   0.997   0.999   0.995   0.991   0.998   0.996   0.994   0.991
29            0.999   0.998   0.999   0.996   0.992   0.998   0.997   0.996   0.994
30            0.999   0.999   1.000   0.997   0.993   0.999   0.998   0.997   0.994
31            0.999   0.999   1.000   0.998   0.994   0.999   0.998   0.997   0.996
32            0.999   0.999   1.000   0.998   0.995   0.999   0.999   0.998   0.997
33            1.000   0.999   1.000   0.998   0.996   0.999   0.999   0.998   0.998
34            1.000   0.999   1.000   0.999   0.997   1.000   0.999   0.999   0.998
35            1.000   1.000   1.000   0.999   0.998   1.000   0.999   0.999   0.999
36            1.000   1.000   1.000   0.999   0.998   1.000   1.000   0.999   0.999
37            1.000   1.000   1.000   1.000   0.998   1.000   1.000   0.999   0.999
38            1.000   1.000   1.000   1.000   0.999   1.000   1.000   0.999   1.000
39            1.000   1.000   1.000   1.000   0.999   1.000   1.000   1.000   1.000
mopt PLS      1       3       9       1       1       2       1       1       1
mopt TRN      1       3       4       1       2       2       1       1       1

Table 4: MSE-RATIO of ŷ for p = 40.


collinearity  no      no      no      med.    med.    med.    high    high    high
stnr          1       3       7       1       3       7       1       3       7
 1            0.845   0.763   0.825   0.862   0.839   0.906   1.017   1.002   1.000
 2            0.813   0.892   0.864   0.753   0.832   0.847   0.695   0.695   0.694
 3            0.863   0.884   0.881   0.788   0.837   0.859   0.806   0.808   0.812
 4            0.908   0.898   0.918   0.852   0.852   0.864   0.866   0.861   0.865
 5            0.933   0.940   0.954   0.889   0.888   0.900   0.900   0.903   0.903
 6            0.954   0.953   0.960   0.900   0.905   0.915   0.926   0.931   0.931
 7            0.964   0.967   0.979   0.927   0.935   0.930   0.951   0.953   0.953
 8            0.976   0.972   0.988   0.942   0.942   0.950   0.968   0.968   0.970
 9            0.984   0.982   0.993   0.959   0.963   0.961   0.979   0.968   0.979
10            0.99    0.990   0.995   0.969   0.970   0.970   0.980   0.979   0.981
11            0.994   0.991   0.998   0.976   0.979   0.978   0.987   0.987   0.988
12            0.996   0.994   0.998   0.982   0.987   0.984   0.991   0.992   0.993
13            0.997   0.995   0.999   0.988   0.991   0.990   0.994   0.994   0.995
14            0.998   0.997   0.999   0.991   0.994   0.992   0.996   0.997   0.997
15            0.999   0.998   1.000   0.994   0.995   0.994   0.997   0.998   0.998
16            0.999   0.999   1.000   0.996   0.997   0.996   0.998   0.998   0.999
17            1.000   0.999   1.000   0.997   0.998   0.997   0.999   0.999   0.999
18            1.000   0.999   1.000   0.998   0.999   0.998   0.999   0.999   0.999
19            1.000   1.000   1.000   0.999   0.999   0.999   0.999   1.000   1.000
20            1.000   1.000   1.000   0.999   0.999   0.999   1.000   1.000   1.000
mopt PLS      1       2       5       1       1       2       1       1       1
mopt TRN      1       1       3       1       1       2       1       1       1

Table 5: MSE-RATIO of ŷ for p = 100. We only display the results for the first 20 components, as the MSE-RATIO equals 1 (up to 4 digits after the decimal point) for the remaining components.