The Degrees of Freedom of Partial Least Squares Regression
Dr. Nicole Krämer
TU München
5th ESSEC-SUPELEC Research Workshop, May 20, 2011
My talk is about ...
... the statistical analysis of Partial Least Squares Regression.
1. intrinsic complexity of PLS
2. comparison of regression methods
3. model selection
4. variable selection based on confidence intervals
Example: Near Infrared Spectroscopy
Predict
Y, the percentage of water in meat, based on
X, its near infrared spectrum.
- unknown linear relationship
  f(x) = β₀ + ⟨β, x⟩ = β₀ + Σ_{j=1}^p β_j x^(j)
- We observe y_i ≈ f(x_i), i = 1, ..., n.
- centered data X = (x_1ᵀ, ..., x_nᵀ)ᵀ ∈ ℝ^(n×p) and y = (y_1, ..., y_n) ∈ ℝⁿ.
Partial Least Squares in one Slide
- Partial Least Squares (PLS) =
  1. supervised dimensionality reduction
  2. + least squares regression
[Diagram: supervised dimensionality reduction maps X (n × p) to the components T (n × m); T and y (n × 1) then enter a least squares regression.]
- The PLS components T have maximal covariance with the response variable y.
- The m ≪ p components T are used as new predictor variables in a least-squares fit.
Partial Least Squares Algorithm
Partial Least Squares (PLS) =
1. supervised dimensionality reduction
2. + least squares regression
PLS components T have maximal covariance with y.
Algorithm (NIPALS)
X_1 = X. For i = 1, ..., m (m is the model parameter):
1. w_i ∝ X_iᵀy (maximizes the covariance: cov(X_i w_i, y)/(wᵀw) = w_iᵀX_iᵀy/(wᵀw))
2. t_i ∝ X_i w_i (latent component)
3. X_{i+1} = X_i − t_i t_iᵀ X_i (enforces orthogonality)
Return T = (t_1, ..., t_m) and W = (w_1, ..., w_m).
The regression coefficients are β̂_m = W(TᵀXW)⁻¹Tᵀy.
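A minimal base-R sketch of this algorithm (the names pls.nipals, Tm, etc. are illustrative, not from any package; X and y are assumed centered):

pls.nipals <- function(X, y, m) {
  n <- nrow(X); p <- ncol(X)
  W  <- matrix(0, p, m)  # weight vectors w_1, ..., w_m
  Tm <- matrix(0, n, m)  # latent components t_1, ..., t_m
  Xi <- X
  for (i in 1:m) {
    wi <- crossprod(Xi, y)               # w_i prop. to X_i' y (maximizes covariance)
    wi <- wi / sqrt(sum(wi^2))
    ti <- Xi %*% wi                      # latent component t_i
    ti <- ti / sqrt(sum(ti^2))
    Xi <- Xi - ti %*% crossprod(ti, Xi)  # deflation enforces orthogonality
    W[, i] <- wi; Tm[, i] <- ti
  }
  beta <- W %*% solve(crossprod(Tm, X %*% W), crossprod(Tm, y))  # W (T'XW)^{-1} T'y
  list(T = Tm, W = W, beta = beta)
}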
Degrees of Freedom: Why are they important?
Degrees of Freedom (DoF)
1. capture the intrinsic complexity of a regression method,
   Y_i = f(x_i) + ε_i,  ε_i ∼ N(0, σ²),
2. are used for model selection:
   test error ≈ training error + complexity penalty (DoF)
Examples:
- Bayesian Information Criterion (see the sketch below)
  BIC = ‖y − ŷ‖²/n + (log(n) var(ε)/n) · DoF
- Akaike Information Criterion, Minimum Description Length, ...
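In code, the criterion above is a one-liner (a sketch with hypothetical argument names; sigma2 stands for the estimate of var(ε), which the talk later obtains from the residuals and the DoF):

bic <- function(y, yhat, dof, sigma2) {
  n <- length(y)
  # ||y - yhat||^2 / n + log(n) * var(eps) / n * DoF
  sum((y - yhat)^2) / n + log(n) * sigma2 / n * dof
}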
Definition: Degrees of Freedom
Assumption: The regression method is linear, i.e.
ŷ = H y, with H independent of y.
Degrees of Freedom
The Degrees of Freedom of a linear fitting method are
DoF = trace(H).
Examples
- Principal Components Regression with m components:
  DoF(m) = 1 + m
- Ridge Regression, Smoothing Splines, ... (see the sketch below)
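For instance, for Ridge Regression the hat matrix is H = X(XᵀX + λI)⁻¹Xᵀ, so trace(H) follows from the singular values of X. A minimal sketch (the function name is illustrative; add 1 for a fitted intercept):

ridge.dof <- function(X, lambda) {
  d <- svd(X)$d              # singular values of X
  sum(d^2 / (d^2 + lambda))  # trace(H) = sum_j d_j^2 / (d_j^2 + lambda)
}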
Naive Degrees of Freedom for PLS?
- Recall: PLS is not linear in y:
  ŷ_m = ȳ + T(TᵀT)⁻¹Tᵀ y,  where H := T(TᵀT)⁻¹Tᵀ depends on y.
→ Degrees of Freedom ≠ 1 + trace(H) = 1 + m.
- If we ignore the nonlinearity, we obtain
  DoF_naive(m) = 1 + m.
Degrees of Freedom for PLS (K. & Braun, 2007; K. & Sugiyama, 2011)
The generalized Degrees of Freedom of a regression method are
  DoF = E_Y[ trace(∂ŷ/∂y) ].
This coincides with the previous definition if the method is linear.
Proposition
An unbiased estimate of the Degrees of Freedom of PLS is
  DoF(m) = trace(∂ŷ_m/∂y).
We need to compute (the trace of) the first derivative.
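Before the closed-form details, the estimate can be checked numerically: perturb each y_i, refit, and accumulate the diagonal of ∂ŷ_m/∂y by finite differences. A hedged sketch (fit.fn and eps are illustrative; fit.fn(y) must return the PLS fitted values for fixed X and m, e.g. a wrapper around the NIPALS sketch above):

dof.numeric <- function(fit.fn, y, eps = 1e-6) {
  yhat <- fit.fn(y)
  tr <- 0
  for (i in seq_along(y)) {
    y.pert <- y
    y.pert[i] <- y.pert[i] + eps
    tr <- tr + (fit.fn(y.pert)[i] - yhat[i]) / eps  # i-th diagonal entry of the derivative
  }
  tr  # add 1 if the fit includes an intercept / centering step
}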
Computational Details I (K. & Braun, 2007)
m = 1: We compute the derivative along the lines of the PLS algorithm:
1. w_1 = Xᵀy,  hence ∂w_1 = Xᵀ
2. t_1 = Xw_1,  hence ∂t_1 = X ∂w_1 = XXᵀ
3. ŷ_1 = P_{t_1} y,  hence
   ∂ŷ_1 = (1/‖t_1‖²) (t_1 yᵀ + (t_1ᵀy) I_n)(I_n − P_{t_1}) ∂t_1 + P_{t_1}
m > 1: We rearrange the algorithm in terms of projections P onto the vectors t_i.
Computational Details II (K., Sugiyama, & Braun, 2009)
Define the empirical scatter matrices
  s = Xᵀy and S = XᵀX.
- w_1, ..., w_m is an orthogonal basis of the Krylov space
    K_m(S, s) = span(s, Ss, ..., S^{m−1}s);
  write K_m = (s, Ss, ..., S^{m−1}s) for the matrix of these vectors.
  (PLS computes the orthogonal Gram-Schmidt basis of s, Ss, ..., S^{m−1}s.)
- Minimization property:
    β̂_m = argmin_{β ∈ K_m(S,s)} ‖y − Xβ‖² = K_m (K_mᵀ S K_m)⁻¹ K_mᵀ s
⇒ explicit formula for the trace of ∂ŷ_m (but not for the derivative itself)
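The minimization property yields β̂_m directly. A minimal base-R sketch assuming centered inputs (the function name is illustrative; in practice the Krylov basis should be orthogonalized, since the plain basis is numerically ill-conditioned):

pls.krylov <- function(X, y, m) {
  S <- crossprod(X)     # S = X'X
  s <- crossprod(X, y)  # s = X'y
  K <- matrix(0, ncol(X), m)
  K[, 1] <- s
  if (m > 1) for (i in 2:m) K[, i] <- S %*% K[, i - 1]  # columns s, Ss, ..., S^{m-1}s
  K %*% solve(t(K) %*% S %*% K, crossprod(K, s))        # K (K'SK)^{-1} K's
}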
Summary Computational Details
Two equivalent algorithms
- implemented in the R-package plsdof:
  my.pls<-pls.model(X, y, compute.DoF=TRUE, compute.jacobian=TRUE)
- option: runtime scales in p (number of variables) or in n (number of observations).
                                      iterative computation of ∂ŷ_m    projection onto Krylov subspaces
∂ŷ_m and ∂β̂_m = A?                   yes                              no
confidence intervals for β̂_m,
cov(β̂_m) = σ²AAᵀ?                    yes                              no
run time                              slower                           faster

The details can be found in the paper.
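A hedged usage sketch of the two options (X, y assumed given; that compute.jacobian selects the iterative algorithm, and the DoF field of the returned object, are assumptions to check against the package documentation):

library(plsdof)
fit.full <- pls.model(X, y, compute.DoF = TRUE, compute.jacobian = TRUE)  # also returns A
fit.fast <- pls.model(X, y, compute.DoF = TRUE)                           # DoF only, cheaper
fit.full$DoF  # estimated Degrees of Freedom per number of components (assumed field name)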
Shape of the DoF-Curve
Benchmark data with different correlation structures.
data set   description               variables   examples   training examples   correlation
kin (fh)   dynamics of a robot arm   32          8192       60                  low
boston     census data for housing   13          506        50                  medium
cookie     near infrared spectra     700         70         39                  high
pls.object<-pls.model(X, y, compute.DoF=TRUE)
[Three plots (kin (fh), boston, cookie): estimated Degrees of Freedom versus the number of components, compared with the naive line DoF(m) = m + 1.]
A Lower Bound for the Degrees of Freedom
The lower the collinearity, the higher the Degrees of Freedom.
Theorem
If the largest eigenvalue λ_max of the sample correlation matrix S fulfills
  2 λ_max ≤ trace(S),
then
  DoF(m = 1) ≥ 1 + trace(S)/λ_max.
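A small sketch for evaluating the bound (the function name is illustrative; S is the matrix from the theorem):

dof1.lower.bound <- function(S) {
  lambda.max <- max(eigen(S, symmetric = TRUE, only.values = TRUE)$values)
  if (2 * lambda.max <= sum(diag(S)))  # condition of the theorem
    1 + sum(diag(S)) / lambda.max      # lower bound for DoF(m = 1)
  else
    NA                                 # theorem not applicable
}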
Shape of the DoF-Curve
Benchmark data with different correlation structures, as on the previous slide.
pls.object<-pls.model(X, y, compute.DoF=TRUE)
[Three plots (kin (fh), boston, cookie): Degrees of Freedom versus the number of components, now with the lower bound of the theorem shown alongside the naive line DoF(m) = m + 1.]
Comparison of Regression Methods: ozone
L.A. ozone pollution data
- p = 12 variables
- n = 203 observations, n_train = 50 training observations
Comparison of
1. Partial Least Squares (PLS)
2. Principal Components Regression (PCR)
3. Ridge Regression
   β̂_ridge = argmin_β { ‖y − Xβ‖² + λ‖β‖² }
Comparison of Regression Methods: ozone
mean-squared error and model complexity
[Boxplots: mean squared error (PLS, PCR, Ridge), number of components (PLS, PCR), and Degrees of Freedom (PLS, PCR, Ridge).]
- There is no difference with respect to mean-squared error.
- A direct comparison of model parameters (number of components and λ) is not possible.
- Degrees of Freedom enable a fair model comparison between PLS and PCR.
Comparison of Regression Methods: ozone
[Two plots for PLS (o) and PCR (x): training error versus the number of components, and training error versus the Degrees of Freedom.]
- "PLS fits closer than PCR." (de Jong)
- However, there is no clear difference with respect to the Degrees of Freedom.
- PLS puts its focus on more complex models.
Variable Selection for PLS
PLS does not select variables. → extensions to sparse PLS
1. thresholding of the weight vectors (Saigo, K., & Tsuda, 2008)
2. sparsity constraints on the weight vectors (Lê Cao et al., 2008; Chun & Keleş, 2010)
3. shrinkage (Kondylis & Whittaker, 2007)
4. ...
Classical approaches
5. bootstrapping (R packages pls and ppls)
6. hypothesis testing based on the distribution of β̂
Approximate distribution of β
Recall: Regression coefficients are a non-linear function of y.
- First-order Taylor approximation:
  β̂ ≈ A y,  with A := ∂β̂/∂y
- approximate covariance matrix:
  cov(β̂) = σ² A Aᵀ
- The noise level can be estimated via
  σ̂² = ‖y − ŷ‖² / (n − DoF)
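Combining the three displays gives approximate confidence intervals from the diagonal of σ̂²AAᵀ. A hedged sketch (A is the p × n Jacobian ∂β̂/∂y, e.g. obtained with compute.jacobian=TRUE; the normal quantile and the function name are choices of this illustration):

pls.ci <- function(beta, A, y, yhat, dof, level = 0.95) {
  sigma2 <- sum((y - yhat)^2) / (length(y) - dof)  # noise estimate from above
  se <- sqrt(sigma2 * rowSums(A^2))                # sqrt of diag(sigma2 * A A')
  z <- qnorm(1 - (1 - level) / 2)
  cbind(lower = beta - z * se, upper = beta + z * se)
}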
Confidence Intervals for PLS: ozone and tecator
- implemented in the R-packages plsdof and multcomp:
  cv.object<-pls.cv(X, y, compute.covariance=TRUE)
  my.multcomp<-glht(cv.object,...)
[Two plots of 95% family-wise confidence intervals for the regression coefficients: ozone (variables 1 to 12) and tecator (variables 1 to 100).]
- The function corrects for multiple comparisons.
- The computational cost is high for a large number of variables.
Model Selection
Comparison of
1. 10-fold cross-validation (gold standard)
   cv.object<-pls.cv(X, y, k=10)
2. Bayesian Information Criterion with our DoF estimate
   bic.object<-pls.ic(X, y, criterion="bic")
3. Bayesian Information Criterion with the naive estimate DoF = m + 1
   naive.object<-pls.ic(X, y, criterion="bic", naive=TRUE)
Akaike Information Criterion and Minimum Description Length are also available in the R-package.
data sets: kin (fh), Boston, cookie
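An end-to-end sketch of the comparison on one data set (X, y assumed in the workspace; the m.opt component of the returned objects is an assumption based on the plsdof documentation):

library(plsdof)
cv.object    <- pls.cv(X, y, k = 10)                           # gold standard
bic.object   <- pls.ic(X, y, criterion = "bic")                # our DoF estimate
naive.object <- pls.ic(X, y, criterion = "bic", naive = TRUE)  # DoF = m + 1
c(cv = cv.object$m.opt, bic = bic.object$m.opt, naive = naive.object$m.opt)  # selected components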
Prediction Accuracy
- 10-fold cross-validation
- Bayesian Information Criterion with our DoF estimate
- Bayesian Information Criterion with the naive estimate DoF = m + 1
[Boxplots of the test error for CV, BIC, and naive BIC on kin (fh) (test error × 100), boston, and cookie (test error × 100).]
1. All three approaches obtain similar accuracy.
2. There is no clear difference between BIC and naive BIC. The plots look similar for the selected Degrees of Freedom.
Model complexity (selected components)
- 10-fold cross-validation
- Bayesian Information Criterion with our DoF estimate
- Bayesian Information Criterion with the naive estimate DoF = m + 1
[Boxplots of the number of selected components for CV, BIC, and naive BIC on kin (fh), boston, and cookie.]
1. BIC selects less complex models than naive BIC.
2. There is no clear difference between BIC and CV. The plots look similar for the selected Degrees of Freedom.
Why not use the naive approach?
- The naive approach selects more components, but the mean-squared error is not higher. → no overfitting?
- The test error curve can be flat or steep around the optimum.
[Plot: scaled test error versus the number of components for two settings, d = 50 (x) and d = 210 (o).]
- Depending on the form of the curve, the selection of too complex models does lead to overfitting.
More details in the paper.
Summary
Partial Least Squares ...
- typically consumes more than one Degree of Freedom for each component.
  → precise estimate of its intrinsic complexity:
  DoF(m) = trace(∂ŷ_m/∂y)
Its Degrees of Freedom ...
- allow us to compare different regression methods.
- select less complex models than the naive estimate (when combined with information criteria).
Variables can be selected ...
- by constructing approximate confidence intervals.
References
Krämer, N. and Sugiyama, M. (2011). The Degrees of Freedom of Partial Least Squares Regression. Journal of the American Statistical Association, in press.
Krämer, N. and Braun, M. L. (2010). plsdof: Degrees of Freedom and Confidence Intervals for Partial Least Squares. R package version 0.2-2.
Krämer, N., Sugiyama, M., and Braun, M. L. (2009). Lanczos Approximations for the Speedup of Kernel Partial Least Squares Regression. Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS).
Krämer, N. and Braun, M. L. (2007). Kernelizing PLS, Degrees of Freedom, and Efficient Model Selection. 24th International Conference on Machine Learning (ICML).