The Degrees of Freedom of Partial Least Squares Regression · 2013-04-23

The Degrees of Freedom of Partial Least Squares Regression · Dr. Nicole Krämer, TU München · 5th ESSEC-SUPELEC Research Workshop, May 20, 2011

Transcript:

Page 1:

The Degrees of Freedom of Partial Least Squares Regression

Dr. Nicole Krämer

TU München

5th ESSEC-SUPELEC Research Workshop, May 20, 2011

Page 2:

My talk is about ...

... the statistical analysis of Partial Least Squares Regression.

1. intrinsic complexity of PLS

2. comparison of regression methods

3. model selection

4. variable selection based on confidence intervals

The Degrees of Freedom of PLS · Dr. Nicole Krämer (TUM) · 2 / 27

Page 3:

Example: Near Infrared Spectroscopy

Predict Y, the percentage of water in meat, based on X, its near-infrared spectrum.

- unknown linear relationship

  f(x) = β₀ + 〈β, x〉 = β₀ + ∑_{j=1}^p β_j x^(j)

- We observe yᵢ ≈ f(xᵢ),  i = 1, …, n.

centered data X = (x₁, …, xₙ)ᵀ ∈ ℝ^{n×p} (rows xᵢᵀ),  y = (y₁, …, yₙ) ∈ ℝⁿ.

Page 4:

Partial Least Squares in one Slide

- Partial Least Squares (PLS) =
  1. supervised dimensionality reduction
  2. + least squares regression

[Diagram: X (n × p) → supervised dimensionality reduction → T (n × m) → least squares regression → y (n × 1).]

- The PLS components T have maximal covariance to the response variable y.

- The m ≪ p components T are used as new predictor variables in a least-squares fit.

Page 5:

Partial Least Squares Algorithm

Partial Least Squares (PLS) =

1. supervised dimensionality reduction

2. + least squares regression

PLS components T have maximal covariance to y.


Algorithm (NIPALS)

X₁ = X. For i = 1, …, m   (m ← model parameter):

1. wᵢ ∝ Xᵢᵀy   (maximize covariance: wᵢ maximizes cov(Xᵢw, y)/√(wᵀw) = wᵀXᵢᵀy/√(wᵀw))
2. tᵢ ∝ Xᵢwᵢ   (latent component)
3. Xᵢ₊₁ = Xᵢ − tᵢtᵢᵀXᵢ   (enforce orthogonality)

Return T = (t₁, …, t_m) and W = (w₁, …, w_m).

β̂_m = W(TᵀXW)⁻¹Tᵀy
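The loop above is short enough to state directly; a minimal illustrative sketch in Python/NumPy (not the plsdof implementation; weights and scores are normalized here, which leaves the fit unchanged):

```python
import numpy as np

def nipals_pls(X, y, m):
    """PLS via NIPALS as on the slide; returns scores T, weights W, and beta_m."""
    n, p = X.shape
    Xi = X.copy()
    W = np.empty((p, m))
    T = np.empty((n, m))
    for i in range(m):
        w = Xi.T @ y                   # 1. w_i ∝ X_i' y  (maximize covariance)
        w /= np.linalg.norm(w)
        t = Xi @ w                     # 2. t_i ∝ X_i w_i  (latent component)
        t /= np.linalg.norm(t)
        Xi = Xi - np.outer(t, t @ Xi)  # 3. X_{i+1} = X_i - t_i t_i' X_i  (orthogonality)
        W[:, i], T[:, i] = w, t
    beta = W @ np.linalg.solve(T.T @ X @ W, T.T @ y)  # beta_m = W (T'XW)^{-1} T'y
    return T, W, beta
```

A handy sanity check: when m equals the rank of X, the Krylov space exhausts ℝ^p and the coefficients coincide with ordinary least squares.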

Page 6:

Degrees of Freedom: Why are they important?

Degrees of Freedom (DoF)

1. capture the intrinsic complexity of a regression method.

Yᵢ = f(xᵢ) + εᵢ,  εᵢ ∼ N(0, σ²),

2. are used for model selection.

test error = training error + complexity(DoF)

Examples:

- Bayesian Information Criterion

  BIC = ‖y − ŷ‖²/n + (log(n)·v̂ar(ε)/n)·DoF

- Akaike Information Criterion, Minimum Description Length, …
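As the formula suggests, the criterion is a one-line computation once the fit, a noise-variance estimate, and a DoF value are available; a small illustrative Python sketch (the function name and signature are mine, not from the talk):

```python
import numpy as np

def bic(y, y_hat, dof, sigma2):
    """BIC = ||y - y_hat||^2 / n + log(n) * sigma2 / n * DoF (the slide's form)."""
    n = len(y)
    return np.sum((y - y_hat) ** 2) / n + np.log(n) * sigma2 / n * dof
```

Among candidate models (e.g. PLS with m = 1, 2, …) one selects the m minimizing this value, so an underestimated DoF biases selection toward more complex models.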

Page 7:

Definition: Degrees of Freedom

Assumption: The regression method is linear, i.e. ŷ = Hy, with H independent of y.

Degrees of Freedom

The Degrees of Freedom of a linear fitting method are

DoF = trace(H).

Examples

- Principal Components Regression with m components:

  DoF(m) = 1 + m

- Ridge Regression, smoothing splines, …
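For a linear method the DoF is literally the trace of the hat matrix; an illustrative Python sketch for ridge regression, where H = X(XᵀX + λI)⁻¹Xᵀ (the `+ 1` accounts for the intercept, as in the PCR example):

```python
import numpy as np

def ridge_dof(X, lam):
    """DoF = 1 + trace(H) with H = X (X'X + lam I)^{-1} X'."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return 1 + np.trace(H)
```

Equivalently, trace(H) = Σᵢ dᵢ²/(dᵢ² + λ) over the singular values dᵢ of X, so the DoF shrinks smoothly from 1 + rank(X) at λ = 0 toward 1 as λ grows.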

Page 8:

Naive Degrees of Freedom for PLS?

- Recall: PLS is not linear in y:

  ŷ_m = ȳ + T(TᵀT)⁻¹Tᵀ·y,  where H := T(TᵀT)⁻¹Tᵀ depends on y.


→ Degrees of Freedom ≠ 1 + trace(H) = 1 + m.

- If we ignore the nonlinearity, we obtain DoF_naive(m) = 1 + m.

Page 9:

Degrees of Freedom for PLS (K. & Braun, 2007; K. & Sugiyama, 2011)

The generalized Degrees of Freedom of a regression method are

DoF = E_Y[trace(∂ŷ/∂y)].

This coincides with the previous definition if the method is linear.

Proposition

An unbiased estimate of the Degrees of Freedom of PLS is

DoF̂(m) = trace(∂ŷ_m/∂y).

We need to compute (the trace of) the first derivative.
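The claim that the generalized definition coincides with trace(H) for a linear method can be checked numerically: a finite-difference estimate of trace(∂ŷ/∂y) must recover trace(H). An illustrative Python sketch, using ridge regression as the linear method:

```python
import numpy as np

def fd_dof(fit, y, eps=1e-6):
    """Finite-difference estimate of trace(d y_hat / d y)."""
    base = fit(y)
    tr = 0.0
    for i in range(len(y)):
        yp = y.copy()
        yp[i] += eps
        tr += (fit(yp)[i] - base[i]) / eps  # i-th diagonal entry of the Jacobian
    return tr

# a linear method: ridge with fixed lambda, so H does not depend on y
np.random.seed(2)
X = np.random.randn(25, 6)
lam = 1.0
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(6), X.T)
ridge_fit = lambda y: H @ y
```

The same `fd_dof` applied to a PLS fit gives a (slow) numerical version of the trace that the next slides compute analytically.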

Page 10:

Computational Details I (K. & Braun, 2007)

m = 1: We compute the derivative along the lines of the PLS algorithm.

1. w₁ = Xᵀy,  ∂w₁ = Xᵀ
2. t₁ = Xw₁,  ∂t₁ = X·∂w₁ = XXᵀ
3. ŷ₁ = P_{t₁}y,  ∂ŷ₁ = (1/‖t₁‖²)·(t₁yᵀ + t₁ᵀy·Iₙ)·(Iₙ − P_{t₁})·∂t₁ + P_{t₁}

m > 1: We rearrange the algorithm in terms of projections P onto vectors tᵢ.
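The m = 1 derivative can be verified numerically by comparing its trace with a finite-difference estimate; an illustrative Python sketch (w₁ and t₁ unnormalized, as on the slide):

```python
import numpy as np

def pls1_fit(X, y):
    """y_hat_1 = P_{t1} y with t1 = X X' y."""
    t = X @ (X.T @ y)
    return t * (t @ y) / (t @ t)

def pls1_dof(X, y):
    """trace of (1/||t1||^2)(t1 y' + t1'y I)(I - P_{t1}) XX' + P_{t1}."""
    n = len(y)
    K = X @ X.T                       # = d t1 / d y
    t = K @ y
    P = np.outer(t, t) / (t @ t)      # projection onto t1
    D = (np.outer(t, y) + (t @ y) * np.eye(n)) @ (np.eye(n) - P) @ K / (t @ t) + P
    return np.trace(D)
```

Note the trace already exceeds 1 in general, illustrating why one component can consume more than one Degree of Freedom.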

Page 11:

Computational Details II (K., Sugiyama, & Braun, 2009)

Define empirical scatter matrices s = Xᵀy and S = XᵀX.

- w₁, …, w_m is an orthogonal basis of the Krylov space

  K_m(S, s) = span(s, Ss, …, S^{m−1}s),  with Krylov matrix K_m := (s, Ss, …, S^{m−1}s).

  PLS computes an orthogonal Gram-Schmidt basis of s, Ss, …, S^{m−1}s.

- Minimization property:

  β̂_m = argmin_{β ∈ K_m(S,s)} ‖y − Xβ‖² = K_m(K_mᵀ S K_m)⁻¹ K_mᵀ s

⇝ explicit formula for the trace of ∂ŷ_m (but not for the derivative itself)
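The minimization property gives a second route to β̂_m; an illustrative Python sketch comparing the Krylov formula with a compact NIPALS implementation (both are sketches, not the plsdof code; the raw Krylov basis is ill-conditioned, so small m only):

```python
import numpy as np

def nipals_beta(X, y, m):
    """beta_m via NIPALS deflation (cf. the algorithm slide)."""
    Xi = X.copy()
    W, T = [], []
    for _ in range(m):
        w = Xi.T @ y; w /= np.linalg.norm(w)
        t = Xi @ w;   t /= np.linalg.norm(t)
        Xi = Xi - np.outer(t, t @ Xi)
        W.append(w); T.append(t)
    W, T = np.column_stack(W), np.column_stack(T)
    return W @ np.linalg.solve(T.T @ X @ W, T.T @ y)

def krylov_beta(X, y, m):
    """beta_m = K_m (K_m' S K_m)^{-1} K_m' s with s = X'y, S = X'X."""
    s, S = X.T @ y, X.T @ X
    K = np.column_stack([np.linalg.matrix_power(S, j) @ s for j in range(m)])
    return K @ np.linalg.solve(K.T @ S @ K, K.T @ s)
```

Both return the same coefficient vector, which is exactly the equivalence the slide states.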

Page 12:

Summary Computational Details

Two equivalent algorithms

- implemented in the R package plsdof:
  my.pls <- pls.model(X, y, compute.DoF=TRUE, compute.jacobian=TRUE)

- option: runtime scales in p (number of variables) or in n (number of observations).

                                      iterative computation of ∂ŷ_m   projection on Krylov subspaces
∂ŷ_m and ∂β̂_m = A?                              yes                             no
confidence intervals for β̂_m                    yes                             no
(cov(β̂_m) = σ²AAᵀ)?
run time                                         slow                            fast

The details can be found in the paper.

Page 13:

Shape of the DoF-Curve

Benchmark data with different correlation structures.

data set   description               variables   examples   training examples   correlation
kin (fh)   dynamics of a robot arm   32          8192       60                  low
boston     census data for housing   13          506        50                  medium
cookie     near infrared spectra     700         70         39                  high

pls.object <- pls.model(X, y, compute.DoF=TRUE)

[Figure: estimated Degrees of Freedom vs. number of components for kin (fh), boston, and cookie; the naive line DoF(m) = m + 1 is shown in each panel.]

Page 14:

A Lower Bound for the Degrees of Freedom

The lower the collinearity, the higher the Degrees of Freedom.

Theorem

If the largest eigenvalue λ_max of the sample correlation matrix S fulfills

  2λ_max ≤ trace(S),

then

  DoF(m = 1) ≥ 1 + trace(S)/λ_max.

Page 15:

Shape of the DoF-Curve

Benchmark data with different correlation structures (same data sets and call as on the previous slide).

[Figure: Degrees of Freedom vs. number of components for kin (fh), boston, and cookie, now with the lower bound of the theorem added; the naive line DoF(m) = m + 1 is shown for reference.]

Page 16:

Comparison of Regression Methods: ozone

L.A. ozone pollution data

- p = 12 variables

- n = 203 observations, n_train = 50 training observations

Comparison of

1. Partial Least Squares (PLS)

2. Principal Components Regression (PCR)

3. Ridge Regression

β̂_ridge = argmin_β { ‖y − Xβ‖² + λ‖β‖² }

Page 17:

Comparison of Regression Methods: ozone

mean-squared error and model complexity

[Figure: boxplots of mean squared error (PLS, PCR, Ridge), number of components (PLS, PCR), and Degrees of Freedom (PLS, PCR, Ridge).]

- There is no difference with respect to mean squared error.

- A direct comparison of model parameters (number of components and λ) is not possible.

- Degrees of Freedom enable a fair model comparison between PLS and PCR.

Page 18:

Comparison of Regression Methods: ozone

[Figure: training error of PLS (o) and PCR (x) plotted against the number of components (left) and against the Degrees of Freedom (right).]

- "PLS fits closer than PCR." (de Jong)

- However, there is no clear difference with respect to the Degrees of Freedom.

- PLS puts its focus on more complex models.

Page 19:

Variable Selection for PLS

PLS does not select variables. ⇝ extensions to sparse PLS

1. thresholding of the weight vectors (Saigo, K., & Tsuda, 2008)

2. sparsity constraints on the weight vectors (Le Cao et al., 2008; Chun & Keles, 2010)

3. shrinkage (Kondylis & Whittaker, 2007)

4. …

Classical approaches

5. bootstrapping (R packages pls and ppls)

6. hypothesis testing based on the distribution of β

Page 20:

Approximate distribution of β

Recall: Regression coefficients are a non-linear function of y.

- First order Taylor approximation

  β̂ ≈ A·y,  with A := ∂β̂/∂y

- approximate covariance matrix

  cov(β̂) = σ²AAᵀ

- The noise level can be estimated via

  σ̂² = ‖y − ŷ‖²/(n − DoF)
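For a linear method the first-order Taylor step is exact, which makes the recipe easy to sanity-check; an illustrative Python sketch with ridge regression, where A = (XᵀX + λI)⁻¹Xᵀ, so that λ = 0 reduces to the familiar OLS covariance σ²(XᵀX)⁻¹:

```python
import numpy as np

def coef_covariance(X, lam, sigma2):
    """cov(beta_hat) = sigma^2 A A' with A = d beta_hat / d y = (X'X + lam I)^{-1} X'."""
    p = X.shape[1]
    A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return sigma2 * A @ A.T

def noise_estimate(y, y_hat, dof):
    """sigma_hat^2 = ||y - y_hat||^2 / (n - DoF), as on the slide."""
    return np.sum((y - y_hat) ** 2) / (len(y) - dof)
```

A pointwise 95% interval for β_j is then β̂_j ± 1.96·sqrt(cov_jj); the multcomp step on the next slide additionally corrects these intervals for multiple comparisons.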

Page 21:

Confidence Intervals for PLS: ozone and tecator

- implemented in the R packages plsdof and multcomp:
  cv.object <- pls.cv(X, y, compute.covariance=TRUE)
  my.multcomp <- glht(cv.object, ...)

[Figure: 95% family-wise confidence intervals for the regression coefficients of the ozone data (12 variables) and the tecator data (100 variables).]

- The function corrects for multiple comparisons.

- The computational cost is high for a large number of variables.

Page 22:

Model Selection

Comparison of

1. 10-fold cross-validation (gold standard)
   cv.object <- pls.cv(X, y, k=10)

2. Bayesian Information Criterion with our DoF estimate
   bic.object <- pls.ic(X, y, criterion="bic")

3. Bayesian Information Criterion with the naive estimate DoF = m + 1
   naive.object <- pls.ic(X, y, criterion="bic", naive=TRUE)

Akaike Information Criterion and Minimum Description Length are also available in the R-package.

data sets: kin (fh), Boston, cookie

Page 23:

Prediction Accuracy

- 10-fold cross-validation

- Bayesian Information Criterion with our DoF estimate

- Bayesian Information Criterion with the naive estimate DoF = m + 1

[Figure: boxplots of test error under CV, BIC, and naive BIC for kin (fh), boston, and cookie.]

1. All three approaches obtain similar accuracy.

2. There is no clear difference between BIC and naive BIC. The plots look similar for the selected Degrees of Freedom.

Page 24:

Model complexity (selected components)

- 10-fold cross-validation

- Bayesian Information Criterion with our DoF estimate

- Bayesian Information Criterion with the naive estimate DoF = m + 1

[Figure: boxplots of the number of selected components under CV, BIC, and naive BIC for kin (fh), boston, and cookie.]

1. BIC selects less complex models than naive BIC.

2. There is no clear difference between BIC and CV. The plots look similar for the selected Degrees of Freedom.

Page 25:

Why not use the naive approach?

- The naive approach selects more components, but the mean squared error is not higher. ⇝ no overfitting?

- The test error curve can be flat or steep around the optimum.

[Figure: scaled test error vs. number of components for d = 50 (x) and d = 210 (o); one curve is flat around the optimum, the other steep.]

- Depending on the shape of the curve, selecting models that are too complex does lead to overfitting.

More details in the paper.

Page 26:

Summary

Partial Least Squares ...

- typically consumes more than one Degree of Freedom for each component.

  → a precise estimate of its intrinsic complexity: DoF̂(m) = trace(∂ŷ_m/∂y)

Its Degrees of Freedom …

- allow us to compare different regression methods.

- select less complex models than the naive estimate (when combined with information criteria).

Variables can be selected …

- by constructing approximate confidence intervals.

Page 27:

References

Krämer, N. and Sugiyama, M. (2011). The Degrees of Freedom of Partial Least Squares Regression. Journal of the American Statistical Association, in press.

Krämer, N. and Braun, M. L. (2010). plsdof: Degrees of Freedom and Confidence Intervals for Partial Least Squares. R package version 0.2-2.

Krämer, N., Sugiyama, M., and Braun, M. L. (2009). Lanczos Approximations for the Speedup of Kernel Partial Least Squares Regression. Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS).

Krämer, N. and Braun, M. L. (2007). Kernelizing PLS, Degrees of Freedom, and Efficient Model Selection. 24th International Conference on Machine Learning (ICML).
