The Degrees of Freedom of Partial Least Squares Regression
Dr. Nicole Krämer
TU München
5th ESSEC-SUPELEC Research Workshop, May 20, 2011
My talk is about ...
... the statistical analysis of Partial Least Squares Regression.
1. intrinsic complexity of PLS
2. comparison of regression methods
3. model selection
4. variable selection based on confidence intervals
Example: Near Infrared Spectroscopy
Predict
Y, the percentage of water in meat, based on
X, its near infrared spectrum.
- unknown linear relationship
  f(x) = β₀ + ⟨β, x⟩ = β₀ + Σ_{j=1}^p β_j x^(j)
- We observe y_i ≈ f(x_i), i = 1, ..., n.
- centered data X = (x_1ᵀ, ..., x_nᵀ)ᵀ ∈ ℝ^(n×p) and y = (y_1, ..., y_n) ∈ ℝⁿ.
Partial Least Squares in one Slide
- Partial Least Squares (PLS) =
  1. supervised dimensionality reduction
  2. + least squares regression
[Diagram: supervised dimensionality reduction maps X (n × p) to the components T (n × m); T and y (n × 1) then enter a least squares regression.]
- The PLS components T have maximal covariance with the response variable y.
- The m ≪ p components T are used as new predictor variables in a least-squares fit.
Partial Least Squares Algorithm
Partial Least Squares (PLS) =
1. supervised dimensionality reduction
2. + least squares regression
PLS components T have maximal covariance with y.
Algorithm (NIPALS)
X_1 = X. For i = 1, ..., m (m is the model parameter):
1. w_i ∝ X_iᵀy (maximizes the covariance: cov(X_i w_i, y)/(wᵀw) = w_iᵀX_iᵀy/(wᵀw))
2. t_i ∝ X_i w_i (latent component)
3. X_{i+1} = X_i − t_i t_iᵀ X_i (enforces orthogonality)
Return T = (t_1, ..., t_m) and W = (w_1, ..., w_m).
The regression coefficients are β̂_m = W(TᵀXW)⁻¹Tᵀy.
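A minimal base-R sketch of this algorithm (the names pls.nipals, Tm, etc. are illustrative, not from any package; X and y are assumed centered):

pls.nipals <- function(X, y, m) {
  n <- nrow(X); p <- ncol(X)
  W  <- matrix(0, p, m)  # weight vectors w_1, ..., w_m
  Tm <- matrix(0, n, m)  # latent components t_1, ..., t_m
  Xi <- X
  for (i in 1:m) {
    wi <- crossprod(Xi, y)               # w_i prop. to X_i' y (maximizes covariance)
    wi <- wi / sqrt(sum(wi^2))
    ti <- Xi %*% wi                      # latent component t_i
    ti <- ti / sqrt(sum(ti^2))
    Xi <- Xi - ti %*% crossprod(ti, Xi)  # deflation enforces orthogonality
    W[, i] <- wi; Tm[, i] <- ti
  }
  beta <- W %*% solve(crossprod(Tm, X %*% W), crossprod(Tm, y))  # W (T'XW)^{-1} T'y
  list(T = Tm, W = W, beta = beta)
}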
Degrees of Freedom: Why are they important?
Degrees of Freedom (DoF)
1. capture the intrinsic complexity of a regression method,
   Y_i = f(x_i) + ε_i,  ε_i ∼ N(0, σ²),
2. are used for model selection:
   test error ≈ training error + complexity penalty (DoF)
Examples:
- Bayesian Information Criterion (see the sketch below)
  BIC = ‖y − ŷ‖²/n + (log(n) var(ε)/n) · DoF
- Akaike Information Criterion, Minimum Description Length, ...
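In code, the criterion above is a one-liner (a sketch with hypothetical argument names; sigma2 stands for the estimate of var(ε), which the talk later obtains from the residuals and the DoF):

bic <- function(y, yhat, dof, sigma2) {
  n <- length(y)
  # ||y - yhat||^2 / n + log(n) * var(eps) / n * DoF
  sum((y - yhat)^2) / n + log(n) * sigma2 / n * dof
}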
Definition: Degrees of Freedom
Assumption: The regression method is linear, i.e.
ŷ = H y, with H independent of y.
Degrees of Freedom
The Degrees of Freedom of a linear fitting method are
DoF = trace(H).
Examples
- Principal Components Regression with m components:
  DoF(m) = 1 + m
- Ridge Regression, Smoothing Splines, ... (see the sketch below)
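For instance, for Ridge Regression the hat matrix is H = X(XᵀX + λI)⁻¹Xᵀ, so trace(H) follows from the singular values of X. A minimal sketch (the function name is illustrative; add 1 for a fitted intercept):

ridge.dof <- function(X, lambda) {
  d <- svd(X)$d              # singular values of X
  sum(d^2 / (d^2 + lambda))  # trace(H) = sum_j d_j^2 / (d_j^2 + lambda)
}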
Naive Degrees of Freedom for PLS?
- Recall: PLS is not linear in y:
  ŷ_m = ȳ + T(TᵀT)⁻¹Tᵀ y,  where H := T(TᵀT)⁻¹Tᵀ depends on y.
→ Degrees of Freedom ≠ 1 + trace(H) = 1 + m.
- If we ignore the nonlinearity, we obtain
  DoF_naive(m) = 1 + m.
Degrees of Freedom for PLS (K. & Braun, 2007; K. & Sugiyama, 2011)
The generalized Degrees of Freedom of a regression method are
  DoF = E_Y[ trace(∂ŷ/∂y) ].
This coincides with the previous definition if the method is linear.
Proposition
An unbiased estimate of the Degrees of Freedom of PLS is
  DoF(m) = trace(∂ŷ_m/∂y).
We need to compute (the trace of) the first derivative.
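Before the closed-form details, the estimate can be checked numerically: perturb each y_i, refit, and accumulate the diagonal of ∂ŷ_m/∂y by finite differences. A hedged sketch (fit.fn and eps are illustrative; fit.fn(y) must return the PLS fitted values for fixed X and m, e.g. a wrapper around the NIPALS sketch above):

dof.numeric <- function(fit.fn, y, eps = 1e-6) {
  yhat <- fit.fn(y)
  tr <- 0
  for (i in seq_along(y)) {
    y.pert <- y
    y.pert[i] <- y.pert[i] + eps
    tr <- tr + (fit.fn(y.pert)[i] - yhat[i]) / eps  # i-th diagonal entry of the derivative
  }
  tr  # add 1 if the fit includes an intercept / centering step
}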
Computational Details I (K. & Braun, 2007)
m = 1: We compute the derivative along the lines of the PLS algorithm:
1. w_1 = Xᵀy,  hence ∂w_1 = Xᵀ
2. t_1 = Xw_1,  hence ∂t_1 = X ∂w_1 = XXᵀ
3. ŷ_1 = P_{t_1} y,  hence
   ∂ŷ_1 = (1/‖t_1‖²) (t_1 yᵀ + (t_1ᵀy) I_n)(I_n − P_{t_1}) ∂t_1 + P_{t_1}
m > 1: We rearrange the algorithm in terms of projections P onto the vectors t_i.
Computational Details II (K., Sugiyama, & Braun, 2009)
Define the empirical scatter matrices
  s = Xᵀy and S = XᵀX.
- w_1, ..., w_m is an orthogonal basis of the Krylov space
    K_m(S, s) = span(s, Ss, ..., S^{m−1}s);
  write K_m = (s, Ss, ..., S^{m−1}s) for the matrix of these vectors.
  (PLS computes the orthogonal Gram-Schmidt basis of s, Ss, ..., S^{m−1}s.)
- Minimization property:
    β̂_m = argmin_{β ∈ K_m(S,s)} ‖y − Xβ‖² = K_m (K_mᵀ S K_m)⁻¹ K_mᵀ s
⇒ explicit formula for the trace of ∂ŷ_m (but not for the derivative itself)
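The minimization property yields β̂_m directly. A minimal base-R sketch assuming centered inputs (the function name is illustrative; in practice the Krylov basis should be orthogonalized, since the plain basis is numerically ill-conditioned):

pls.krylov <- function(X, y, m) {
  S <- crossprod(X)     # S = X'X
  s <- crossprod(X, y)  # s = X'y
  K <- matrix(0, ncol(X), m)
  K[, 1] <- s
  if (m > 1) for (i in 2:m) K[, i] <- S %*% K[, i - 1]  # columns s, Ss, ..., S^{m-1}s
  K %*% solve(t(K) %*% S %*% K, crossprod(K, s))        # K (K'SK)^{-1} K's
}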
Summary Computational Details
Two equivalent algorithms
- implemented in the R-package plsdof:
  my.pls<-pls.model(X, y, compute.DoF=TRUE, compute.jacobian=TRUE)
- option: runtime scales in p (number of variables) or in n (number of observations).
                                      iterative computation of ∂ŷ_m    projection onto Krylov subspaces
∂ŷ_m and ∂β̂_m = A?                   yes                              no
confidence intervals for β̂_m,
cov(β̂_m) = σ²AAᵀ?                    yes                              no
run time                              slower                           faster

The details can be found in the paper.
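A hedged usage sketch of the two options (X, y assumed given; that compute.jacobian selects the iterative algorithm, and the DoF field of the returned object, are assumptions to check against the package documentation):

library(plsdof)
fit.full <- pls.model(X, y, compute.DoF = TRUE, compute.jacobian = TRUE)  # also returns A
fit.fast <- pls.model(X, y, compute.DoF = TRUE)                           # DoF only, cheaper
fit.full$DoF  # estimated Degrees of Freedom per number of components (assumed field name)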
Shape of the DoF-Curve
Benchmark data with different correlation structures.
data set   description               variables   examples   training examples   correlation
kin (fh)   dynamics of a robot arm   32          8192       60                  low
boston     census data for housing   13          506        50                  medium
cookie     near infrared spectra     700         70         39                  high
pls.object<-pls.model(X, y, compute.DoF=TRUE)
[Three plots (kin (fh), boston, cookie): estimated Degrees of Freedom versus the number of components, compared with the naive line DoF(m) = m + 1.]
A Lower Bound for the Degrees of Freedom
The lower the collinearity, the higher the Degrees of Freedom.
Theorem
If the largest eigenvalue λ_max of the sample correlation matrix S fulfills
  2 λ_max ≤ trace(S),
then
  DoF(m = 1) ≥ 1 + trace(S)/λ_max.
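A small sketch for evaluating the bound (the function name is illustrative; S is the matrix from the theorem):

dof1.lower.bound <- function(S) {
  lambda.max <- max(eigen(S, symmetric = TRUE, only.values = TRUE)$values)
  if (2 * lambda.max <= sum(diag(S)))  # condition of the theorem
    1 + sum(diag(S)) / lambda.max      # lower bound for DoF(m = 1)
  else
    NA                                 # theorem not applicable
}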
Shape of the DoF-Curve
Benchmark data with different correlation structures, as on the previous slide.
pls.object<-pls.model(X, y, compute.DoF=TRUE)
[Three plots (kin (fh), boston, cookie): Degrees of Freedom versus the number of components, now with the lower bound of the theorem shown alongside the naive line DoF(m) = m + 1.]
Comparison of Regression Methods: ozone
L.A. ozone pollution data
- p = 12 variables
- n = 203 observations, n_train = 50 training observations
Comparison of
1. Partial Least Squares (PLS)
2. Principal Components Regression (PCR)
3. Ridge Regression
   β̂_ridge = argmin_β { ‖y − Xβ‖² + λ‖β‖² }
Comparison of Regression Methods: ozone
mean-squared error and model complexity
[Boxplots: mean squared error (PLS, PCR, Ridge), number of components (PLS, PCR), and Degrees of Freedom (PLS, PCR, Ridge).]
- There is no difference with respect to mean-squared error.
- A direct comparison of model parameters (number of components and λ) is not possible.
- Degrees of Freedom enable a fair model comparison between PLS and PCR.
Comparison of Regression Methods: ozone
[Two plots for PLS (o) and PCR (x): training error versus the number of components, and training error versus the Degrees of Freedom.]
- "PLS fits closer than PCR." (de Jong)
- However, there is no clear difference with respect to the Degrees of Freedom.
- PLS puts its focus on more complex models.
Variable Selection for PLS
PLS does not select variables. → extensions to sparse PLS
1. thresholding of the weight vectors (Saigo, K., & Tsuda, 2008)
2. sparsity constraints on the weight vectors (Lê Cao et al., 2008; Chun & Keleş, 2010)
3. shrinkage (Kondylis & Whittaker, 2007)
4. ...
Classical approaches
5. bootstrapping (R packages pls and ppls)
6. hypothesis testing based on the distribution of β̂
Approximate distribution of β
Recall: Regression coefficients are a non-linear function of y.
- First-order Taylor approximation:
  β̂ ≈ A y,  with A := ∂β̂/∂y
- approximate covariance matrix:
  cov(β̂) = σ² A Aᵀ
- The noise level can be estimated via
  σ̂² = ‖y − ŷ‖² / (n − DoF)
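Combining the three displays gives approximate confidence intervals from the diagonal of σ̂²AAᵀ. A hedged sketch (A is the p × n Jacobian ∂β̂/∂y, e.g. obtained with compute.jacobian=TRUE; the normal quantile and the function name are choices of this illustration):

pls.ci <- function(beta, A, y, yhat, dof, level = 0.95) {
  sigma2 <- sum((y - yhat)^2) / (length(y) - dof)  # noise estimate from above
  se <- sqrt(sigma2 * rowSums(A^2))                # sqrt of diag(sigma2 * A A')
  z <- qnorm(1 - (1 - level) / 2)
  cbind(lower = beta - z * se, upper = beta + z * se)
}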
Confidence Intervals for PLS: ozone and tecator
- implemented in the R-packages plsdof and multcomp:
  cv.object<-pls.cv(X, y, compute.covariance=TRUE)
  my.multcomp<-glht(cv.object,...)
[Two plots of 95% family-wise confidence intervals for the regression coefficients: ozone (variables 1 to 12) and tecator (variables 1 to 100).]
- The function corrects for multiple comparisons.
- The computational cost is high for a large number of variables.
Model Selection
Comparison of
1. 10-fold cross-validation (gold standard)
   cv.object<-pls.cv(X, y, k=10)
2. Bayesian Information Criterion with our DoF estimate
   bic.object<-pls.ic(X, y, criterion="bic")
3. Bayesian Information Criterion with the naive estimate DoF = m + 1
   naive.object<-pls.ic(X, y, criterion="bic", naive=TRUE)
Akaike Information Criterion and Minimum Description Length are also available in the R-package.
data sets: kin (fh), Boston, cookie
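An end-to-end sketch of the comparison on one data set (X, y assumed in the workspace; the m.opt component of the returned objects is an assumption based on the plsdof documentation):

library(plsdof)
cv.object    <- pls.cv(X, y, k = 10)                           # gold standard
bic.object   <- pls.ic(X, y, criterion = "bic")                # our DoF estimate
naive.object <- pls.ic(X, y, criterion = "bic", naive = TRUE)  # DoF = m + 1
c(cv = cv.object$m.opt, bic = bic.object$m.opt, naive = naive.object$m.opt)  # selected components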
Prediction Accuracy
- 10-fold cross-validation
- Bayesian Information Criterion with our DoF estimate
- Bayesian Information Criterion with the naive estimate DoF = m + 1
[Boxplots of the test error for CV, BIC, and naive BIC on kin (fh) (test error × 100), boston, and cookie (test error × 100).]
1. All three approaches obtain similar accuracy.
2. There is no clear difference between BIC and naive BIC. The plots look similar for the selected Degrees of Freedom.
Model complexity (selected components)
- 10-fold cross-validation
- Bayesian Information Criterion with our DoF estimate
- Bayesian Information Criterion with the naive estimate DoF = m + 1
[Boxplots of the number of selected components for CV, BIC, and naive BIC on kin (fh), boston, and cookie.]
1. BIC selects less complex models than naive BIC.
2. There is no clear difference between BIC and CV. The plots look similar for the selected Degrees of Freedom.
Why not use the naive approach?
- The naive approach selects more components, but the mean-squared error is not higher. → no overfitting?
- The test error curve can be flat or steep around the optimum.
[Plot: scaled test error versus the number of components for two settings, d = 50 (x) and d = 210 (o).]
- Depending on the form of the curve, the selection of too complex models does lead to overfitting.
More details in the paper.
Summary
Partial Least Squares ...
- typically consumes more than one Degree of Freedom for each component.
  → precise estimate of its intrinsic complexity:
  DoF(m) = trace(∂ŷ_m/∂y)
Its Degrees of Freedom ...
- allow us to compare different regression methods.
- select less complex models than the naive estimate (when combined with information criteria).
Variables can be selected ...
- by constructing approximate confidence intervals.
References
Krämer, N. and Sugiyama, M. (2011). The Degrees of Freedom of Partial Least Squares Regression. Journal of the American Statistical Association, in press.
Krämer, N. and Braun, M. L. (2010). plsdof: Degrees of Freedom and Confidence Intervals for Partial Least Squares. R package version 0.2-2.
Krämer, N., Sugiyama, M., and Braun, M. L. (2009). Lanczos Approximations for the Speedup of Kernel Partial Least Squares Regression. Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS).
Krämer, N. and Braun, M. L. (2007). Kernelizing PLS, Degrees of Freedom, and Efficient Model Selection. 24th International Conference on Machine Learning (ICML).