Linear Methods for Regression
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Assumption: Linear Regression Function
Model assumption: the output Y is linear in the inputs X = (X_1, X_2, X_3, …, X_p):

$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j = X^T \beta$$

(vector notation, with the constant 1 included in X). Predict the output by $\hat{Y} = X^T \hat{\beta}$.

In matrix-vector notation,

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad
X = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots \\
1 & x_{N1} & \cdots & x_{Np}
\end{pmatrix}$$

Also known as multiple regression when p > 1.
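As a concrete illustration of this matrix form (a minimal sketch, not part of the original slides; all names and values are made up):

```python
import numpy as np

# A leading column of ones in X lets the same inner product
# x^T beta handle the intercept beta_0.
rng = np.random.default_rng(0)
N, p = 5, 2
inputs = rng.normal(size=(N, p))            # raw inputs x_i1, ..., x_ip
X = np.column_stack([np.ones(N), inputs])   # N x (p+1) design matrix
beta = np.array([1.0, 0.5, -2.0])           # (beta_0, beta_1, beta_2)
y_hat = X @ beta                            # one prediction per row
```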
Least Squares Solution
Residual sum of squares:

$$RSS(\beta) = \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2 = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2$$

In matrix-vector notation, with residual $y - X\beta$:

$$RSS(\beta) = (y - X\beta)^T (y - X\beta)$$

Vector differentiation:

$$\frac{\partial RSS(\beta)}{\partial \beta} = -2X^T(y - X\beta) = 0$$

Solution:

$$\hat\beta = (X^T X)^{-1} X^T y$$

This is known as the least squares solution. For a new input $x_0 = (1, x_{01}, x_{02}, \ldots, x_{0p})^T$, the regression output is

$$\hat{Y}(x_0) = x_0^T \hat\beta = x_0^T (X^T X)^{-1} X^T y$$
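The closed-form solution above translates directly into code. Below is a minimal NumPy sketch (our illustration with simulated data, not the course's code):

```python
import numpy as np

def least_squares(X, y):
    """Closed-form least squares: beta_hat = (X^T X)^{-1} X^T y.

    X is assumed to already contain the leading column of ones.
    Solving the normal equations with np.linalg.solve avoids forming
    the explicit inverse, which is cheaper and more stable.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage with simulated data (N = 100 samples, p = 3 inputs).
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=N)

beta_hat = least_squares(X, y)
x0 = np.array([1.0, 0.2, -0.1, 0.5])   # a new input, 1 included
y0_hat = x0 @ beta_hat                  # regression output at x0
```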
Bias-Variance Decomposition
Model: $Y = f(X) + \varepsilon$, where $f(X) = X^T\beta$ and the noise $\varepsilon$ has zero expectation, the same variance $\sigma^2$ for every observation, and is uncorrelated across observations.

Estimator: $\hat{y}(x_0) = \hat{f}(x_0) = x_0^T (X^T X)^{-1} X^T y$

Unbiased estimator! (Ex. Show the last step.)

Bias:

$$\begin{aligned}
f(x_0) - E[\hat f(x_0)] &= x_0^T\beta - E[x_0^T\hat\beta] \\
&= x_0^T\beta - E\big[x_0^T (X^T X)^{-1} X^T (X\beta + \varepsilon)\big] \\
&= x_0^T\beta - x_0^T\beta - E\big[x_0^T (X^T X)^{-1} X^T \varepsilon\big] \\
&= -x_0^T (X^T X)^{-1} X^T E[\varepsilon] = 0
\end{aligned}$$

Variance:

$$\begin{aligned}
E\big[(\hat f(x_0) - E[\hat f(x_0)])^2\big] &= E\big[(\hat f(x_0) - f(x_0))^2\big] \\
&= E\big[\big(x_0^T (X^T X)^{-1} X^T (X\beta + \varepsilon) - x_0^T\beta\big)^2\big] \\
&= E\big[\big(x_0^T (X^T X)^{-1} X^T \varepsilon\big)^2\big] \\
&= \sigma^2\, x_0^T (X^T X)^{-1} x_0 \;\approx\; \sigma^2 (p/N)
\end{aligned}$$

(the last step averages over query points $x_0$ drawn like the training inputs).

Decomposition of EPE:

$$\begin{aligned}
EPE(x_0) &= E\big[(y_0 - \hat y(x_0))^2\big] \\
&= E[\varepsilon^2] + E\big[(f(x_0) - \hat f(x_0))^2\big] \\
&= \sigma^2 + \big(f(x_0) - E[\hat f(x_0)]\big)^2 + E\big[(\hat f(x_0) - E[\hat f(x_0)])^2\big]
\end{aligned}$$

For the linear least squares fit: irreducible error $= \sigma^2$, squared bias $= 0$, variance $= \sigma^2 (p/N)$.
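A quick Monte Carlo check of this decomposition can be instructive. The sketch below (our illustration, with assumed parameters) refits least squares on many fresh noise draws and compares the empirical bias and variance at a query point $x_0$ with the formulas above:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 200, 5, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)
x0 = np.concatenate([[1.0], rng.normal(size=p)])

preds = []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=N)      # fresh training noise
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares refit
    preds.append(x0 @ beta_hat)
preds = np.array(preds)

print("bias          :", x0 @ beta - preds.mean())              # ~ 0
print("variance      :", preds.var())                           # empirical
print("exact variance:", sigma**2 * (x0 @ np.linalg.solve(X.T @ X, x0)))
print("sigma^2 * p/N :", sigma**2 * p / N)                      # average-case
```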
Gauss-Markov Theorem
Gauss-Markov Theorem: the least squares estimate has the minimum variance among all linear unbiased estimators.

Interpretation:

• The estimator found by least squares is linear in y:

$$\hat f(x_0) = x_0^T (X^T X)^{-1} X^T y = c_0^T y$$

• We have noticed that this estimator is unbiased, i.e., $f(x_0) = E[\hat f(x_0)]$.

• If we find any other unbiased estimator $g(x_0)$ of $f(x_0)$ that is also linear in y, i.e., $g(x_0) = c^T y$ and $f(x_0) = E[g(x_0)]$, then

$$Var[\hat f(x_0)] \le Var[g(x_0)].$$

Question: Is the LS estimator the best one for the given linear additive model?
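To make the theorem concrete, the following sketch (our illustration under assumed parameters, not from the slides) compares least squares against another linear unbiased estimator, namely least squares computed on only half of the training rows; Gauss-Markov predicts the latter has larger variance:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 100, 3, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
x0 = np.concatenate([[1.0], rng.normal(size=p)])

full, half = [], []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=N)
    full.append(x0 @ np.linalg.solve(X.T @ X, X.T @ y))
    Xh, yh = X[: N // 2], y[: N // 2]   # competing linear unbiased estimator
    half.append(x0 @ np.linalg.solve(Xh.T @ Xh, Xh.T @ yh))

print("full-data LS variance:", np.var(full))   # smaller
print("half-data LS variance:", np.var(half))   # larger, as the theorem predicts
```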
Subset Selection
• The LS solution often has large variance (recall that the variance is proportional to the number of inputs p, i.e., to the model complexity)
• If we decrease the number of input variables p, we can decrease the variance; however, we then sacrifice the zero bias
• If this trade-off decreases the test error, the solution can be accepted
• This reasoning leads to subset selection, i.e., selecting a subset of the p inputs for the regression computation
• Subset selection has another advantage: easy and focused interpretation of the influence of the input variables on the output
Subset Selection…
$$\hat Y = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$$

Can we determine which $\beta_j$'s are insignificant?

Yes, we can, by statistical hypothesis testing! However, we need a model assumption:

$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon,$$

where $\varepsilon$ is zero-mean Gaussian with standard deviation $\sigma$.
Subset Selection: Statistical Significance Test
The linear model with additive Gaussian noise has the following property (Ex. Show this.):

$$\hat\beta \sim N\big(\beta, (X^T X)^{-1}\sigma^2\big)$$

So we can form a standardized coefficient, or Z-score, test for each coefficient:

$$z_j = \frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}},$$

where

$$\hat\sigma^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat y_i)^2$$

and $v_j$ is the $j$th diagonal element of $(X^T X)^{-1}$.

The hypothesis testing principle says that a large Z-score should retain the coefficient, while a small one should discard it. How large or small depends on the significance level.
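A minimal sketch of this test in NumPy (illustrative code, not from the slides), assuming X already contains the intercept column:

```python
import numpy as np

def z_scores(X, y):
    """Z-score z_j = beta_hat_j / (sigma_hat * sqrt(v_j)) per coefficient.

    X is the N x (p+1) design matrix including the intercept column.
    """
    N, d = X.shape                         # d = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)       # we need its diagonal, so invert
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - d)   # RSS / (N - p - 1)
    v = np.diag(XtX_inv)                   # v_j, jth diagonal element
    return beta_hat / np.sqrt(sigma2_hat * v)
```

Coefficients with $|z_j|$ above the critical value for the chosen significance level (about 2 at the 5% level) are retained.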
Case Study: Prostate Cancer
Output = log prostate-specific antigen
Input = (log cancer volume, log prostate weight, age, log of benign prostatic hyperplasia, seminal vesicle invasion, log of capsular penetration, Gleason score, % of Gleason scores 4 or 5)
Goal: (1) predict the output given a novel input; (2) interpret the influence of the inputs on the output
Case Study…
Scatter plot of the data: it is hard to tell from pairwise plots which inputs are most influential. Also, we want to find out how the inputs jointly influence the output.
Subset Selection on Prostate Cancer Data
| Term | Coefficient | Std. Error | Z-score |
| --- | --- | --- | --- |
| Intercept | 2.48 | 0.09 | 27.66 |
| Lcavol | 0.68 | 0.13 | 5.37 |
| Lweight | 0.30 | 0.11 | 2.75 |
| Age | -0.14 | 0.10 | -1.40 |
| Lbph | 0.21 | 0.10 | 2.06 |
| Svi | 0.31 | 0.12 | 2.47 |
| Lcp | -0.29 | 0.15 | -1.87 |
| Gleason | -0.02 | 0.15 | -0.15 |
| Pgg45 | 0.27 | 0.15 | 1.74 |
Z-scores with magnitude greater than 2 indicate variables that are significant at the 5% significance level (the two-sided 5% critical value of the standard normal is about 1.96).
Coefficient Shrinkage: Ridge Regression Method
$$\hat\beta^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\},$$

where $\lambda \ge 0$ is a non-negative penalty parameter. The solution is

$$\hat\beta^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$

One computational advantage is that the matrix $X^T X + \lambda I$ is always invertible for $\lambda > 0$.

If the L2 norm penalty is replaced by an L1 norm, the corresponding regression is called the LASSO (see [HTF]).
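A minimal sketch of the ridge formula above (illustrative, not from the slides):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution (X^T X + lam I)^{-1} X^T y, as on the slide.

    Caveat: written this way the intercept is penalized along with the
    other coefficients; in practice the inputs are usually centered so
    that the intercept can be estimated separately.
    """
    d = X.shape[1]
    # X^T X + lam * I is positive definite for lam > 0, so solve always works.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Setting lam = 0 recovers the ordinary least squares solution.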
Ridge Regression…
[Figure: profiles of the ridge coefficients plotted against decreasing λ; larger λ shrinks the coefficients toward zero.]
One way to determine λ is cross-validation; we'll learn about it later.
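As a preview, here is a schematic sketch of K-fold cross-validation for choosing λ (our illustration only; function and variable names are made up, and the procedure is covered properly later):

```python
import numpy as np

def cv_ridge_lambda(X, y, lambdas, K=5, seed=0):
    """Pick lambda by K-fold cross-validation (schematic sketch)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    cv_errors = []
    for lam in lambdas:
        err = 0.0
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            # ridge fit on the training folds only
            b = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(X.shape[1]),
                                X[train].T @ y[train])
            err += np.mean((y[test] - X[test] @ b) ** 2)
        cv_errors.append(err / K)          # average held-out squared error
    return lambdas[int(np.argmin(cv_errors))]
```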