Linear Regression - Penn Engineering (cis520/lectures/3_regression.pdf)
[Page 1]
Administrivia
• Piazza!!!
• Panopto is down; I'll start manually uploading lecture recordings; see Piazza for now
• HW0, HW1
• Office hours are happening
  • wiki: "people/office hours"
• Social hours on gather.town
  • after class, and Asian hours
[Page 2]
Survey results
• Do the quizzes and surveys!
• TAs: Mute people who aren't speaking
• TAs: preface your chats with "TA"
• The course moves fast
• Don't like students interrupting with questions
• What do we need to know in the readings?
• Worked examples?
• Why is this math useful?
[Page 3]
Linear Regression
Lyle Ungar
Learning objectives
• Be able to derive MLE & MAP regression and the associated loss functions
• Recognize scale invariance
[Page 4]
• MLE estimates
A) argmaxθ p(θ|D)   B) argmaxθ p(D|θ)   C) argmaxθ p(D|θ)p(θ)   D) None of the above
[Page 5]
• MAP estimates
A) argmaxθ p(θ|D)   B) argmaxθ p(D|θ)   C) argmaxθ p(D|θ)p(θ)   D) None of the above
[Page 6]
MLE vs MAP
• What will we use?
A) MLE   B) MAP
• Why?
[Page 7]
Consistent estimator
• A consistent estimator (or asymptotically consistent estimator) is an estimator — a rule for computing estimates of a parameter θ — having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to the true parameter θ.
https://en.wikipedia.org/wiki/Consistent_estimator
[Page 8]
Which is consistent for our coin-flipping example?
A) MLE   B) MAP   C) Both   D) Neither

MLE maximizes P(D|θ); MAP maximizes P(θ|D) ∝ P(D|θ)P(θ)
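A quick simulation makes the answer concrete. This sketch (the coin bias 0.7 and the Beta(5,5) prior are chosen for illustration, not taken from the lecture) shows both the MLE and the MAP estimate converging to the true bias as the number of flips grows:

```python
import random

random.seed(0)
theta_true = 0.7          # true probability of heads (illustrative)
a, b = 5.0, 5.0           # Beta(a, b) prior pseudo-counts used by the MAP estimate

for n in (10, 100, 10_000):
    heads = sum(random.random() < theta_true for _ in range(n))
    mle = heads / n                           # argmax_theta p(D|theta)
    map_ = (heads + a - 1) / (n + a + b - 2)  # argmax_theta p(D|theta)p(theta)
    print(n, round(mle, 3), round(map_, 3))
```

As n grows, the prior's pull on the MAP estimate vanishes, so both estimators are consistent here.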
[Page 9]
Copyright © 2001, 2003, Andrew W. Moore
An introduction to regression
Mostly by Andrew W. Moore, but with many modifications by Lyle Ungar

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
[Page 10]
Two interpretations of regression
• Linear regression
  • ŷ = w·x
• Probabilistic/Bayesian (MLE and MAP)
  • y(x) ~ N(w·x, σ²)
  • MLE: argmax_w p(D|w), here: argmax_w p(y|w,X)
  • MAP: argmax_w p(D|w)p(w)
• Error minimization
  • ||y − Xw||_p^p + λ||w||_q^q
[Page 11]
Single-Parameter Linear Regression
[Page 12]
Linear Regression

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear in x.
Simplest case: ŷ(x) = wx for some unknown w (x, w scalars).
Given the data, we can estimate w.

inputs      outputs
x1 = 1      y1 = 1
x2 = 3      y2 = 2.2
x3 = 2      y3 = 2
x4 = 1.5    y4 = 1.9
x5 = 4      y5 = 3.1
[Page 13]
One parameter linear regression
Assume that the data is formed by
  yi = w xi + noisei
where…
• noisei is independent N(0, σ²)
• variance σ² is unknown
y(x) then has a normal distribution with
• mean wx
• variance σ²
[Page 14]
Bayesian Linear Regression
p(y|w,x) is Normal (mean wx, variance σ²): y ~ N(wx, σ²)
We have data (x1,y1), (x2,y2), …, (xn,yn) and we want to estimate w from the data.
  p(w | x1, x2, …, xn, y1, y2, …, yn) = p(w|D)
• You can use Bayes rule to find a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.
[Page 15]
Maximum likelihood estimation of w
MLE asks: "For which value of w is this data most likely to have happened?"
⇔ For what w is p(y1, y2, …, yn | w, x1, x2, …, xn) maximized?
⇔ For what w is ∏_{i=1}^n p(yi | w, xi) maximized?
[Page 16]
For what w is ∏_{i=1}^n p(yi | w, xi) maximized?
For what w is ∏_{i=1}^n exp(−½ ((yi − w xi)/σ)²) maximized?
For what w is Σ_{i=1}^n −½ ((yi − w xi)/σ)² maximized?
For what w is Σ_{i=1}^n (yi − w xi)² minimized?
[Page 17]
Result: MLE = L2 error
• MLE with Gaussian noise is the same as minimizing the L2 error:
  argmin_w Σ_{i=1}^n (yi − w xi)²
[Page 18]
Linear Regression
The maximum likelihood w is the one that minimizes the sum of squares of the residuals ri = yi − w xi:
  E(w) = Σ_i (yi − w xi)² = Σ_i yi² − 2w Σ_i xi yi + w² Σ_i xi²
We want to minimize a quadratic function of w.
(figure: the parabola E(w) plotted against w)
[Page 19]
Linear Regression
The sum of squares is minimized when
  w = Σ_i xi yi / Σ_i xi²
The maximum likelihood model is ŷ(x) = wx. We can use it for prediction.
Note: In Bayesian stats you'd have ended up with a probability distribution p(w) over w, and predictions would have given a probability distribution of expected output.
Often useful to know your confidence. Max likelihood can give some kinds of confidence, too.
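Plugging the five data points from the earlier slide into this formula gives a concrete estimate (a small sketch, not part of the original deck):

```python
xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

# MLE for the no-intercept model y = wx:  w = sum(xi*yi) / sum(xi^2)
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(round(w, 4))          # ≈ 0.8326

def y_hat(x):
    # the maximum likelihood model, usable for prediction
    return w * x
```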
[Page 20]
But what about MAP?
• MLE: argmax_w ∏_{i=1}^n p(yi | w, xi)
• MAP: argmax_w ∏_{i=1}^n p(yi | w, xi) p(w)
[Page 21]
But what about MAP?
• MAP: argmax_w ∏_{i=1}^n p(yi | w, xi) p(w)
• We assumed
  • yi ~ N(w xi, σ²)
• Now add a prior assumption that
  • w ~ N(0, γ²)
[Page 22]
For what w is ∏_{i=1}^n p(yi | w, xi) p(w) maximized?
For what w is ∏_{i=1}^n exp(−½ ((yi − w xi)/σ)²) · exp(−½ (w/γ)²) maximized?
For what w is −½ Σ_{i=1}^n ((yi − w xi)/σ)² − ½ (w/γ)² maximized?
For what w is Σ_{i=1}^n (yi − w xi)² + (σw/γ)² minimized?
[Page 23]
Ridge Regression is MAP
• MAP with a Gaussian prior on w is the same as minimizing the L2 error plus an L2 penalty on w:
  argmin_w Σ_{i=1}^n (yi − w xi)² + λw²,  where λ = σ²/γ²
• This is called
  • Ridge regression
  • Shrinkage
  • Tikhonov regularization
[Page 24]
Ridge Regression (MAP)
• w = xᵀy / (xᵀx + λ) = (xᵀx + λ)⁻¹ xᵀy
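A sketch of this scalar ridge formula on the same five points used earlier; the value λ = 2 is an arbitrary illustration, not from the slides:

```python
xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

sxy = sum(x * y for x, y in zip(xs, ys))   # x'y
sxx = sum(x * x for x in xs)               # x'x

w_mle   = sxy / sxx          # lambda = 0 recovers the MLE
w_ridge = sxy / (sxx + 2.0)  # lambda = sigma^2/gamma^2; 2.0 is an arbitrary choice

print(round(w_mle, 4), round(w_ridge, 4))  # the prior shrinks w toward 0
```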
[Page 25]
Multivariate Linear Regression
[Page 26]
Multivariate Regression
What if the inputs are vectors?
Dataset has the form:
  x1 y1
  x2 y2
  x3 y3
  ⋮
  xn yn
(figure: a 2-d input example with axes x1 and x2)
[Page 27]
Multivariate Regression
Write matrix X and vector y thus:

      [ ..... x1 ..... ]   [ x11 x12 ... x1p ]        [ y1 ]
  X = [ ..... x2 ..... ] = [ x21 x22 ... x2p ]    y = [ y2 ]
      [       ⋮        ]   [        ⋮        ]        [  ⋮ ]
      [ ..... xn ..... ]   [ xn1 xn2 ... xnp ]        [ yn ]

(n data points; each input has p features)
The linear regression model assumes ŷ(x) = x·w = w1 x1 + w2 x2 + … + wp xp
[Page 28]
Multivariate Regression
The maximum likelihood estimate (MLE) is
  w = (XᵀX)⁻¹(Xᵀy)
where XᵀX is p×p and Xᵀy is p×1.
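A minimal numpy sketch of this estimate via the normal equations; the data here is synthetic, generated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                 # synthetic design matrix
w_true = np.array([1.0, -2.0, 0.5])         # assumed "true" weights
y = X @ w_true + 0.1 * rng.normal(size=n)

# MLE / OLS: w = (X'X)^{-1} X'y; solve() is preferred over an explicit inverse
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                # close to w_true
```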
[Page 29]
Multivariate Regression
The MAP estimate is
  w = (XᵀX + λI)⁻¹(Xᵀy)
where XᵀX is p×p, I is the p×p identity matrix, and Xᵀy is p×1.
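The same kind of sketch for the MAP/ridge estimate (synthetic data; λ = 10 is an illustrative choice), showing the shrinkage relative to the MLE:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

lam = 10.0                                            # illustrative penalty
w_ols   = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# the Gaussian prior shrinks the MAP estimate toward zero
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```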
[Page 30]
What about a constant term?
Linear data usually does not go through the origin.
Statisticians and Neural Net folks all agree on a simple obvious hack.
Can you guess??
[Page 31]
The constant term
• The trick: create a fake input "x0" that is always 1

Before:              After:
X1  X2 | Y           X0  X1  X2 | Y
 2   4 | 16           1   2   4 | 16
 3   4 | 17           1   3   4 | 17
 5   5 | 20           1   5   5 | 20

Before: Y = w1X1 + w2X2 …has to be a poor model
After:  Y = w0X0 + w1X1 + w2X2 = w0 + w1X1 + w2X2 …has a fine constant term
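The trick in code, using the three rows from the table above (a sketch, not the lecture's own code):

```python
import numpy as np

# the three training rows from the table
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
y = np.array([16.0, 17.0, 20.0])

X0 = np.hstack([np.ones((3, 1)), X])       # fake input x0 = 1 gives the constant term
w = np.linalg.solve(X0.T @ X0, X0.T @ y)   # normal equations
print(w)                                   # [w0 w1 w2] ≈ [10. 1. 1.]
```

With three points and three parameters the fit is exact here: Y = 10 + X1 + X2 reproduces the table.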
[Page 32]
L1 regression
OLS = L2 regression minimizes:
  p(y|w,x) ∝ exp(−||y − Xw||_2² / 2σ²)  ⇒  argmin_w ||y − Xw||_2²
L1 regression:
  p(y|w,x) ∝ exp(−||y − Xw||_1 / 2σ²)  ⇒  argmin_w ||y − Xw||_1
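A small illustration of how the two noise models behave differently (synthetic data with one planted outlier, scalar no-intercept model, grid search — all my choices for illustration):

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = 2.0 * xs
ys[-1] = 30.0                       # plant one gross outlier

ws = np.linspace(0.0, 8.0, 8001)    # grid search over the scalar w
w_l2 = ws[np.argmin([np.sum((ys - w * xs) ** 2) for w in ws])]
w_l1 = ws[np.argmin([np.sum(np.abs(ys - w * xs)) for w in ws])]

print(w_l1, w_l2)   # L1 stays near the clean slope 2; L2 is dragged by the outlier
```

The heavier tails of the Laplace noise model make the L1 fit far less sensitive to the outlier.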
[Page 33]
Scale Invariance
• Rescaling does not affect decision trees or OLS
  • They are scale invariant
• Rescaling does affect Ridge regression
  • Because it preferentially shrinks 'large' coefficients
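A sketch demonstrating the claim (synthetic data; λ = 5 and the ×100 rescaling are arbitrary choices): rescaling a feature leaves OLS predictions unchanged, but changes what ridge predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=40)

def ridge_fit(X, y, lam):
    # MAP/ridge estimate; lam = 0 recovers OLS
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

Xs = X.copy()
Xs[:, 0] *= 100.0                       # rescale the first feature

w_ols, w_ols_s = ridge_fit(X, y, 0.0), ridge_fit(Xs, y, 0.0)
w_ridge, w_ridge_s = ridge_fit(X, y, 5.0), ridge_fit(Xs, y, 5.0)

# OLS: the coefficient simply rescales, so predictions are unchanged
print(w_ols[0], 100 * w_ols_s[0])
# Ridge: the rescaled fit is NOT the rescaled original fit
print(w_ridge[0], 100 * w_ridge_s[0])
```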
[Page 34]
Linear Regression with varying noise
Heteroscedasticity...
[Page 35]
Regression with varying noise
• Suppose you know the variance of the noise that was added to each datapoint.

  xi   yi   σi²
  ½    ½    4
  1    1    1
  2    1    ¼
  2    3    4
  3    2    ¼

(figure: the data points plotted with error bars of size σi)
Assume yi ~ N(w xi, σi²). What's the MLE estimate of w?
[Page 36]
MLE estimation with varying noise
argmax_w log p(y1, …, yR | x1, …, xR, σ1², …, σR², w)
  (assuming independence among the noise terms and then plugging in the equation for the Gaussian and simplifying)
= argmin_w Σ_{i=1}^R (yi − w xi)² / σi²
  (setting dLL/dw equal to zero)
= the w such that Σ_{i=1}^R xi(yi − w xi) / σi² = 0
  (trivial algebra)
= (Σ_{i=1}^R xi yi / σi²) / (Σ_{i=1}^R xi² / σi²)

Note that "R" here is what we call "n".
[Page 37]
This is Weighted Regression
• We are minimizing the weighted sum of squares
  argmin_w Σ_{i=1}^R (yi − w xi)² / σi²
where the weight for the i'th datapoint is 1/σi²
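Plugging the five (x, y, σ²) rows shown two slides back into the closed form (reading the flattened table that way is my assumption):

```python
# weighted MLE for yi ~ N(w xi, sigma_i^2), rows are (x, y, sigma^2)
data = [(0.5, 0.5, 4.0), (1.0, 1.0, 1.0), (2.0, 1.0, 0.25),
        (2.0, 3.0, 4.0), (3.0, 2.0, 0.25)]

# w = sum(x*y/s2) / sum(x*x/s2): each point is weighted by 1/sigma_i^2
num = sum(x * y / s2 for x, y, s2 in data)
den = sum(x * x / s2 for x, y, s2 in data)
w = num / den
print(round(w, 4))
```

The low-noise points (σ² = ¼) dominate the estimate, as the weighting intends.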
[Page 38]
Nonlinear Regression
[Page 39]
Nonlinear Regression
• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

  xi   yi
  ½    ½
  1    2.5
  2    3
  3    2
  3    3

Assume yi ~ N(√(w + xi), σ²). What's the MLE estimate of w?
[Page 40]
Nonlinear MLE estimation
argmax_w log p(y1, …, yR | x1, …, xR, σ, w)
  (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
= argmin_w Σ_{i=1}^R (yi − √(w + xi))²
  (setting dLL/dw equal to zero)
= the w such that Σ_{i=1}^R (yi − √(w + xi)) / √(w + xi) = 0
[Page 41]
Nonlinear MLE estimation
argmax_w log p(y1, …, yR | x1, …, xR, σ, w)
  (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
= argmin_w Σ_{i=1}^R (yi − √(w + xi))²
  (setting dLL/dw equal to zero)
= the w such that Σ_{i=1}^R (yi − √(w + xi)) / √(w + xi) = 0

We're down the algebraic toilet. So guess what we do?
[Page 42]
Nonlinear MLE estimation
argmax_w log p(y1, …, yR | x1, …, xR, σ, w)
  (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
= argmin_w Σ_{i=1}^R (yi − √(w + xi))²
  (setting dLL/dw equal to zero)
= the w such that Σ_{i=1}^R (yi − √(w + xi)) / √(w + xi) = 0

We're down the algebraic toilet. So guess what we do?

Common (but not only) approach: numerical solutions
• Line search
• Simulated annealing
• Gradient descent
• Conjugate gradient
• Levenberg-Marquardt
• Newton's method
Also, special-purpose statistical optimization-specific tricks such as EM
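With no closed form, a numerical solution is the way out. A grid-search sketch on the slide's five data points, under the model assumed above (yi ~ N(√(w + xi), σ²)); grid search is the crudest of the listed methods but makes the idea concrete:

```python
import numpy as np

# data from the slide
xs = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
ys = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def sse(w):
    # MLE <=> minimizing the sum of squared residuals for this model
    return np.sum((ys - np.sqrt(w + xs)) ** 2)

# no closed form, so fall back on a numerical solution (here: grid search)
ws = np.linspace(0.0, 10.0, 10001)
w_hat = ws[np.argmin([sse(w) for w in ws])]
print(round(float(w_hat), 3))
```

Any of the fancier methods on the list (gradient descent, Newton, Levenberg-Marquardt) would find the same minimizer faster.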
[Page 43]
Scale invariance
• A method is scale invariant if multiplying any feature (any column of X) by a nonzero constant does not change the predictions made (after retraining)

  y   X             y   X
  1   0.2  3  10    1   0.2  3  5
  0   0.3  4  10    0   0.3  4  5
  0   0.1  4  8     0   0.1  4  4
  1   0.2  2  4     1   0.2  2  2

Linear regression is scale invariant?
Ridge regression is scale invariant?
K-NN is scale invariant?
[Page 44]
What we have seen
• MLE with Gaussian noise minimizes the L2 error
  • Other noise models will give other loss functions
• MAP with a Gaussian prior gives Ridge regression
  • Other priors will give different penalties
• Both are consistent
• Linear regression is scale invariant; Ridge is not
• One can
  • do nonlinear regression
  • make nonlinear relations linear by transforming the features

  ||y − Xw||_p^p + λ||w||_q^q
[Page 45]
[Page 46]
[Page 47]
Gather.town
• https://gather.town/aQMGI0l1R8DP0Ovv/penn-cis