2014-12-17 · Review of Lecture 8 • Bias and variance • Expected value of E_out...
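Review of Lecture 8
• Bias and variance
Expected value of $E_{\text{out}}$ w.r.t. $\mathcal{D}$ = bias + var
$g^{(\mathcal{D})}(\mathbf{x}) \to \bar{g}(\mathbf{x}) \to f(\mathbf{x})$
[Figure: two diagrams of the target $f$ and the hypothesis set $\mathcal{H}$, one labeled bias and one labeled var]
• Learning curves
How $E_{\text{in}}$ and $E_{\text{out}}$ vary with $N$
B-V: [Figure: expected error vs. number of data points $N$, showing $E_{\text{in}}$ and $E_{\text{out}}$, with the error split into bias and variance]
VC: [Figure: expected error vs. number of data points $N$, showing $E_{\text{in}}$ and $E_{\text{out}}$, with the error split into in-sample error and generalization error]
• $N \propto$ "VC dimension"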
Learning From Data
Yaser S. Abu-Mostafa
California Institute of Technology
Lecture 9: The Linear Model II
Sponsored by Caltech's Provost Office, E&AS Division, and IST • Tuesday, May 1, 2012
Where we are
• Linear classification ✓
• Linear regression ✓
• Logistic regression ?
• Nonlinear transforms ≈✓
© AML Creator: Yaser Abu-Mostafa - LFD Lecture 9
Nonlinear transforms
$$\mathbf{x} = (x_0, x_1, \dots, x_d) \;\xrightarrow{\;\Phi\;}\; \mathbf{z} = (z_0, z_1, \dots, z_{\tilde d})$$
Each $z_i = \phi_i(\mathbf{x})$, that is, $\mathbf{z} = \Phi(\mathbf{x})$
Example: $\mathbf{z} = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)$
Final hypothesis $g(\mathbf{x})$ in $\mathcal{X}$ space:
$\operatorname{sign}\big(\tilde{\mathbf{w}}^{\mathsf T}\Phi(\mathbf{x})\big)$ or $\tilde{\mathbf{w}}^{\mathsf T}\Phi(\mathbf{x})$
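As a rough illustration of the transform on this slide, the sketch below applies the second-order example $\Phi$ to a two-dimensional input and evaluates the classification hypothesis $\operatorname{sign}(\tilde{\mathbf{w}}^{\mathsf T}\Phi(\mathbf{x}))$. The function names, the convention $\operatorname{sign}(0) = -1$, and the weight vector `w_tilde` are assumptions chosen for the example; they are not taken from the slides.

```python
def phi(x1, x2):
    """Second-order transform from the slide: z = (1, x1, x2, x1*x2, x1^2, x2^2)."""
    return [1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2]


def sign(s):
    # Assumed convention for this sketch: sign(0) is treated as -1.
    return 1 if s > 0 else -1


def g(x1, x2, w_tilde):
    """Hypothesis in X space: sign(w_tilde^T Phi(x))."""
    z = phi(x1, x2)
    return sign(sum(w * zi for w, zi in zip(w_tilde, z)))


# Hypothetical weights encoding the boundary 0.6 - x1^2 - x2^2 = 0 (a circle).
w_tilde = [0.6, 0.0, 0.0, 0.0, -1.0, -1.0]

inside = g(0.1, 0.2, w_tilde)    # point well inside the circle
outside = g(1.0, 1.0, w_tilde)   # point outside the circle
```

The hypothesis is linear in $\mathbf{z}$ even though its boundary in $\mathcal{X}$ is a circle, which is the point of working in the transformed space.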
The price we pay
$$\mathbf{x} = (x_0, x_1, \dots, x_d) \;\xrightarrow{\;\Phi\;}\; \mathbf{z} = (z_0, z_1, \dots, z_{\tilde d})$$
In $\mathcal{X}$ space: weights $\mathbf{w}$, with $d_{\text{VC}} = d + 1$
In $\mathcal{Z}$ space: weights $\tilde{\mathbf{w}}$, with $d_{\text{VC}} \le \tilde d + 1$
Two non-separable cases
[Figure: data sets that are not linearly separable]
First case
[Figure: the first non-separable data set]
Use a linear model in $\mathcal{X}$; accept $E_{\text{in}} > 0$
or
Insist on $E_{\text{in}} = 0$; go to high-dimensional $\mathcal{Z}$
Second case
[Figure: the second non-separable data set]
$\mathbf{z} = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)$
Why not: $\mathbf{z} = (1, x_1^2, x_2^2)$
or better yet: $\mathbf{z} = (1, x_1^2 + x_2^2)$
or even: $\mathbf{z} = (x_1^2 + x_2^2 - 0.6)$
Lesson learned
Looking at the data before choosing the model can be hazardous to your $E_{\text{out}}$
Data snooping
Logistic regression - Outline
• The model
• Error measure
• Learning algorithm
A third linear model
$$s = \sum_{i=0}^{d} w_i x_i$$
linear classification: $h(\mathbf{x}) = \operatorname{sign}(s)$
linear regression: $h(\mathbf{x}) = s$
logistic regression: $h(\mathbf{x}) = \theta(s)$
[Figure: three diagrams, one per model, each combining the inputs $x_0, x_1, x_2, \dots, x_d$ into the signal $s$ and producing $h(\mathbf{x})$]
The logistic function $\theta$
The formula:
$$\theta(s) = \frac{e^{s}}{1 + e^{s}}$$
[Figure: plot of $\theta(s)$ against $s$, rising from 0 to 1]
soft threshold: uncertainty
sigmoid: flattened out 's'
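A minimal sketch of the logistic function, using only the Python standard library. The branch on the sign of `s` is an implementation choice made here to avoid floating-point overflow for large $|s|$; both branches are algebraically equal to the formula on the slide.

```python
import math


def theta(s):
    """Logistic function theta(s) = e^s / (1 + e^s), evaluated without overflow."""
    if s >= 0:
        # Equivalent form 1 / (1 + e^{-s}); e^{-s} <= 1 here.
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)  # e^s <= 1 here
    return e / (1.0 + e)


# Soft threshold: values move smoothly from near 0 to near 1.
values = [theta(s) for s in (-4.0, 0.0, 4.0)]
```

Evaluating `theta(s) + theta(-s)` for any `s` should give 1 up to rounding, which is the symmetry the likelihood derivation relies on.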
Probability interpretation
$h(\mathbf{x}) = \theta(s)$ is interpreted as a probability
Example. Prediction of heart attacks
Input $\mathbf{x}$: cholesterol level, age, weight, etc.
$\theta(s)$: probability of a heart attack
The signal $s = \mathbf{w}^{\mathsf T}\mathbf{x}$: "risk score"
Genuine probability
Data $(\mathbf{x}, y)$ with binary $y$, generated by a noisy target:
$$P(y \mid \mathbf{x}) = \begin{cases} f(\mathbf{x}) & \text{for } y = +1; \\ 1 - f(\mathbf{x}) & \text{for } y = -1. \end{cases}$$
The target $f : \mathbb{R}^d \to [0, 1]$ is the probability
Learn $g(\mathbf{x}) = \theta(\mathbf{w}^{\mathsf T}\mathbf{x}) \approx f(\mathbf{x})$
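To make the noisy target concrete, the sketch below draws labels $y \in \{+1, -1\}$ with $P(y = +1 \mid \mathbf{x}) = f(\mathbf{x})$. The particular target (a logistic function of a linear signal with weights `w_true`), the input `x`, and the sample size are hypothetical choices for illustration; the slides only require that $f$ map into $[0, 1]$.

```python
import math
import random


def f(x, w_true):
    """Hypothetical target probability: logistic function of a linear signal."""
    s = sum(wi * xi for wi, xi in zip(w_true, x))
    return 1.0 / (1.0 + math.exp(-s))


def sample_label(x, w_true, rng):
    """Draw y = +1 with probability f(x), otherwise y = -1."""
    return 1 if rng.random() < f(x, w_true) else -1


rng = random.Random(0)   # seeded so the run is repeatable
w_true = [0.0, 2.0]      # assumed weights, including the bias weight
x = [1.0, 0.5]           # x0 = 1 is the constant coordinate

labels = [sample_label(x, w_true, rng) for _ in range(10000)]
frac_pos = labels.count(1) / len(labels)   # empirical estimate of f(x)
```

Only the labels are observed in practice; the learner never sees $f(\mathbf{x})$ itself, which is why the error measure has to be built from the likelihood of the labels.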
Error measure
For each $(\mathbf{x}, y)$, $y$ is generated by probability $f(\mathbf{x})$
Plausible error measure based on likelihood:
If $h = f$, how likely to get $y$ from $\mathbf{x}$?
$$P(y \mid \mathbf{x}) = \begin{cases} h(\mathbf{x}) & \text{for } y = +1; \\ 1 - h(\mathbf{x}) & \text{for } y = -1. \end{cases}$$
Formula for likelihood
$$P(y \mid \mathbf{x}) = \begin{cases} h(\mathbf{x}) & \text{for } y = +1; \\ 1 - h(\mathbf{x}) & \text{for } y = -1. \end{cases}$$
Substitute $h(\mathbf{x}) = \theta(\mathbf{w}^{\mathsf T}\mathbf{x})$, noting $\theta(-s) = 1 - \theta(s)$
[Figure: plot of $\theta(s)$ against $s$, rising from 0 to 1]
$$P(y \mid \mathbf{x}) = \theta(y\,\mathbf{w}^{\mathsf T}\mathbf{x})$$
Likelihood of $\mathcal{D} = (\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ is
$$\prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n) = \prod_{n=1}^{N} \theta(y_n \mathbf{w}^{\mathsf T}\mathbf{x}_n)$$
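The substitution step can be spelled out as follows. This is a worked version of the identity $\theta(-s) = 1 - \theta(s)$ and of the two label cases, using the formula $\theta(s) = e^{s}/(1 + e^{s})$ given earlier in the lecture.

```latex
\begin{align*}
\theta(-s) &= \frac{e^{-s}}{1 + e^{-s}} = \frac{1}{1 + e^{s}}
            = 1 - \frac{e^{s}}{1 + e^{s}} = 1 - \theta(s) \\
y = +1:\quad P(y \mid \mathbf{x}) &= h(\mathbf{x}) = \theta(\mathbf{w}^{\mathsf T}\mathbf{x})
            = \theta(y\,\mathbf{w}^{\mathsf T}\mathbf{x}) \\
y = -1:\quad P(y \mid \mathbf{x}) &= 1 - \theta(\mathbf{w}^{\mathsf T}\mathbf{x})
            = \theta(-\mathbf{w}^{\mathsf T}\mathbf{x}) = \theta(y\,\mathbf{w}^{\mathsf T}\mathbf{x})
\end{align*}
```

Both cases collapse to the single expression $\theta(y\,\mathbf{w}^{\mathsf T}\mathbf{x})$, which is what makes the product form of the likelihood so compact.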
Maximizing the likelihood
Minimize
$$-\frac{1}{N} \ln\left( \prod_{n=1}^{N} \theta(y_n \mathbf{w}^{\mathsf T}\mathbf{x}_n) \right) = \frac{1}{N} \sum_{n=1}^{N} \ln\left( \frac{1}{\theta(y_n \mathbf{w}^{\mathsf T}\mathbf{x}_n)} \right) \qquad \left[ \theta(s) = \frac{1}{1 + e^{-s}} \right]$$
$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \underbrace{\ln\left( 1 + e^{-y_n \mathbf{w}^{\mathsf T}\mathbf{x}_n} \right)}_{e(h(\mathbf{x}_n),\, y_n)}$$
"cross-entropy" error
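The in-sample error above can be computed directly. The sketch below is one way to do it in plain Python; the rearrangement $\ln(1 + e^{-s}) = \max(-s, 0) + \ln(1 + e^{-|s|})$ is a numerical-stability choice made here, and the toy data `X`, `Y` are made up for illustration.

```python
import math


def pointwise_error(w, x, y):
    """e(h(x), y) = ln(1 + exp(-y * w^T x)), computed without overflow."""
    s = y * sum(wi * xi for wi, xi in zip(w, x))
    # ln(1 + e^{-s}) = max(-s, 0) + ln(1 + e^{-|s|})
    return max(-s, 0.0) + math.log1p(math.exp(-abs(s)))


def e_in(w, X, Y):
    """Cross-entropy in-sample error: average of the pointwise errors."""
    return sum(pointwise_error(w, x, y) for x, y in zip(X, Y)) / len(X)


# Hypothetical toy data; the first coordinate x0 = 1 is the constant term.
X = [[1.0, 2.0], [1.0, -1.0]]
Y = [1, -1]

zero_weights_error = e_in([0.0, 0.0], X, Y)   # equals ln 2 for any data set
better_error = e_in([0.0, 1.0], X, Y)         # weights that agree with both labels
```

With all-zero weights every point has $\theta = 1/2$, so the error is $\ln 2$ regardless of the data; weights whose signal agrees in sign with the labels push the error below that.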
Logistic regression - Outline
• The model
• Error measure
• Learning algorithm
How to minimize $E_{\text{in}}$
For logistic regression,
$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln\left( 1 + e^{-y_n \mathbf{w}^{\mathsf T}\mathbf{x}_n} \right) \quad\longleftarrow\ \text{iterative solution}$$
Compare to linear regression:
$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbf{w}^{\mathsf T}\mathbf{x}_n - y_n \right)^2 \quad\longleftarrow\ \text{closed-form solution}$$
Iterative method: gradient descent
[Figure: in-sample error $E_{\text{in}}(\mathbf{w})$ plotted against the weights $\mathbf{w}$]
General method for nonlinear optimization
Start at $\mathbf{w}(0)$; take a step along steepest slope
Fixed step size: $\mathbf{w}(1) = \mathbf{w}(0) + \eta\,\hat{\mathbf{v}}$
What is the direction $\hat{\mathbf{v}}$?
Formula for the direction $\hat{\mathbf{v}}$
$$\begin{aligned} \Delta E_{\text{in}} &= E_{\text{in}}(\mathbf{w}(0) + \eta\,\hat{\mathbf{v}}) - E_{\text{in}}(\mathbf{w}(0)) \\ &= \eta\,\nabla E_{\text{in}}(\mathbf{w}(0))^{\mathsf T}\hat{\mathbf{v}} + O(\eta^2) \\ &\ge -\eta\,\|\nabla E_{\text{in}}(\mathbf{w}(0))\| \end{aligned}$$
Since $\hat{\mathbf{v}}$ is a unit vector,
$$\hat{\mathbf{v}} = -\frac{\nabla E_{\text{in}}(\mathbf{w}(0))}{\|\nabla E_{\text{in}}(\mathbf{w}(0))\|}$$
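The inequality in this derivation is the Cauchy-Schwarz bound applied to the first-order term, with the $O(\eta^2)$ term neglected. A short worked version, under the slide's assumption that $\|\hat{\mathbf{v}}\| = 1$:

```latex
\begin{align*}
\nabla E_{\text{in}}(\mathbf{w}(0))^{\mathsf T}\hat{\mathbf{v}}
  &\ge -\|\nabla E_{\text{in}}(\mathbf{w}(0))\|\,\|\hat{\mathbf{v}}\|
   = -\|\nabla E_{\text{in}}(\mathbf{w}(0))\| \\
\text{with equality when}\quad
\hat{\mathbf{v}} &= -\frac{\nabla E_{\text{in}}(\mathbf{w}(0))}{\|\nabla E_{\text{in}}(\mathbf{w}(0))\|}
\end{align*}
```

So among all unit vectors, the negative normalized gradient achieves the most negative first-order change in $E_{\text{in}}$.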
Fixed-size step?
How $\eta$ affects the algorithm:
[Figure: three plots of in-sample error $E_{\text{in}}$ against the weights $\mathbf{w}$. Panel 1: $\eta$ too small. Panel 2: $\eta$ too large. Panel 3: variable $\eta$, just right, with large $\eta$ where the slope is steep and small $\eta$ near the minimum]
$\eta$ should increase with the slope
Easy implementation
Instead of
$$\Delta \mathbf{w} = \eta\,\hat{\mathbf{v}} = -\eta\,\frac{\nabla E_{\text{in}}(\mathbf{w}(0))}{\|\nabla E_{\text{in}}(\mathbf{w}(0))\|}$$
Have
$$\Delta \mathbf{w} = -\eta\,\nabla E_{\text{in}}(\mathbf{w}(0))$$
Fixed learning rate $\eta$
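A small experiment can show how the choice of a fixed learning rate plays out. The sketch below runs the update $\Delta w = -\eta\,\nabla E$ on a one-dimensional stand-in error $E(w) = w^2$; this quadratic, the starting point, the step count, and the three values of $\eta$ are assumptions picked for illustration and do not come from the slides.

```python
def descend(eta, w0=1.0, steps=20):
    """Fixed-learning-rate gradient descent on E(w) = w^2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w   # delta_w = -eta * gradient
    return w


too_small = descend(0.01)    # moves toward the minimum at 0, but slowly
reasonable = descend(0.4)    # reaches the neighbourhood of 0 quickly
too_large = descend(1.1)     # overshoots on every step and moves away from 0
```

Because the step is proportional to the gradient, it shrinks automatically near the minimum, which is how a fixed learning rate can still give larger steps where the slope is steeper.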
Logistic regression algorithm
1: Initialize the weights at $t = 0$ to $\mathbf{w}(0)$
2: for $t = 0, 1, 2, \dots$ do
3: Compute the gradient
$$\nabla E_{\text{in}} = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_n \mathbf{x}_n}{1 + e^{\,y_n \mathbf{w}^{\mathsf T}(t)\,\mathbf{x}_n}}$$
4: Update the weights: $\mathbf{w}(t + 1) = \mathbf{w}(t) - \eta\,\nabla E_{\text{in}}$
5: Iterate to the next step until it is time to stop
6: Return the final weights $\mathbf{w}$
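The six steps above can be sketched in plain Python as follows. Several details are assumptions made for this example rather than part of the slides: the zero initialization, the stopping rule (a small gradient norm or an iteration cap), the default learning rate, and the toy data set.

```python
import math


def gradient(w, X, Y):
    """grad E_in = -(1/N) * sum_n y_n x_n / (1 + exp(y_n w^T x_n))."""
    n_points = len(X)
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        s = y * sum(wi * xi for wi, xi in zip(w, x))
        # coeff = 1 / (1 + e^s), written so that exp never overflows
        if s < 0:
            coeff = 1.0 / (1.0 + math.exp(s))
        else:
            coeff = math.exp(-s) / (1.0 + math.exp(-s))
        for i, xi in enumerate(x):
            grad[i] -= y * xi * coeff / n_points
    return grad


def logistic_regression(X, Y, eta=0.1, max_iters=1000, tol=1e-6):
    w = [0.0] * len(X[0])                    # step 1: w(0) = 0 (assumed)
    for _ in range(max_iters):               # step 2
        grad = gradient(w, X, Y)             # step 3
        if math.sqrt(sum(g * g for g in grad)) < tol:
            break                            # step 5: assumed stopping rule
        w = [wi - eta * gi for wi, gi in zip(w, grad)]   # step 4
    return w                                 # step 6


# Hypothetical toy data with x0 = 1; the labels are not linearly separable,
# so the minimizer of E_in is finite.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0], [1.0, -0.5], [1.0, 0.5]]
Y = [-1, -1, 1, 1, 1, -1]
w = logistic_regression(X, Y, eta=0.5, max_iters=5000)
```

The learned weights can then be turned into a probability estimate for a new input by applying the logistic function to $\mathbf{w}^{\mathsf T}\mathbf{x}$.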
Summary of Linear Models
Application: credit analysis
Perceptron: approve or deny; classification error; PLA, Pocket, ...
Linear regression: amount of credit; squared error; pseudo-inverse
Logistic regression: probability of default; cross-entropy error; gradient descent