Assessing Performance - University of Washington · 2017-01-30
CSE 446: Machine Learning
Assessing Performance
Emily Fox, University of Washington
January 11, 2017
©2017 Emily Fox
Make predictions, get $, right??
Model + algorithm → fitted function f
Predictions → decisions → outcome
Or, how much am I losing?
Example: Lost $ due to inaccurate listing price
- Too low → low offers
- Too high → few lookers + no/low offers
How much am I losing compared to perfection?
Perfect predictions: Loss = 0
My predictions: Loss = ???
Measuring loss
Loss function: L(y, f_ŵ(x)) = cost of using ŵ at x when y is true
Here y is the actual value and ŷ = f_ŵ(x) is the predicted value.
Examples (assuming the loss for underpredicting equals the loss for overpredicting):
- Absolute error: L(y, f_ŵ(x)) = |y − f_ŵ(x)|
- Squared error: L(y, f_ŵ(x)) = (y − f_ŵ(x))²
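As a minimal sketch of the two example losses (function names are my own, and the house prices are invented for illustration):

```python
# Two example loss functions from the slide; names are my own.
def absolute_error(y, y_hat):
    """L(y, y_hat) = |y - y_hat|"""
    return abs(y - y_hat)

def squared_error(y, y_hat):
    """L(y, y_hat) = (y - y_hat)**2; penalizes large errors much more."""
    return (y - y_hat) ** 2

# A $500,000 house predicted at $470,000:
print(absolute_error(500_000, 470_000))  # 30000
print(squared_error(500_000, 470_000))   # 900000000
```

Both losses are symmetric, matching the slide's assumption that underpredicting costs the same as overpredicting.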
“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” (George Box, 1987)
Assessing the loss
Assessing the loss, Part 1: Training error
Define training data
[Figure: training data points, price ($) vs. square feet (sq.ft.); inputs x, outputs y]
Example: Fit quadratic to minimize RSS
[Figure: quadratic fit to training data, price ($) vs. square feet (sq.ft.)]
ŵ minimizes RSS of the training data.
Compute training error
1. Define a loss function L(y, f_ŵ(x)), e.g., squared error, absolute error, …
2. Training error = avg. loss on houses in the training set
   = (1/N) Σᵢ L(yᵢ, f_ŵ(xᵢ)), where f_ŵ was fit using the training data.
Example: Use squared error loss (y − f_ŵ(x))²
[Figure: quadratic fit with residuals marked, price ($) vs. square feet (sq.ft.)]
Training error(ŵ) = 1/N ×
[ ($_train 1 − f_ŵ(sq.ft._train 1))²
+ ($_train 2 − f_ŵ(sq.ft._train 2))²
+ ($_train 3 − f_ŵ(sq.ft._train 3))²
+ … include all training houses ]
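This recipe can be sketched with NumPy on a made-up five-house dataset (the numbers are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical training data: (sq.ft., price) pairs.
sqft  = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([300e3, 420e3, 500e3, 550e3, 580e3])

# Fit a quadratic by least squares: w-hat minimizes RSS on training data.
f_hat = np.poly1d(np.polyfit(sqft, price, deg=2))

# Training error = average squared-error loss over the N training houses.
training_error = np.mean((price - f_hat(sqft)) ** 2)
print(training_error)
```

Because the quadratic nests the linear model, its training RSS can only be lower or equal, which previews why training error keeps falling as complexity grows.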
Training error vs. model complexity
[Figure: training error decreases as model complexity increases, from a constant fit through higher-order polynomial fits of price ($) vs. square feet (sq.ft.)]
Is training error a good measure of predictive performance?
How do we expect to perform on a new house?
[Figure: high-order polynomial fit to training data, price ($) vs. square feet (sq.ft.)]
Is there something particularly bad about having xt sq.ft.??
[Figure: the wiggly fit makes an anomalous prediction at a new input xt]
Issue: Training error is overly optimistic because ŵ was fit to the training data.
A small training error does not imply good predictions unless the training data includes everything you might ever see.
Assessing the loss, Part 2: Generalization (true) error
Generalization error
Really want an estimate of the loss over all possible (house, $) pairs: there are lots of houses in the neighborhood, but they are not in the dataset.
Distribution over houses
In our neighborhood, houses of what # sq.ft. are we likely to see?
[Figure: distribution over square feet (sq.ft.)]
Distribution over sales prices
For houses with a given # sq.ft., what house prices $ are we likely to see?
[Figure: distribution over price ($) for fixed # sq.ft.]
Generalization error definition
Really want an estimate of the loss over all possible (house, $) pairs. Formally:
generalization error = E_{x,y}[L(y, f_ŵ(x))]
where f_ŵ was fit using training data, and the expectation averages over all possible (x, y) pairs, weighted by how likely each is.
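Since generalization error is an expectation over the (x, y) distribution, it can be approximated by Monte Carlo when we control the data-generating process. A sketch with an assumed synthetic "truth" (all functions and constants here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Assumed "true" price function (purely illustrative).
    return 100e3 + 200.0 * x

def sample_houses(n):
    x = rng.uniform(500, 3500, n)            # distribution over sq.ft.
    y = f_true(x) + rng.normal(0, 30e3, n)   # prices given sq.ft.
    return x, y

# Fit a line on a small training sample.
x_tr, y_tr = sample_houses(30)
f_hat = np.poly1d(np.polyfit(x_tr, y_tr, deg=1))

# E_{x,y}[L(y, f_hat(x))] approximated by a huge fresh sample.
x_new, y_new = sample_houses(100_000)
gen_error = np.mean((y_new - f_hat(x_new)) ** 2)
print(gen_error)
```

With squared error, this estimate cannot fall much below the noise variance of the simulated prices, which previews the irreducible error discussed later.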
Generalization error vs. model complexity
[Figure: as model complexity increases, the fitted f_ŵ goes from underfitting to overfitting the price ($) vs. square feet (sq.ft.) data, and generalization error first decreases, then increases]
Can't compute generalization error!
Assessing the loss, Part 3: Test error
Approximating generalization error
Wanted an estimate of the loss over all possible (house, $) pairs. Approximate it by looking at houses not in the training set.
Forming a test set
Hold out some (house, $) pairs that are not used for fitting the model:
Training set | Test set
The test set is a proxy for “everything you might see.”
Compute test error
Test error = avg. loss on houses in the test set
= (1/N_test) Σᵢ L(yᵢ, f_ŵ(xᵢ)), summing over the N_test test points,
where f_ŵ was fit using training data and has never seen the test data!
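The hold-out recipe can be sketched as follows (synthetic data with illustrative constants of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic houses: price = linear trend + noise.
x = rng.uniform(500, 3500, 40)
y = 100e3 + 200.0 * x + rng.normal(0, 30e3, 40)

# Hold out 10 houses that are NOT used for fitting.
idx = rng.permutation(40)
train, test = idx[:30], idx[30:]

# Fit only on the training set.
f_hat = np.poly1d(np.polyfit(x[train], y[train], deg=2))

# Test error = avg. loss on held-out houses f_hat has never seen.
test_error = np.mean((y[test] - f_hat(x[test])) ** 2)
print(test_error)
```

Shuffling before splitting matters: it keeps the held-out houses representative of the same distribution as the training houses.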
Example: As before, fit quadratic to training data
[Figure: quadratic fit, price ($) vs. square feet (sq.ft.)]
ŵ minimizes RSS of the training data.
Example: As before, use squared error loss (y − f_ŵ(x))²
[Figure: test points and their residuals from the quadratic fit, price ($) vs. square feet (sq.ft.)]
Test error(ŵ) = 1/N_test ×
[ ($_test 1 − f_ŵ(sq.ft._test 1))²
+ ($_test 2 − f_ŵ(sq.ft._test 2))²
+ ($_test 3 − f_ŵ(sq.ft._test 3))²
+ … include all test houses ]
Training, true, & test error vs. model complexity
[Figure: training error decreases with model complexity, while true and test error eventually increase]
Overfitting if: training error is small but true (generalization) error is large.
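This overfitting pattern can be reproduced in a small simulation (synthetic sine-wave data and degree choices are my own, not from the lecture): training error keeps falling as polynomial degree grows, while held-out error eventually blows up.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_tr, y_tr = sample(15)     # small training set
x_te, y_te = sample(1000)   # large held-out set

degrees = [1, 3, 9, 13]
tr_err, te_err = [], []
for deg in degrees:
    f = np.poly1d(np.polyfit(x_tr, y_tr, deg))
    tr_err.append(np.mean((y_tr - f(x_tr)) ** 2))
    te_err.append(np.mean((y_te - f(x_te)) ** 2))
    print(deg, tr_err[-1], te_err[-1])
```

At degree 13 with only 15 training points, the fit nearly interpolates the training data (tiny training error) yet is wildly wrong on held-out points, exactly the gap the slide warns about.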
Training/test split
Training/test splits
How many points should go in the training set vs. the test set?
Too few training points → ŵ is poorly estimated.
Too few test points → test error is a bad approximation of generalization error.
Typically, keep just enough test points to form a reasonable estimate of generalization error. If this leaves too few points for training, other methods like cross validation can help (we will see these later…).
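Ahead of the formal treatment later in the course, a minimal K-fold sketch shows the idea: every point is held out exactly once, so no single small test set is wasted (the data and fold count here are my own illustrative choices).

```python
import numpy as np

def kfold_error(x, y, deg, k=5):
    """Average held-out squared error over k folds: a preview of
    cross validation, covered properly later in the course."""
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for i in range(k):
        held = folds[i]                 # this fold plays "test set"
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        f = np.poly1d(np.polyfit(x[tr], y[tr], deg))
        errs.append(np.mean((y[held] - f(x[held])) ** 2))
    return float(np.mean(errs))

# Illustrative synthetic data.
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 50)
print(kfold_error(x, y, deg=3))
```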
3 sources of error + the bias-variance tradeoff
3 sources of error
In forming predictions, there are 3 sources of error:
1. Noise
2. Bias
3. Variance
Data inherently noisy
[Figure: noisy price ($) vs. square feet (sq.ft.) data scattered around the true relationship]
yᵢ = f_{w(true)}(xᵢ) + εᵢ
The variance of the noise ε is irreducible error.
Bias contribution
Assume we fit a constant function.
[Figure: two constant fits, f_ŵ(train1) from N house sales and f_ŵ(train2) from N other house sales, price ($) vs. square feet (sq.ft.)]
Bias contribution
Over all possible size-N training sets, what do I expect my fit to be?
[Figure: the true function f_{w(true)}, the average fit f_w̄, and specific fits f_ŵ(train1), f_ŵ(train2), f_ŵ(train3)]
Bias contribution
bias(x) = f_{w(true)}(x) − f_w̄(x)
[Figure: the gap between f_{w(true)} and the average constant fit f_w̄]
Low complexity → high bias.
Is our approach flexible enough to capture f_{w(true)}? If not, there is error in the predictions.
Variance contribution
How much do specific fits vary from the expected fit?
[Figure: specific constant fits f_ŵ(train1), f_ŵ(train2), f_ŵ(train3) clustered around the average fit f_w̄]
Low complexity → low variance.
Can specific fits vary widely? If so, the predictions are erratic.
Variance of high-complexity models
Assume we fit a high-order polynomial.
[Figure: wildly different high-order fits f_ŵ(train1), f_ŵ(train2), f_ŵ(train3) around the average fit f_w̄, price ($) vs. square feet (sq.ft.)]
High complexity → high variance.
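This can be checked empirically by refitting on many independently drawn training sets and measuring the spread of predictions at a single target point (the synthetic setup and degrees are my own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
xt = 0.5  # target input

def fit_and_predict(deg, n=20):
    # Draw a fresh training set of size n, fit, and predict at xt.
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return np.poly1d(np.polyfit(x, y, deg))(xt)

# Predictions at xt across 200 independent training sets.
low  = [fit_and_predict(deg=0) for _ in range(200)]  # constant fit
high = [fit_and_predict(deg=9) for _ in range(200)]  # high-order poly

print(np.var(low), np.var(high))
```

The constant fit barely moves between training sets, while the degree-9 predictions at xt swing widely: exactly the low-variance vs. high-variance contrast on the slides.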
Bias of high-complexity models
[Figure: the average high-order fit f_w̄ tracks f_{w(true)} closely]
High complexity → low bias.
Bias-variance tradeoff
[Figure: as model complexity increases, bias decreases while variance increases; the best models balance the two at an intermediate complexity]
Error vs. amount of data
[Figure: error vs. # data points in the training set]
Summary of assessing performance
What you can do now…
• Describe what a loss function is and give examples
• Contrast training, generalization, and test error
• Compute training and test error given a loss function
• Discuss issue of assessing performance on training set
• Describe tradeoffs in forming training/test splits
• List and interpret the 3 sources of avg. prediction error: irreducible error, bias, and variance
More in depth on the 3 sources of error…
Accounting for training set randomness
The training set was just a random sample of N houses sold. What if N other houses had been sold and recorded?
[Figure: two different fits, f_ŵ(1) and f_ŵ(2), from two different samples of N house sales, price ($) vs. square feet (sq.ft.)]
Each fit has its own generalization error: the generalization error of ŵ(1) differs from that of ŵ(2).
Ideally, we want performance averaged over all possible training sets of size N.
Expected prediction error
E_{training set}[generalization error of ŵ(training set)]
The outer expectation averages over all training sets (weighted by how likely each is); ŵ(training set) denotes the parameters fit on a specific training set.
Prediction error at target input
Start by considering:
1. Loss at a target xt (e.g., 2640 sq.ft.)
2. Squared error loss L(y, f_ŵ(x)) = (y − f_ŵ(x))²
Sum of 3 sources of error
Average prediction error at xt
= σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
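The decomposition can be verified numerically: simulate many training sets, then compare the directly measured average prediction error at xt against σ² + bias² + variance (the synthetic setup and all constants are my own):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, xt, n = 0.3, 0.5, 20

def f_true(x):
    return np.sin(2 * np.pi * x)

preds, sq_errs = [], []
for _ in range(2000):
    # Fresh size-n training set; fit a quadratic.
    x = rng.uniform(0, 1, n)
    y = f_true(x) + rng.normal(0, sigma, n)
    pred = np.poly1d(np.polyfit(x, y, 2))(xt)
    preds.append(pred)
    # Fresh noisy observation yt at xt for the error side.
    yt = f_true(xt) + rng.normal(0, sigma)
    sq_errs.append((yt - pred) ** 2)

preds = np.array(preds)
bias2 = (f_true(xt) - preds.mean()) ** 2   # [bias(f_hat(xt))]^2
var = preds.var()                          # var(f_hat(xt))
lhs = np.mean(sq_errs)                     # avg. prediction error at xt
rhs = sigma**2 + bias2 + var               # noise + bias^2 + variance
print(lhs, rhs)
```

Up to Monte Carlo noise, the two printed numbers should nearly agree, which is exactly the identity on the slide.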
Error variance of the model
Average prediction error at xt = σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
In y = f_{w(true)}(x) + ε, σ² is the “variance” of the noise ε: irreducible error.
Bias of function estimator
Average prediction error at xt = σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
Average estimated function: f_w̄(x) = E_train[f_ŵ(train)(x)], averaging over all training sets of size N. True function: f_w(x).
bias(f_ŵ(xt)) = f_w(xt) − f_w̄(xt)
[Figure: fits f_ŵ(train1), f_ŵ(train2) from different training sets, their average f_w̄, and the true f_w, price ($) vs. square feet (sq.ft.)]
Variance of function estimator
Average prediction error at xt = σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
var(f_ŵ(xt)) = E_train[(f_ŵ(train)(xt) − f_w̄(xt))²]
This is the expected squared deviation, at xt, of the fit on a specific training dataset, f_ŵ(train), from the expected fit over all training sets of size N, f_w̄ (what I expect to learn over all training sets).
[Figure: spread of specific fits f_ŵ(train1), f_ŵ(train2), f_ŵ(train3) around f_w̄ at xt, price ($) vs. square feet (sq.ft.)]
Why 3 sources of error? A formal derivation
Deriving expected prediction error
Expected prediction error
= E_train[generalization error of ŵ(train)]
= E_train[E_{x,y}[L(y, f_ŵ(train)(x))]]
1. Look at a specific xt
2. Consider L(y, f_ŵ(x)) = (y − f_ŵ(x))²
Expected prediction error at xt
= E_{train, yt}[(yt − f_ŵ(train)(xt))²]
Deriving expected prediction error
Expected prediction error at xt
= E_{train, yt}[(yt − f_ŵ(train)(xt))²]
= E_{train, yt}[((yt − f_{w(true)}(xt)) + (f_{w(true)}(xt) − f_ŵ(train)(xt)))²]
Expanding the square, the cross term vanishes because the noise yt − f_{w(true)}(xt) = εt has mean zero and is independent of the training set, leaving
= σ² + MSE[f_ŵ(train)(xt)]
Equating MSE with bias and variance
MSE[f_ŵ(train)(xt)] = E_train[(f_{w(true)}(xt) − f_ŵ(train)(xt))²]
= E_train[((f_{w(true)}(xt) − f_w̄(xt)) + (f_w̄(xt) − f_ŵ(train)(xt)))²]
Again the cross term vanishes, since E_train[f_ŵ(train)(xt)] = f_w̄(xt), leaving
= [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
Putting it all together
Expected prediction error at xt
= σ² + MSE[f_ŵ(xt)]
= σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
These are the 3 sources of error.