Assessing Performance - University of Washington · 2017-01-30
CSE 446: Machine Learning
Assessing Performance
Emily Fox, University of Washington
January 11, 2017
©2017 Emily Fox
Make predictions, get $, right??
Model + algorithm → fitted function f
Predictions → decisions → outcome
Or, how much am I losing?
Example: Lost $ due to inaccurate listing price
- Too low → low offers
- Too high → few lookers + no/low offers
How much am I losing compared to perfection?
Perfect predictions: Loss = 0
My predictions: Loss = ???
Measuring loss
Loss function: L(y, f_ŵ(x)) = cost of using ŵ at x when y is true
Here y is the actual value and ŷ = f_ŵ(x) is the predicted value.
Examples (assuming the loss for underpredicting equals the loss for overpredicting):
- Absolute error: L(y, f_ŵ(x)) = |y − f_ŵ(x)|
- Squared error: L(y, f_ŵ(x)) = (y − f_ŵ(x))²
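As a minimal sketch of the two example losses (function names are my own, and the house prices are invented for illustration):

```python
# Two example loss functions from the slide; names are my own.
def absolute_error(y, y_hat):
    """L(y, y_hat) = |y - y_hat|"""
    return abs(y - y_hat)

def squared_error(y, y_hat):
    """L(y, y_hat) = (y - y_hat)**2; penalizes large errors much more."""
    return (y - y_hat) ** 2

# A $500,000 house predicted at $470,000:
print(absolute_error(500_000, 470_000))  # 30000
print(squared_error(500_000, 470_000))   # 900000000
```

Both losses are symmetric, matching the slide's assumption that underpredicting costs the same as overpredicting.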
“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” (George Box, 1987)
Assessing the loss
Assessing the loss, Part 1: Training error
Define training data
[Figure: training data points, price ($) vs. square feet (sq.ft.); inputs x, outputs y]
Example: Fit quadratic to minimize RSS
[Figure: quadratic fit to training data, price ($) vs. square feet (sq.ft.)]
ŵ minimizes RSS of the training data.
Compute training error
1. Define a loss function L(y, f_ŵ(x)), e.g., squared error, absolute error, …
2. Training error = avg. loss on houses in the training set
   = (1/N) Σᵢ L(yᵢ, f_ŵ(xᵢ)), where f_ŵ was fit using the training data.
Example: Use squared error loss (y − f_ŵ(x))²
[Figure: quadratic fit with residuals marked, price ($) vs. square feet (sq.ft.)]
Training error(ŵ) = 1/N ×
[ ($_train 1 − f_ŵ(sq.ft._train 1))²
+ ($_train 2 − f_ŵ(sq.ft._train 2))²
+ ($_train 3 − f_ŵ(sq.ft._train 3))²
+ … include all training houses ]
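This recipe can be sketched with NumPy on a made-up five-house dataset (the numbers are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical training data: (sq.ft., price) pairs.
sqft  = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([300e3, 420e3, 500e3, 550e3, 580e3])

# Fit a quadratic by least squares: w-hat minimizes RSS on training data.
f_hat = np.poly1d(np.polyfit(sqft, price, deg=2))

# Training error = average squared-error loss over the N training houses.
training_error = np.mean((price - f_hat(sqft)) ** 2)
print(training_error)
```

Because the quadratic nests the linear model, its training RSS can only be lower or equal, which previews why training error keeps falling as complexity grows.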
Training error vs. model complexity
[Figure: training error decreases as model complexity increases, from a constant fit through higher-order polynomial fits of price ($) vs. square feet (sq.ft.)]
Is training error a good measure of predictive performance?
How do we expect to perform on a new house?
[Figure: high-order polynomial fit to training data, price ($) vs. square feet (sq.ft.)]
Is there something particularly bad about having xt sq.ft.??
[Figure: the wiggly fit makes an anomalous prediction at a new input xt]
Issue: Training error is overly optimistic because ŵ was fit to the training data.
A small training error does not imply good predictions unless the training data includes everything you might ever see.
Assessing the loss, Part 2: Generalization (true) error
Generalization error
Really want an estimate of the loss over all possible (house, $) pairs: there are lots of houses in the neighborhood, but they are not in the dataset.
Distribution over houses
In our neighborhood, houses of what # sq.ft. are we likely to see?
[Figure: distribution over square feet (sq.ft.)]
Distribution over sales prices
For houses with a given # sq.ft., what house prices $ are we likely to see?
[Figure: distribution over price ($) for fixed # sq.ft.]
Generalization error definition
Really want an estimate of the loss over all possible (house, $) pairs. Formally:
generalization error = E_{x,y}[L(y, f_ŵ(x))]
where f_ŵ was fit using training data, and the expectation averages over all possible (x, y) pairs, weighted by how likely each is.
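Since generalization error is an expectation over the (x, y) distribution, it can be approximated by Monte Carlo when we control the data-generating process. A sketch with an assumed synthetic "truth" (all functions and constants here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Assumed "true" price function (purely illustrative).
    return 100e3 + 200.0 * x

def sample_houses(n):
    x = rng.uniform(500, 3500, n)            # distribution over sq.ft.
    y = f_true(x) + rng.normal(0, 30e3, n)   # prices given sq.ft.
    return x, y

# Fit a line on a small training sample.
x_tr, y_tr = sample_houses(30)
f_hat = np.poly1d(np.polyfit(x_tr, y_tr, deg=1))

# E_{x,y}[L(y, f_hat(x))] approximated by a huge fresh sample.
x_new, y_new = sample_houses(100_000)
gen_error = np.mean((y_new - f_hat(x_new)) ** 2)
print(gen_error)
```

With squared error, this estimate cannot fall much below the noise variance of the simulated prices, which previews the irreducible error discussed later.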
Generalization error vs. model complexity
[Figure: as model complexity increases, the fitted f_ŵ goes from underfitting to overfitting the price ($) vs. square feet (sq.ft.) data, and generalization error first decreases, then increases]
Can't compute generalization error!
Assessing the loss, Part 3: Test error
Approximating generalization error
Wanted an estimate of the loss over all possible (house, $) pairs. Approximate it by looking at houses not in the training set.
Forming a test set
Hold out some (house, $) pairs that are not used for fitting the model:
Training set | Test set
The test set is a proxy for “everything you might see.”
Compute test error
Test error = avg. loss on houses in the test set
= (1/N_test) Σᵢ L(yᵢ, f_ŵ(xᵢ)), summing over the N_test test points,
where f_ŵ was fit using training data and has never seen the test data!
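The hold-out recipe can be sketched as follows (synthetic data with illustrative constants of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic houses: price = linear trend + noise.
x = rng.uniform(500, 3500, 40)
y = 100e3 + 200.0 * x + rng.normal(0, 30e3, 40)

# Hold out 10 houses that are NOT used for fitting.
idx = rng.permutation(40)
train, test = idx[:30], idx[30:]

# Fit only on the training set.
f_hat = np.poly1d(np.polyfit(x[train], y[train], deg=2))

# Test error = avg. loss on held-out houses f_hat has never seen.
test_error = np.mean((y[test] - f_hat(x[test])) ** 2)
print(test_error)
```

Shuffling before splitting matters: it keeps the held-out houses representative of the same distribution as the training houses.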
Example: As before, fit quadratic to training data
[Figure: quadratic fit, price ($) vs. square feet (sq.ft.)]
ŵ minimizes RSS of the training data.
Example: As before, use squared error loss (y − f_ŵ(x))²
[Figure: test points and their residuals from the quadratic fit, price ($) vs. square feet (sq.ft.)]
Test error(ŵ) = 1/N_test ×
[ ($_test 1 − f_ŵ(sq.ft._test 1))²
+ ($_test 2 − f_ŵ(sq.ft._test 2))²
+ ($_test 3 − f_ŵ(sq.ft._test 3))²
+ … include all test houses ]
Training, true, & test error vs. model complexity
[Figure: training error decreases with model complexity, while true and test error eventually increase]
Overfitting if: training error is small but true (generalization) error is large.
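This overfitting pattern can be reproduced in a small simulation (synthetic sine-wave data and degree choices are my own, not from the lecture): training error keeps falling as polynomial degree grows, while held-out error eventually blows up.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_tr, y_tr = sample(15)     # small training set
x_te, y_te = sample(1000)   # large held-out set

degrees = [1, 3, 9, 13]
tr_err, te_err = [], []
for deg in degrees:
    f = np.poly1d(np.polyfit(x_tr, y_tr, deg))
    tr_err.append(np.mean((y_tr - f(x_tr)) ** 2))
    te_err.append(np.mean((y_te - f(x_te)) ** 2))
    print(deg, tr_err[-1], te_err[-1])
```

At degree 13 with only 15 training points, the fit nearly interpolates the training data (tiny training error) yet is wildly wrong on held-out points, exactly the gap the slide warns about.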
Training/test split
Training/test splits
How many points should go in the training set vs. the test set?
Too few training points → ŵ is poorly estimated.
Too few test points → test error is a bad approximation of generalization error.
Typically, keep just enough test points to form a reasonable estimate of generalization error. If this leaves too few points for training, other methods like cross validation can help (we will see these later…).
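Ahead of the formal treatment later in the course, a minimal K-fold sketch shows the idea: every point is held out exactly once, so no single small test set is wasted (the data and fold count here are my own illustrative choices).

```python
import numpy as np

def kfold_error(x, y, deg, k=5):
    """Average held-out squared error over k folds: a preview of
    cross validation, covered properly later in the course."""
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for i in range(k):
        held = folds[i]                 # this fold plays "test set"
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        f = np.poly1d(np.polyfit(x[tr], y[tr], deg))
        errs.append(np.mean((y[held] - f(x[held])) ** 2))
    return float(np.mean(errs))

# Illustrative synthetic data.
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 50)
print(kfold_error(x, y, deg=3))
```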
3 sources of error + the bias-variance tradeoff
3 sources of error
In forming predictions, there are 3 sources of error:
1. Noise
2. Bias
3. Variance
Data inherently noisy
[Figure: noisy price ($) vs. square feet (sq.ft.) data scattered around the true relationship]
yᵢ = f_{w(true)}(xᵢ) + εᵢ
The variance of the noise ε is irreducible error.
Bias contribution
Assume we fit a constant function.
[Figure: two constant fits, f_ŵ(train1) from N house sales and f_ŵ(train2) from N other house sales, price ($) vs. square feet (sq.ft.)]
Bias contribution
Over all possible size-N training sets, what do I expect my fit to be?
[Figure: the true function f_{w(true)}, the average fit f_w̄, and specific fits f_ŵ(train1), f_ŵ(train2), f_ŵ(train3)]
Bias contribution
bias(x) = f_{w(true)}(x) − f_w̄(x)
[Figure: the gap between f_{w(true)} and the average constant fit f_w̄]
Low complexity → high bias.
Is our approach flexible enough to capture f_{w(true)}? If not, there is error in the predictions.
Variance contribution
How much do specific fits vary from the expected fit?
[Figure: specific constant fits f_ŵ(train1), f_ŵ(train2), f_ŵ(train3) clustered around the average fit f_w̄]
Low complexity → low variance.
Can specific fits vary widely? If so, the predictions are erratic.
Variance of high-complexity models
Assume we fit a high-order polynomial.
[Figure: wildly different high-order fits f_ŵ(train1), f_ŵ(train2), f_ŵ(train3) around the average fit f_w̄, price ($) vs. square feet (sq.ft.)]
High complexity → high variance.
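This can be checked empirically by refitting on many independently drawn training sets and measuring the spread of predictions at a single target point (the synthetic setup and degrees are my own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
xt = 0.5  # target input

def fit_and_predict(deg, n=20):
    # Draw a fresh training set of size n, fit, and predict at xt.
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return np.poly1d(np.polyfit(x, y, deg))(xt)

# Predictions at xt across 200 independent training sets.
low  = [fit_and_predict(deg=0) for _ in range(200)]  # constant fit
high = [fit_and_predict(deg=9) for _ in range(200)]  # high-order poly

print(np.var(low), np.var(high))
```

The constant fit barely moves between training sets, while the degree-9 predictions at xt swing widely: exactly the low-variance vs. high-variance contrast on the slides.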
Bias of high-complexity models
[Figure: the average high-order fit f_w̄ tracks f_{w(true)} closely]
High complexity → low bias.
Bias-variance tradeoff
[Figure: as model complexity increases, bias decreases while variance increases; the best models balance the two at an intermediate complexity]
Error vs. amount of data
[Figure: error vs. # data points in the training set]
Summary of assessing performance
What you can do now…
• Describe what a loss function is and give examples
• Contrast training, generalization, and test error
• Compute training and test error given a loss function
• Discuss issue of assessing performance on training set
• Describe tradeoffs in forming training/test splits
• List and interpret the 3 sources of avg. prediction error: irreducible error, bias, and variance
More in depth on the 3 sources of error…
Accounting for training set randomness
The training set was just a random sample of N houses sold. What if N other houses had been sold and recorded?
[Figure: two different fits, f_ŵ(1) and f_ŵ(2), from two different samples of N house sales, price ($) vs. square feet (sq.ft.)]
Each fit has its own generalization error: the generalization error of ŵ(1) differs from that of ŵ(2).
Ideally, we want performance averaged over all possible training sets of size N.
Expected prediction error
E_{training set}[generalization error of ŵ(training set)]
The outer expectation averages over all training sets (weighted by how likely each is); ŵ(training set) denotes the parameters fit on a specific training set.
Prediction error at target input
Start by considering:
1. Loss at a target xt (e.g., 2640 sq.ft.)
2. Squared error loss L(y, f_ŵ(x)) = (y − f_ŵ(x))²
Sum of 3 sources of error
Average prediction error at xt
= σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
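The decomposition can be verified numerically: simulate many training sets, then compare the directly measured average prediction error at xt against σ² + bias² + variance (the synthetic setup and all constants are my own):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, xt, n = 0.3, 0.5, 20

def f_true(x):
    return np.sin(2 * np.pi * x)

preds, sq_errs = [], []
for _ in range(2000):
    # Fresh size-n training set; fit a quadratic.
    x = rng.uniform(0, 1, n)
    y = f_true(x) + rng.normal(0, sigma, n)
    pred = np.poly1d(np.polyfit(x, y, 2))(xt)
    preds.append(pred)
    # Fresh noisy observation yt at xt for the error side.
    yt = f_true(xt) + rng.normal(0, sigma)
    sq_errs.append((yt - pred) ** 2)

preds = np.array(preds)
bias2 = (f_true(xt) - preds.mean()) ** 2   # [bias(f_hat(xt))]^2
var = preds.var()                          # var(f_hat(xt))
lhs = np.mean(sq_errs)                     # avg. prediction error at xt
rhs = sigma**2 + bias2 + var               # noise + bias^2 + variance
print(lhs, rhs)
```

Up to Monte Carlo noise, the two printed numbers should nearly agree, which is exactly the identity on the slide.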
Error variance of the model
Average prediction error at xt = σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
In y = f_{w(true)}(x) + ε, σ² is the “variance” of the noise ε: irreducible error.
Bias of function estimator
Average prediction error at xt = σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
Average estimated function: f_w̄(x) = E_train[f_ŵ(train)(x)], averaging over all training sets of size N. True function: f_w(x).
bias(f_ŵ(xt)) = f_w(xt) − f_w̄(xt)
[Figure: fits f_ŵ(train1), f_ŵ(train2) from different training sets, their average f_w̄, and the true f_w, price ($) vs. square feet (sq.ft.)]
Variance of function estimator
Average prediction error at xt = σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
var(f_ŵ(xt)) = E_train[(f_ŵ(train)(xt) − f_w̄(xt))²]
This is the expected squared deviation, at xt, of the fit on a specific training dataset, f_ŵ(train), from the expected fit over all training sets of size N, f_w̄ (what I expect to learn over all training sets).
[Figure: spread of specific fits f_ŵ(train1), f_ŵ(train2), f_ŵ(train3) around f_w̄ at xt, price ($) vs. square feet (sq.ft.)]
Why 3 sources of error? A formal derivation
Deriving expected prediction error
Expected prediction error
= E_train[generalization error of ŵ(train)]
= E_train[E_{x,y}[L(y, f_ŵ(train)(x))]]
1. Look at a specific xt
2. Consider L(y, f_ŵ(x)) = (y − f_ŵ(x))²
Expected prediction error at xt
= E_{train, yt}[(yt − f_ŵ(train)(xt))²]
Deriving expected prediction error
Expected prediction error at xt
= E_{train, yt}[(yt − f_ŵ(train)(xt))²]
= E_{train, yt}[((yt − f_{w(true)}(xt)) + (f_{w(true)}(xt) − f_ŵ(train)(xt)))²]
Expanding the square, the cross term vanishes because the noise yt − f_{w(true)}(xt) = εt has mean zero and is independent of the training set, leaving
= σ² + MSE[f_ŵ(train)(xt)]
Equating MSE with bias and variance
MSE[f_ŵ(train)(xt)] = E_train[(f_{w(true)}(xt) − f_ŵ(train)(xt))²]
= E_train[((f_{w(true)}(xt) − f_w̄(xt)) + (f_w̄(xt) − f_ŵ(train)(xt)))²]
Again the cross term vanishes, since E_train[f_ŵ(train)(xt)] = f_w̄(xt), leaving
= [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
Putting it all together
Expected prediction error at xt
= σ² + MSE[f_ŵ(xt)]
= σ² + [bias(f_ŵ(xt))]² + var(f_ŵ(xt))
These are the 3 sources of error.