CS9840 Learning and Computer Vision, Prof. Olga Veksler (oveksler/Courses/Fall2013/...)
Many slides from Andrew Ng, Andrew Moore
CS9840 Learning and Computer Vision
Prof. Olga Veksler

Lecture 10: Validation and Cross-Validation
Outline
• Performance evaluation and model selection methods
  • validation
  • cross-validation
    • k-fold
    • leave-one-out
Regression

• In this lecture, it is convenient to show examples in the context of regression
• In regression, the labels yi are continuous
• Classification and regression are solved very similarly
• Everything we have done so far transfers to regression with very minor changes
• Error: sum of distances from the examples to the fitted model
[figure: data points (x, y) with a fitted regression curve]
Training/Test Data Split

• Talked about splitting data into training/test sets
  • training data is used to fit parameters
  • test data is used to assess how the classifier generalizes to new data
• What if the classifier has "non-tunable" parameters?
  • a parameter is "non-tunable" if tuning (or training) it on the training data leads to overfitting
• Examples:
  • k in kNN classifier
  • number of hidden units in MNN
  • number of hidden layers in MNN
  • etc.
Example of Overfitting

• Want to fit a polynomial learning machine f(x,w)
• Instead of fixing the polynomial degree, make it a parameter d: learning machine f(x,w,d)
• Consider just three choices for d:
  • degree 1
  • degree 2
  • degree 3

[figure: degree-1, degree-2, and degree-3 fits to the same data]

• Training error is a bad measure for choosing d
  • degree 3 is best according to the training error, but it overfits the data
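The effect can be sketched numerically (hypothetical data; NumPy's `polyfit`/`polyval` stand in for the learning machine f(x,w,d)):

```python
import numpy as np

# Hypothetical noisy data that is truly linear, so higher degrees can only overfit
rng = np.random.default_rng(0)
x = np.linspace(0, 8, 10)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

train_mse = {}
for d in (1, 2, 3):
    w = np.polyfit(x, y, deg=d)                      # fit f(x, w, d)
    train_mse[d] = float(np.mean((np.polyval(w, x) - y) ** 2))
    print(f"degree {d}: training MSE = {train_mse[d]:.3f}")
# training MSE can only decrease as d grows, so it cannot be used to choose d
```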
Training/Test Data Split

• What about the test error? Seems appropriate
  • degree 2 is the best model according to the test error
• Except, what do we report as the test error now?
• Test error should be computed on data that was not used for training at all
• But here we used the "test" data for training, i.e. for choosing the model
Validation Data

• The same question arises when choosing among several classifiers
  • our polynomial degree example can be viewed as choosing among 3 classifiers (degree 1, 2, or 3)
• Solution: split the labeled data into three parts
  • Training (60%): train tunable parameters w
  • Validation (20%): train other parameters, or select the classifier
  • Test (20%): use only to assess final performance
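A minimal sketch of the three-way split, assuming a NumPy-style dataset (the function name and the 60/20/20 fractions are illustrative):

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices, then carve off test and validation sets (60/20/20 by default)."""
    n = len(X)
    idx = np.random.default_rng(seed).permutation(n)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

X = np.arange(20.0).reshape(10, 2)   # 10 toy examples with 2 features
y = np.arange(10)
(X_tr, y_tr), (X_va, y_va), (X_te, y_te) = train_val_test_split(X, y)
print(len(y_tr), len(y_va), len(y_te))   # 6 2 2
```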
Training/Validation/Test Errors

• Training error: computed on training examples
• Validation error: computed on validation examples
• Test error: computed on test examples
Training/Validation/Test Data

• Training data: fit f(x,w,d) for d = 1, d = 2, d = 3
• Validation data: validation error 3.3 for d = 1, 1.8 for d = 2, 3.4 for d = 3
  • d = 2 is chosen
• Test data: test error 1.3 computed for d = 2
Choosing Parameters: Example

• Need to choose the number of hidden units for an MNN
• The more hidden units, the better we can fit the training data
• But at some point we overfit the data

[figure: training error and validation error vs. number of hidden units (up to 50); training error keeps decreasing while validation error eventually rises]
Diagnosing Underfitting/Overfitting

• Underfitting: large training error, large validation error
• Just right: small training error, small validation error
• Overfitting: small training error, large validation error
Fixing Underfitting/Overfitting

• Fixing underfitting:
  • getting more training examples will not help
  • get more features
  • try a more complex classifier
    • if using an MNN, try more hidden units
• Fixing overfitting:
  • getting more training examples might help
  • try a smaller set of features
  • try a less complex classifier
    • if using an MNN, try fewer hidden units
Train/Test/Validation Method

• Good news:
  • very simple
• Bad news:
  • wastes data
    • in general, the more data we have, the better the estimated parameters
    • we estimate parameters on 40% less data, since 20% is removed for test and 20% for validation
  • if we have a small dataset, our test (or validation) set might just be lucky or unlucky
• Cross-validation is a method for performance evaluation that wastes less data
Small Dataset

• Linear model: mean squared error = 2.4
• Quadratic model: mean squared error = 0.9
• Join-the-dots model: mean squared error = 2.2
LOOCV (Leave-one-out Cross Validation)

For k = 1 to n:
  1. Let (xk, yk) be the kth example
  2. Temporarily remove (xk, yk) from the dataset
  3. Train on the remaining n-1 examples
  4. Note your error on (xk, yk)
When you have done all n points, report the mean error.
![Page 20: CS9840 and Computer Vision Prof. Olga Veksleroveksler/Courses/Fall2013/... · Many slides from Andrew Ng, Andrew Moore CS9840 Learning and Computer Vision Prof. Olga Veksler Lecture](https://reader033.fdocuments.in/reader033/viewer/2022042407/5f21bf91e79a4b24ec058d9a/html5/thumbnails/20.jpg)
LOOCV for Linear Regression

[figure: nine linear fits, each trained with one example held out]

MSELOOCV = 2.12
LOOCV for Quadratic Regression

[figure: nine quadratic fits, each trained with one example held out]

MSELOOCV = 0.962
LOOCV for Join-the-Dots

[figure: nine join-the-dots fits, each trained with one example held out]

MSELOOCV = 3.33
Which Kind of Cross Validation?

| Method        | Downside                                           | Upside             |
|---------------|----------------------------------------------------|--------------------|
| Test-set      | may give unreliable estimate of future performance | cheap              |
| Leave-one-out | expensive                                          | doesn't waste data |

• Can we get the best of both worlds?
K-Fold Cross Validation

[figure: dataset randomly broken into k = 3 partitions, colored red, green, and blue]
K-Fold Cross Validation

• Randomly break the dataset into k partitions
  • in this example, k = 3 partitions, colored red, green, and blue
• For the blue partition: train on all points not in the blue partition; find the test-set sum of errors on the blue points
• For the green partition: train on all points not in the green partition; find the test-set sum of errors on the green points
• For the red partition: train on all points not in the red partition; find the test-set sum of errors on the red points
• Report the mean error

Linear regression: MSE3FOLD = 2.05
![Page 29: CS9840 and Computer Vision Prof. Olga Veksleroveksler/Courses/Fall2013/... · Many slides from Andrew Ng, Andrew Moore CS9840 Learning and Computer Vision Prof. Olga Veksler Lecture](https://reader033.fdocuments.in/reader033/viewer/2022042407/5f21bf91e79a4b24ec058d9a/html5/thumbnails/29.jpg)
Quadratic regression: MSE3FOLD = 1.11
Join-the-dots: MSE3FOLD = 2.93
Which Kind of Cross Validation?

| Method        | Downside                                                      | Upside                                                           |
|---------------|---------------------------------------------------------------|------------------------------------------------------------------|
| Test-set      | may give unreliable estimate of future performance            | cheap                                                            |
| Leave-one-out | expensive                                                     | doesn't waste data                                               |
| 10-fold       | wastes 10% of the data; 10 times more expensive than test set | only wastes 10%; only 10 times more expensive instead of n times |
| 3-fold        | wastes more data than 10-fold; more expensive than test set   | slightly better than test-set                                    |
| N-fold        | identical to leave-one-out                                    |                                                                  |
Cross-Validation for Classification

• Instead of computing the sum of squared errors on a test set, you should compute the total number of misclassifications on a test set

• What's the LOOCV error of 1-NN?
• What's the LOOCV error of 3-NN?
• What's the LOOCV error of 22-NN?
from Andrew Moore (CMU)
![Page 35: CS9840 and Computer Vision Prof. Olga Veksleroveksler/Courses/Fall2013/... · Many slides from Andrew Ng, Andrew Moore CS9840 Learning and Computer Vision Prof. Olga Veksler Lecture](https://reader033.fdocuments.in/reader033/viewer/2022042407/5f21bf91e79a4b24ec058d9a/html5/thumbnails/35.jpg)
Cross-Validation for Classification: Uses

• Choosing k for k-nearest neighbors
• Choosing kernel parameters for SVM
• Any other "free" parameter of a classifier
• Choosing which features to use
• Choosing which classifier to use
from Andrew Moore (CMU)
![Page 36: CS9840 and Computer Vision Prof. Olga Veksleroveksler/Courses/Fall2013/... · Many slides from Andrew Ng, Andrew Moore CS9840 Learning and Computer Vision Prof. Olga Veksler Lecture](https://reader033.fdocuments.in/reader033/viewer/2022042407/5f21bf91e79a4b24ec058d9a/html5/thumbnails/36.jpg)
CV‐based Model Selection• We’re trying to decide which algorithm to use.
• We train each machine and make a table…
fi Training Error 10‐FOLD‐CV Error Choice
f1f2f3 ◊
f4f5f6
![Page 37: CS9840 and Computer Vision Prof. Olga Veksleroveksler/Courses/Fall2013/... · Many slides from Andrew Ng, Andrew Moore CS9840 Learning and Computer Vision Prof. Olga Veksler Lecture](https://reader033.fdocuments.in/reader033/viewer/2022042407/5f21bf91e79a4b24ec058d9a/html5/thumbnails/37.jpg)
CV-Based Model Selection

• Example: choosing "k" for a k-nearest-neighbor regression
• Step 1: compute the LOOCV error for six different model classes:
• Step 2: choose the model that gave the best CV score
  • train it with all the data; that's the final model you'll use

| Algorithm | Training Error | 10-fold-CV Error | Choice |
|-----------|----------------|------------------|--------|
| k=1       |                |                  |        |
| k=2       |                |                  |        |
| k=3       |                |                  |        |
| k=4       |                |                  | ◊      |
| k=5       |                |                  |        |
| k=6       |                |                  |        |
![Page 38: CS9840 and Computer Vision Prof. Olga Veksleroveksler/Courses/Fall2013/... · Many slides from Andrew Ng, Andrew Moore CS9840 Learning and Computer Vision Prof. Olga Veksler Lecture](https://reader033.fdocuments.in/reader033/viewer/2022042407/5f21bf91e79a4b24ec058d9a/html5/thumbnails/38.jpg)
CV-Based Model Selection

• Why stop at k = 6?
  • No good reason, except it looked like things were getting worse as k was increasing
• Are we guaranteed that a local optimum of k vs. LOOCV will be the global optimum?
  • No; in fact the relationship can be very bumpy
• What should we do if we are depressed at the expense of doing LOOCV for k = 1 through 1000?
  • Try k = 1, 2, 4, 8, 16, 32, 64, ..., 1024
  • Then do hill-climbing from an initial guess at k
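The doubling-then-refine strategy can be sketched generically (here `score` is any function mapping k to a CV error; the quadratic example curve is purely illustrative):

```python
def coarse_to_fine(score, k_max=1024):
    """Evaluate score at k = 1, 2, 4, ..., k_max, then hill-climb from the best one."""
    ks = []
    k = 1
    while k <= k_max:
        ks.append(k)
        k *= 2
    best = min(ks, key=score)                 # coarse pass over powers of two
    while True:                               # refine: hill-climb one step at a time
        neighbors = [best - 1, best + 1]
        better = [n for n in neighbors if n >= 1 and score(n) < score(best)]
        if not better:
            return best
        best = min(better, key=score)

# Hypothetical bowl-shaped CV curve with its minimum at k = 23
print(coarse_to_fine(lambda k: (k - 23) ** 2))   # 23
```

Note the caveat from the slide still applies: hill-climbing only finds a local optimum, and the real k-vs-CV curve can be bumpy.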