Page 1

CIS 732 / 830: Machine Learning / Advanced Topics in AI
Lecture 19 of 42
Wednesday, 05 March 2008

William H. Hsu
Department of Computing and Information Sciences, KSU
KSOL course pages: http://snurl.com/1ydii / http://snipurl.com/1y5ih
Course web site: http://www.kddresearch.org/Courses/Spring-2008/CIS732
Instructor home page: http://www.cis.ksu.edu/~bhsu

Reading:
Today: Section 6.15, Han & Kamber 2e
Friday: Sections 7.1 – 7.3, Han & Kamber 2e

Model Selection and Performance Measurement: Confidence and ROC Curves

Page 2

Eric-Jan Wagenmakers

Practical Methods for Model Selection:

Cross-Validation, Bootstrap, Prequential

Page 3

Problem

How to compare rival models that differ in complexity (e.g., the number and functional form of parameters)?

Model X?

Model Y?

Model Z?

Page 4

Solution?

Use model selection methods such as:
  Minimum Description Length (MDL)
  Bayesian Model Selection (BMS)
These methods have a sound theoretical basis.

"I like MDL!" "I like BMS!"

Page 5

New Problems!

What if it is impossible to calculate what MDL/BMS require?

What if your favorite model does not have a likelihood function?

"I like MDL!" "I like BMS!"

Page 6

Aim of this Presentation

Stress importance of prediction
Present practical methods for model selection. These are:
  Data-driven
  Replace theoretical analyses with computing power
  Relatively easy to implement
Highlight pros and cons
Advocate prequential approach

Page 7

Why Prediction?

The ideal model captures all of the replicable structure in the data, and none of the noise.
Models capturing too little structure (underfitting): poor prediction.
Models capturing too much noise (overfitting): poor prediction.
MDL/BMS have a predictive interpretation.

Page 8

Data-Driven Methods

Split-sample or hold-out method
Cross-validation:
  Split-half
  Leave-one-out
  K-fold
  Delete-d
Bootstrap model selection
Prequential

These methods can give contradictory results, even asymptotically.

Page 9

Split-Sample Method [1]

[Figure: the data are split into a training set (also called calibration set) and a test set (also called validation set)]

Main idea: predictive virtues can only be assessed for unseen data

Page 10

Split-Sample Method [2]

[Figure: a single fixed split into a training set and a test set]

Often used in neural networks.
The training set and the test set do not change roles: there is no "crossing".
Only one part of the data is ever used for fitting.
High variance.
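To make the recipe concrete, here is a minimal sketch of the hold-out estimate in Python; the model_fit/model_predict callables, the NumPy-array inputs, the train_fraction default, and the squared-error loss are illustrative assumptions rather than anything specified on the slides.

import numpy as np

def holdout_error(model_fit, model_predict, x, y, train_fraction=0.5, seed=0):
    # Split-sample (hold-out) estimate of prediction error.
    # model_fit(x, y) returns fitted parameters; model_predict(params, x)
    # returns predictions. Both are hypothetical placeholders for the
    # candidate model under comparison; x and y are NumPy arrays.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                  # shuffle the observations once
    n_train = int(train_fraction * len(y))
    train, test = idx[:n_train], idx[n_train:]     # the two parts never swap roles
    params = model_fit(x[train], y[train])         # fit on the training set only
    pred = model_predict(params, x[test])          # evaluate on the unseen test set
    return np.mean((pred - y[test]) ** 2)          # mean squared prediction error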

Page 11

Cross-Validation: Split-half

[Figure: the data are split into two halves; each half serves once as the training set and once as the test set]

Prediction error is the average performance on the two test sets (each fitted model is evaluated on the half it did not see).

Page 12

Cross-Validation: Split-half

All the data are used for fitting (but not at the same time, of course).

Variance may be reduced by repeatedly splitting the data in different halves and averaging the results.

Prediction is based on a relatively small data set; this gives relatively large prediction errors.
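Continuing the hold-out sketch above, the variance-reduction idea can be sketched by averaging over repeated random halvings; this reuses the hypothetical holdout_error helper with train_fraction = 0.5.

import numpy as np

def repeated_split_half_error(model_fit, model_predict, x, y, n_repeats=50):
    # Average the split-half prediction error over many random halvings.
    # Each call to holdout_error fits on one random half and tests on the other.
    errors = [holdout_error(model_fit, model_predict, x, y,
                            train_fraction=0.5, seed=s)
              for s in range(n_repeats)]
    return float(np.mean(errors))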

Page 13

Cross-Validation: Leave-one-out [1]

Data (n observations)

Test set = a single observation; training set = all the rest.

Prediction error is the average performance over the n fits, each evaluated on its single held-out observation.

Page 14

All the data are used for fitting (again, not at the same time).

Prediction is based on a large data set; this gives small prediction errors.

Problem: as n grows large, the method overfits (i.e., it does not converge on the correct model, in the case that there is one – that is, the method is not consistent).

Sometimes, the method can have high variance.

Cross-Validation: Leave-one-out [2]

Page 15

[Figure: the data are divided into blocks; each block in turn is set apart as the test set]

Successively set apart a block of data (instead of a single observation).

Cross-Validation: k-fold
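A minimal k-fold sketch under the same assumed interface (NumPy arrays, hypothetical model_fit/model_predict callables, squared-error loss); split-half corresponds to k = 2 and leave-one-out to k = n.

import numpy as np

def kfold_error(model_fit, model_predict, x, y, k=10, seed=0):
    # k-fold cross-validation estimate of prediction error.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)                 # k disjoint test blocks
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)            # everything outside the block
        params = model_fit(x[train], y[train])     # fit without the test block
        pred = model_predict(params, x[fold])      # predict the held-out block
        errors.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(errors))                  # average over the k test blocks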

Page 16

Same as K-fold, except that the test blocks consist of every subset of d observations from the data.

How should we choose d?

Suppose you want the method to be consistent. Then, as n grows large, d/n should approach 1 (Shao, 1997).

Cross-Validation: Delete-d (or Leave-d-Out)

Page 17

Data: {5, 6, 7, 8}

Bootstrap sample 1: {5, 5, 6, 6}
Bootstrap sample 2: {6, 7, 7, 7}
Bootstrap sample 3: {5, 7, 8, 8}

Bootstrap: sampling with replacement from the original data (Efron).

Bootstrap Model Selection [1]: Example

Page 18

Take a bootstrap sample.

For all samples that do not contain observation i, calculate the prediction error for i.

Do this separately for all observations (i = 1,…,n).

Average the prediction errors.

Bootstrap Model Selection [2]: Example
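A minimal sketch of this leave-one-out bootstrap recipe, again with the assumed model_fit/model_predict interface and squared-error loss; the number of bootstrap samples n_boot is an illustrative choice.

import numpy as np

def bootstrap_loo_error(model_fit, model_predict, x, y, n_boot=200, seed=0):
    # Leave-one-out bootstrap estimate of prediction error:
    # for each observation i, average the errors of bootstrap fits whose
    # resample did not contain i, then average over all i.
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = [[] for _ in range(n)]
    for _ in range(n_boot):
        sample = rng.integers(0, n, size=n)        # draw n indices with replacement
        params = model_fit(x[sample], y[sample])
        left_out = np.setdiff1d(np.arange(n), sample)
        for i in left_out:                         # observations absent from this resample
            pred = model_predict(params, x[i:i + 1])
            errors[i].append(float((pred[0] - y[i]) ** 2))
    return float(np.mean([np.mean(e) for e in errors if e]))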

Page 19

The previous recipe is close to split-half CV. Why? Bootstrap samples are supported on about 0.632 n of the original data points.
A correction is known as the .632 estimator.
A better correction is known as the .632+ estimator.

Main idea of bootstrap model selection is to smooth the leave-one-out cross-validation estimate of prediction error.

Bootstrap Complications
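For reference, the .632 correction (Efron & Tibshirani, 1997) blends the apparent error with the leave-one-out bootstrap error; the notation below is mine, not from the slides:

Err(.632) = 0.368 * err_apparent + 0.632 * Err(boot-LOO)

where err_apparent is the error of the model fit to and evaluated on the full data set, and Err(boot-LOO) is the leave-one-out bootstrap error from the recipe above. The .632+ estimator additionally adjusts these weights using an estimate of the relative overfitting rate.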

Page 20

Cross-validation is intuitively attractive, but it leaves open a crucial decision: How big should the training set be? Relatively large training sets lead to overfitting, relatively small training sets lead to underfitting.

Cross-validation lacks a guiding principle. What works in one situation is not guaranteed to work in a different situation.

Interim Conclusion

Page 21

In BMS, one calculates the probability of the observed data under the model of interest.

The model that assigns the highest probability to the observed data is best supported by those data.

It does not matter whether the data arrive one-by-one or all at once!

Back to Bayes [1]

Page 22

P(x_1, x_2) = P(x_2 | x_1) P(x_1)

It does not matter whether the data arrive one-by-one or all at once!

P(x_1, x_2, x_3) = P(x_3 | x_1, x_2) P(x_2 | x_1) P(x_1)

Sequential probability forecasts = BMS.

Back to Bayes [2]
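The pattern above is just the probability chain rule; in general (a standard identity, stated here for clarity, with notation of my choosing):

P(x_1, ..., x_n) = P(x_1) P(x_2 | x_1) ... P(x_n | x_1, ..., x_{n-1})

so the probability a model assigns to the full data set is exactly the product of its one-step-ahead predictive probabilities, which is what makes sequential forecasting equivalent to BMS.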

Page 23

Sequential probability forecasts = MDL.

Rissanen showed that MDL can be given a predictive interpretation (PMDL) – through sequential forecasts.

Back to Minimum Description Length (MDL)

Page 24

Sequential forecasts should form the core of statistical inference.
Prequential principle, prequential probability.
Somehow never really caught on.

The idea is very simple: assess the one-step-ahead accumulative prediction error (APE).
APE has a firm theoretical foundation (BMS, MDL, Prequential).

Prequential Approach (Dawid, 1984)

Page 25

Assess prediction error on unseen data by pretending that a subset of the data is unseen.

As in cross-validation, a model is fitted only to a subset of the data. Predictive performance is evaluated for the data the model did not fit.

Unlike cross-validation, predictive performance only concerns the very next data point. Also, the data used for fitting the model gradually increase (i.e., sequential prediction).

Accumulative One-Step-Ahead Prediction Error (APE)

Page 26

Now let’s calculate the accumulative prediction error for a given weather forecasting method M.

APE Example [1]

Page 27

Given Monday, what will the weather on Tuesday be like, as predicted by forecasting method M?

APE Example [2]

Page 28

Given Monday and Tuesday, what will the weather on Wednesday be like, as predicted by M?

[Figure: observed weather and M's one-step-ahead prediction; prediction error so far: 4]

APE Example [3]

Page 29

Given Monday-Wednesday, what will the weather on Thursday be like, as predicted by M?

[Figure: observed weather and M's one-step-ahead predictions; prediction errors so far: 4, 6]

APE Example [4]

Page 30

Given Monday-Thursday, what will the weather on Friday be like, as predicted by M?

[Figure: observed weather and M's one-step-ahead predictions; prediction errors so far: 4, 6, 3]

APE Example [5]

Page 31

[Figure: observed weather and M's one-step-ahead predictions; prediction errors so far: 4, 6, 3, 2]

APE Example [6]

Given Monday-Friday, what will the weather on Saturday be like, as predicted by M?

Page 32

Given Monday-Saturday, what will the weather on Sunday be like, as predicted by M?

[Figure: observed weather and M's one-step-ahead predictions; prediction errors so far: 4, 6, 3, 2, 1]

APE Example [7]

Page 33

Hence, APE = 4 + 6 + 3 + 2 + 1 + 0 = 16 (assuming point prediction and absolute loss).

[Figure: one-step-ahead prediction errors for Tuesday through Sunday: 4, 6, 3, 2, 1, 0]

APE Example [8]

Page 34

Based on the first i–1 observations, with i–1 always large enough to make the model identifiable, calculate a prediction for the next observation i.

Calculate the prediction error for observation i; for instance, take the squared difference between the predicted and the observed value.

Increase i by 1 and repeat the above two steps until i = n. Sum all of the one-step-ahead prediction errors.

APE Recipe
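A minimal sketch of this recipe under the same assumptions as the earlier sketches (NumPy arrays, hypothetical model_fit/model_predict callables), using squared error as the loss; n_min is the burn-in needed to make the model identifiable.

import numpy as np

def accumulative_prediction_error(model_fit, model_predict, x, y, n_min=10):
    # Accumulative one-step-ahead prediction error (APE):
    # fit to the observations seen so far, predict the next one,
    # accumulate the loss, then grow the fitting set by one observation.
    ape = 0.0
    for i in range(n_min, len(y)):
        params = model_fit(x[:i], y[:i])           # fit to all observations before position i
        pred = model_predict(params, x[i:i + 1])   # one-step-ahead prediction for observation i
        ape += float((pred[0] - y[i]) ** 2)        # squared one-step-ahead error
    return ape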

Page 35

APE is a data-driven method that quite directly assesses the quantity of interest, that is, prediction error on unseen data. It does not require an explicit penalty term. The penalty is in the prediction!

APE is inspired by Predictive Minimum Description Length (PMDL), Bayesian Model Selection, and Dawid's Prequential Principle.

Accumulative Prediction Error: Advantages [1]

Page 36

APE quantifies predictive performance for a particular data set, without relying on asymptotics, and without requiring that the data-generating model is among the candidate models.

Should I go APE?

Accumulative Prediction Error: Advantages [2]

Page 37

APE is sensitive to the order of the observations (not asymptotically though).

For large data sets and many candidate models the computational effort can be a burden.

ML point estimation may give extreme predictions with very few data (i.e., early in the prediction sequence).

For many loss functions, it may be difficult to relate ΔAPE to the relative confidence in the models.

Accumulative Prediction Error: Disadvantages [2]

Page 38

Apply APE to hierarchical designs.

Apply the different methods to problems of realistic complexity.

Challenges

Page 39

Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221-264.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B, 36, 111-147.

Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika, 64, 29-35.

References [1]: Cross-Validation

Page 40

Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, 92, 548-560.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37, 36-48.

Shao, J. (1996). Bootstrap model selection. Journal of the American Statistical Association, 91, 655-665.

References [2]: Bootstrap

Page 41

Dawid, A. P. (1984). The prequential approach. Journal of the Royal Statistical Society, Series A, 147, 278-292.

Wagenmakers, E.-J., Grünwald, P., & Steyvers, M. (2006). Accumulative prediction error and the selection of time series models. Journal of Mathematical Psychology, 50, 149-166. [Contains many other references]

References [3]: Prequential

Page 42

Confusion Matrix

                     Actual: True         Actual: False
Predicted: True      a (true positive)    b (false positive)
Predicted: False     c (false negative)   d (true negative)

true positives = a, true negatives = d
false positives = b, false negatives = c
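A minimal sketch of tallying the four cells from predicted and actual labels (plain Python; the boolean encoding, with True as the positive class, is an assumption):

def confusion_counts(predicted, actual):
    # Returns (a, b, c, d) = (true pos, false pos, false neg, true neg)
    # for parallel sequences of booleans.
    a = sum(p and t for p, t in zip(predicted, actual))              # true positives
    b = sum(p and not t for p, t in zip(predicted, actual))          # false positives
    c = sum(not p and t for p, t in zip(predicted, actual))          # false negatives
    d = sum(not p and not t for p, t in zip(predicted, actual))      # true negatives
    return a, b, c, d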

Page 43


Sensitivity vs. Specificity: Natural Science Terminology

Sensitivity (aka recall): actual positives that were predicted positive, out of total actual positives = true positives / (true positives + false negatives)

Specificity: actual negatives that were predicted negative, out of total actual negatives = true negatives / (false positives + true negatives)

Sensitivity = a / (a + c)
Specificity = d / (b + d)
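Using the counts from the hypothetical confusion_counts sketch above, these two measures are one-liners (illustrative only):

def sensitivity_specificity(a, b, c, d):
    sensitivity = a / (a + c)    # true positives / actual positives (aka recall)
    specificity = d / (b + d)    # true negatives / actual negatives
    return sensitivity, specificity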

Page 44

                           Actual: desired       Actual: not desired
Search hit returned        a (true positive)     b (false positive)
Search hit not returned    c (false negative)    d (true negative)

Precision vs. Recall: Information Science Terminology

Precision: actual positives among the total predicted positives = true positives / (true positives + false positives)

Recall (aka sensitivity): actual positives that were predicted positive, out of total actual positives = true positives / (true positives + false negatives)

Precision = a / (a + b)
Recall = a / (a + c)
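And from the same counts (recall being the same quantity as sensitivity above), another illustrative sketch:

def precision_recall(a, b, c):
    precision = a / (a + b)      # true positives / predicted positives
    recall = a / (a + c)         # true positives / actual positives
    return precision, recall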

Page 45

Receiver Operating Characteristic (ROC)

aka Sensitivity vs. Specificity Curve; a line plot

Ideal point: (0, 1), upper left – false positive rate (FPR) = 0, true positive rate (TPR) = 1
Closer to ideal = better (2-D dominance)

Note: really a plot of TPR (sensitivity) vs. FPR (1 – specificity)
Plots are generated by thresholding the predictions of the trained inducer

Wikipedia (2008). “Receiver Operating Characteristic”. Wikimedia Foundation. Retrieved March 5, 2008 from http://en.wikipedia.org/wiki/Image:Roc.png.

The controllable parameter is the predicted positive rate (therefore, every point represents one confusion matrix).
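A minimal sketch of generating ROC points by sweeping a threshold over the inducer's scores (the score array, the threshold set, and the presence of both classes are illustrative assumptions):

import numpy as np

def roc_points(scores, actual):
    # One (FPR, TPR) point per threshold; each point corresponds to one
    # confusion matrix obtained by predicting "positive" when score >= threshold.
    scores = np.asarray(scores, dtype=float)
    actual = np.asarray(actual, dtype=bool)
    points = []
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        tpr = (predicted & actual).sum() / actual.sum()        # sensitivity
        fpr = (predicted & ~actual).sum() / (~actual).sum()    # 1 - specificity
        points.append((float(fpr), float(tpr)))
    return sorted(points)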

Page 46

Area Under Curve (ROC AUC): Sensitivity vs. Specificity/Recall

Scalar measure of "performance"
Simultaneous measure of sensitivity and specificity/recall
Range: 0.0 (worst case) to 1.0 (corresponding to the L-shaped ideal curve hugging the upper left)

ROC AUC = P(score(actual positive) > score(actual negative))
Also called the discrimination of the inducer

Wikipedia (2008). "Receiver Operating Characteristic". Wikimedia Foundation. Retrieved March 5, 2008 from http://en.wikipedia.org/wiki/Image:Roc_space.png.

Examples picked uniformly at random (according to the actual distribution) – relative rate, not absolute rate
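A minimal sketch of this probabilistic reading of AUC: the fraction of (actual positive, actual negative) pairs that the inducer scores in the right order (counting ties as one half is a common convention, assumed here rather than taken from the slides).

import numpy as np

def roc_auc(scores, actual):
    # AUC = P(score of a random actual positive > score of a random actual negative).
    scores = np.asarray(scores, dtype=float)
    actual = np.asarray(actual, dtype=bool)
    pos, neg = scores[actual], scores[~actual]
    greater = (pos[:, None] > neg[None, :]).mean()   # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).mean()     # tied pairs count half
    return float(greater + 0.5 * ties)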

Page 47

Summary: End of Chapter 6 (Classification)! What Have We Learned So Far?

Supervised: so far
Unsupervised and semi-supervised: next
Reinforcement learning: later (and perhaps in other courses)

Supervised Learning and Inductive Bias
  Representation-based vs. search-based ("H-based vs. L-based")
  Bias : generalization :: heuristic : search
  Reference: Mitchell, "Generalization as Search"

Topics Covered: "Inducers"
  H: conjunctive concepts, decision trees, frequency tables, rule sets, linear threshold gates, multi-layer perceptrons, example sets, associations
  L: candidate elimination, ID3/C4.5/J48, Naïve Bayes, CN2/OneR & GABIL, Perceptron/Winnow & SVM, backprop, nearest-neighbor & RBF, Apriori
  Meta-L: genetic algorithms and evolutionary computation, committee machines (WM, bagging, boosting, stacking, mixture models/HME)

Page 48

Terminology
  Performance Element: part of the system that applies the result of learning
  Flavors of Learning:

  Supervised: with "teacher" (often, classification from labeled examples)
  Unsupervised: from data, using similarity measure (unlabeled instances)
  Reinforcement: "by doing", with reward/penalty signal

Supervised Learning: Target Functions
  Target function – function c or f to be learned
  Target – desired value y to be predicted (sometimes "target function")
  Example / labeled instance – tuples of the form <x, f(x)>
  Classification function, classifier – nominal-valued f (enumerated return type)
Clustering: Application of Unsupervised Learning
Concepts and Hypotheses
  Concept – function c from observations to TRUE or FALSE (membership)
  Class label – output of classification function
  Hypothesis – proposed function h believed to be similar to c (or f)