Page 1

CIS 732 / 830: Machine Learning / Advanced Topics in AI
Lecture 19 of 42
Wednesday, 05 March 2008

William H. Hsu
Department of Computing and Information Sciences, KSU
KSOL course pages: http://snurl.com/1ydii / http://snipurl.com/1y5ih
Course web site: http://www.kddresearch.org/Courses/Spring-2008/CIS732
Instructor home page: http://www.cis.ksu.edu/~bhsu

Reading:
Today: Section 6.15, Han & Kamber 2e
Friday: Sections 7.1 – 7.3, Han & Kamber 2e

Model Selection and Performance Measurement: Confidence and ROC Curves

Page 2

Eric-Jan Wagenmakers

Practical Methods for Model Selection:

Cross-Validation, Bootstrap, Prequential

Page 3

Problem

How to compare rival models that differ in complexity (e.g., the number and functional form of parameters)?

Model X?

Model Y?

Model Z?

Page 4

Solution?

Use model selection methods such as:
  Minimum Description Length (MDL)
  Bayesian Model Selection (BMS)
These methods have a sound theoretical basis.

"I like MDL!" "I like BMS!"

Page 5

New Problems!

What if it is impossible to calculate what MDL/BMS require?

What if your favorite model does not have a likelihood function?

"I like MDL!" "I like BMS!"

Page 6

Aim of this Presentation

Stress importance of prediction
Present practical methods for model selection. These are:
  Data-driven
  Replace theoretical analyses with computing power
  Relatively easy to implement
Highlight pros and cons
Advocate prequential approach

Page 7

Why Prediction?

The ideal model captures all of the replicable structure in the data, and none of the noise.
Models capturing too little structure (underfitting): poor prediction.
Models capturing too much noise (overfitting): poor prediction.
MDL/BMS have a predictive interpretation.

Page 8

Data-Driven Methods

Split-sample or hold-out method
Cross-validation:
  Split-half
  Leave-one-out
  K-fold
  Delete-d
Bootstrap model selection
Prequential

These methods can give contradictory results, even asymptotically.

Page 9

Split-Sample Method [1]

[Figure: the data are split into a training set (also called calibration set) and a test set (also called validation set)]

Main idea: predictive virtues can only be assessed for unseen data

Page 10

Split-Sample Method [2]

[Figure: a single fixed split into a training set and a test set]

Often used in neural networks.
The training set and the test set do not change roles: there is no "crossing".
Only one part of the data is ever used for fitting.
High variance.
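To make the recipe concrete, here is a minimal sketch of the hold-out estimate in Python; the model_fit/model_predict callables, the NumPy-array inputs, the train_fraction default, and the squared-error loss are illustrative assumptions rather than anything specified on the slides.

import numpy as np

def holdout_error(model_fit, model_predict, x, y, train_fraction=0.5, seed=0):
    # Split-sample (hold-out) estimate of prediction error.
    # model_fit(x, y) returns fitted parameters; model_predict(params, x)
    # returns predictions. Both are hypothetical placeholders for the
    # candidate model under comparison; x and y are NumPy arrays.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                  # shuffle the observations once
    n_train = int(train_fraction * len(y))
    train, test = idx[:n_train], idx[n_train:]     # the two parts never swap roles
    params = model_fit(x[train], y[train])         # fit on the training set only
    pred = model_predict(params, x[test])          # evaluate on the unseen test set
    return np.mean((pred - y[test]) ** 2)          # mean squared prediction error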

Page 11

Cross-Validation: Split-half

[Figure: the data are split into two halves; each half serves once as the training set and once as the test set]

Prediction error is the average performance on the two test sets (each fitted model is evaluated on the half it did not see).

Page 12

Cross-Validation: Split-half

All the data are used for fitting (but not at the same time, of course).

Variance may be reduced by repeatedly splitting the data in different halves and averaging the results.

Prediction is based on a relatively small data set; this gives relatively large prediction errors.
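Continuing the hold-out sketch above, the variance-reduction idea can be sketched by averaging over repeated random halvings; this reuses the hypothetical holdout_error helper with train_fraction = 0.5.

import numpy as np

def repeated_split_half_error(model_fit, model_predict, x, y, n_repeats=50):
    # Average the split-half prediction error over many random halvings.
    # Each call to holdout_error fits on one random half and tests on the other.
    errors = [holdout_error(model_fit, model_predict, x, y,
                            train_fraction=0.5, seed=s)
              for s in range(n_repeats)]
    return float(np.mean(errors))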

Page 13

Cross-Validation: Leave-one-out [1]

Data (n observations)

Test set = a single observation; training set = all the rest.

Prediction error is the average performance over the n fits, each evaluated on its single held-out observation.

Page 14

All the data are used for fitting (again, not at the same time).

Prediction is based on a large data set; this gives small prediction errors.

Problem: as n grows large, the method overfits (i.e., it does not converge on the correct model, in the case that there is one – that is, the method is not consistent).

Sometimes, the method can have high variance.

Cross-Validation: Leave-one-out [2]

Page 15

[Figure: the data are divided into blocks; each block in turn is set apart as the test set]

Successively set apart a block of data (instead of a single observation).

Cross-Validation: k-fold
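A minimal k-fold sketch under the same assumed interface (NumPy arrays, hypothetical model_fit/model_predict callables, squared-error loss); split-half corresponds to k = 2 and leave-one-out to k = n.

import numpy as np

def kfold_error(model_fit, model_predict, x, y, k=10, seed=0):
    # k-fold cross-validation estimate of prediction error.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)                 # k disjoint test blocks
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)            # everything outside the block
        params = model_fit(x[train], y[train])     # fit without the test block
        pred = model_predict(params, x[fold])      # predict the held-out block
        errors.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(errors))                  # average over the k test blocks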

Page 16

Same as K-fold, except that the test blocks consist of every subset of d observations from the data.

How should we choose d?

Suppose you want the method to be consistent. Then, as n grows large, d/n should approach 1 (Shao, 1997).

Cross-Validation: Delete-d (or Leave-d-Out)

Page 17

Data: {5, 6, 7, 8}

Bootstrap sample 1: {5, 5, 6, 6}
Bootstrap sample 2: {6, 7, 7, 7}
Bootstrap sample 3: {5, 7, 8, 8}

Bootstrap: sampling with replacement from the original data (Efron).

Bootstrap Model Selection [1]: Example

Page 18

Take a bootstrap sample.

For all samples that do not contain observation i, calculate the prediction error for i.

Do this separately for all observations (i = 1,…,n).

Average the prediction errors.

Bootstrap Model Selection [2]: Example
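A minimal sketch of this leave-one-out bootstrap recipe, again with the assumed model_fit/model_predict interface and squared-error loss; the number of bootstrap samples n_boot is an illustrative choice.

import numpy as np

def bootstrap_loo_error(model_fit, model_predict, x, y, n_boot=200, seed=0):
    # Leave-one-out bootstrap estimate of prediction error:
    # for each observation i, average the errors of bootstrap fits whose
    # resample did not contain i, then average over all i.
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = [[] for _ in range(n)]
    for _ in range(n_boot):
        sample = rng.integers(0, n, size=n)        # draw n indices with replacement
        params = model_fit(x[sample], y[sample])
        left_out = np.setdiff1d(np.arange(n), sample)
        for i in left_out:                         # observations absent from this resample
            pred = model_predict(params, x[i:i + 1])
            errors[i].append(float((pred[0] - y[i]) ** 2))
    return float(np.mean([np.mean(e) for e in errors if e]))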

Page 19

The previous recipe is close to split-half CV. Why? Bootstrap samples are supported on about 0.632 n of the original data points.
A correction is known as the .632 estimator.
A better correction is known as the .632+ estimator.

Main idea of bootstrap model selection is to smooth the leave-one-out cross-validation estimate of prediction error.

Bootstrap Complications
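For reference, the .632 correction (Efron & Tibshirani, 1997) blends the apparent error with the leave-one-out bootstrap error; the notation below is mine, not from the slides:

Err(.632) = 0.368 * err_apparent + 0.632 * Err(boot-LOO)

where err_apparent is the error of the model fit to and evaluated on the full data set, and Err(boot-LOO) is the leave-one-out bootstrap error from the recipe above. The .632+ estimator additionally adjusts these weights using an estimate of the relative overfitting rate.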

Page 20

Cross-validation is intuitively attractive, but it leaves open a crucial decision: How big should the training set be? Relatively large training sets lead to overfitting, relatively small training sets lead to underfitting.

Cross-validation lacks a guiding principle. What works in one situation is not guaranteed to work in a different situation.

Interim Conclusion

Page 21

In BMS, one calculates the probability of the observed data under the model of interest.

The model that assigns the highest probability to the observed data is best supported by those data.

It does not matter whether the data arrive one-by-one or all at once!

Back to Bayes [1]

Page 22

P(x_1, x_2) = P(x_2 | x_1) P(x_1)

It does not matter whether the data arrive one-by-one or all at once!

P(x_1, x_2, x_3) = P(x_3 | x_1, x_2) P(x_2 | x_1) P(x_1)

Sequential probability forecasts = BMS.

Back to Bayes [2]
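The pattern above is just the probability chain rule; in general (a standard identity, stated here for clarity, with notation of my choosing):

P(x_1, ..., x_n) = P(x_1) P(x_2 | x_1) ... P(x_n | x_1, ..., x_{n-1})

so the probability a model assigns to the full data set is exactly the product of its one-step-ahead predictive probabilities, which is what makes sequential forecasting equivalent to BMS.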

Page 23

Sequential probability forecasts = MDL.

Rissanen showed that MDL can be given a predictive interpretation (PMDL) – through sequential forecasts.

Back to Minimum Description Length (MDL)

Page 24

Sequential forecasts should form the core of statistical inference.
Prequential principle, prequential probability.
Somehow never really caught on.

The idea is very simple: assess the one-step-ahead accumulative prediction error (APE).
APE has a firm theoretical foundation (BMS, MDL, Prequential).

Prequential Approach (Dawid, 1984)

Page 25

Assess prediction error on unseen data by pretending that a subset of the data is unseen.

As in cross-validation, a model is fitted only to a subset of the data. Predictive performance is evaluated for the data the model did not fit.

Unlike cross-validation, predictive performance only concerns the very next data point. Also, the data used for fitting the model gradually increase (i.e., sequential prediction).

Accumulative One-Step-Ahead Prediction Error (APE)

Page 26

Now let’s calculate the accumulative prediction error for a given weather forecasting method M.

APE Example [1]

Page 27

Given Monday, what will the weather on Tuesday be like, as predicted by forecasting method M?

APE Example [2]

Page 28

Given Monday and Tuesday, what will the weather on Wednesday be like, as predicted by M?

[Figure: observed weather and M's one-step-ahead prediction; prediction error so far: 4]

APE Example [3]

Page 29

Given Monday-Wednesday, what will the weather on Thursday be like, as predicted by M?

[Figure: observed weather and M's one-step-ahead predictions; prediction errors so far: 4, 6]

APE Example [4]

Page 30

Given Monday-Thursday, what will the weather on Friday be like, as predicted by M?

[Figure: observed weather and M's one-step-ahead predictions; prediction errors so far: 4, 6, 3]

APE Example [5]

Page 31

[Figure: observed weather and M's one-step-ahead predictions; prediction errors so far: 4, 6, 3, 2]

APE Example [6]

Given Monday-Friday, what will the weather on Saturday be like, as predicted by M?

Page 32

Given Monday-Saturday, what will the weather on Sunday be like, as predicted by M?

[Figure: observed weather and M's one-step-ahead predictions; prediction errors so far: 4, 6, 3, 2, 1]

APE Example [7]

Page 33

Hence, APE = 4 + 6 + 3 + 2 + 1 + 0 = 16 (assuming point prediction and absolute loss).

[Figure: one-step-ahead prediction errors for Tuesday through Sunday: 4, 6, 3, 2, 1, 0]

APE Example [8]

Page 34

Based on the first i–1 observations, with i–1 always large enough to make the model identifiable, calculate a prediction for the next observation i.

Calculate the prediction error for observation i; for instance, take the squared difference between the predicted and the observed value.

Increase i by 1 and repeat the above two steps until i = n. Sum all of the one-step-ahead prediction errors.

APE Recipe
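A minimal sketch of this recipe under the same assumptions as the earlier sketches (NumPy arrays, hypothetical model_fit/model_predict callables), using squared error as the loss; n_min is the burn-in needed to make the model identifiable.

import numpy as np

def accumulative_prediction_error(model_fit, model_predict, x, y, n_min=10):
    # Accumulative one-step-ahead prediction error (APE):
    # fit to the observations seen so far, predict the next one,
    # accumulate the loss, then grow the fitting set by one observation.
    ape = 0.0
    for i in range(n_min, len(y)):
        params = model_fit(x[:i], y[:i])           # fit to all observations before position i
        pred = model_predict(params, x[i:i + 1])   # one-step-ahead prediction for observation i
        ape += float((pred[0] - y[i]) ** 2)        # squared one-step-ahead error
    return ape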

Page 35

APE is a data-driven method that quite directly assesses the quantity of interest, that is, prediction error on unseen data. It does not require an explicit penalty term. The penalty is in the prediction!

APE is inspired by Predictive Minimum Description Length (PMDL), Bayesian Model Selection, and Dawid's Prequential Principle.

Accumulative Prediction Error: Advantages [1]

Page 36

APE quantifies predictive performance for a particular data set, without relying on asymptotics, and without requiring that the data-generating model is among the candidate models.

Should I go APE?

Accumulative Prediction Error: Advantages [2]

Page 37

APE is sensitive to the order of the observations (not asymptotically though).

For large data sets and many candidate models the computational effort can be a burden.

ML point estimation may give extreme predictions with very few data (i.e., early in the prediction sequence).

For many loss functions, it may be difficult to relate ΔAPE to the relative confidence in the models.

Accumulative Prediction Error: Disadvantages [2]

Page 38

Apply APE to hierarchical designs.

Apply the different methods to problems of realistic complexity.

Challenges

Page 39

Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221-264.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B, 36, 111-147.

Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika, 64, 29-35.

References [1]: Cross-Validation

Page 40

Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, 92, 548-560.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37, 36-48.

Shao, J. (1996). Bootstrap model selection. Journal of the American Statistical Association, 91, 655-665.

References [2]: Bootstrap

Page 41

Dawid, A. P. (1984). The prequential approach. Journal of the Royal Statistical Society, Series A, 147, 278-292.

Wagenmakers, E.-J., Grünwald, P., & Steyvers, M. (2006). Accumulative prediction error and the selection of time series models. Journal of Mathematical Psychology, 50, 149-166. [Contains many other references]

References [3]: Prequential

Page 42

Confusion Matrix

                     Actual: True         Actual: False
Predicted: True      a (true positive)    b (false positive)
Predicted: False     c (false negative)   d (true negative)

true positives = a, true negatives = d
false positives = b, false negatives = c
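A minimal sketch of tallying the four cells from predicted and actual labels (plain Python; the boolean encoding, with True as the positive class, is an assumption):

def confusion_counts(predicted, actual):
    # Returns (a, b, c, d) = (true pos, false pos, false neg, true neg)
    # for parallel sequences of booleans.
    a = sum(p and t for p, t in zip(predicted, actual))              # true positives
    b = sum(p and not t for p, t in zip(predicted, actual))          # false positives
    c = sum(not p and t for p, t in zip(predicted, actual))          # false negatives
    d = sum(not p and not t for p, t in zip(predicted, actual))      # true negatives
    return a, b, c, d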

Page 43


Sensitivity vs. Specificity: Natural Science Terminology

Sensitivity (aka recall): actual positives that were predicted positive, out of total actual positives = true positives / (true positives + false negatives)

Specificity: actual negatives that were predicted negative, out of total actual negatives = true negatives / (false positives + true negatives)

Sensitivity = a / (a + c)
Specificity = d / (b + d)
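Using the counts from the hypothetical confusion_counts sketch above, these two measures are one-liners (illustrative only):

def sensitivity_specificity(a, b, c, d):
    sensitivity = a / (a + c)    # true positives / actual positives (aka recall)
    specificity = d / (b + d)    # true negatives / actual negatives
    return sensitivity, specificity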

Page 44

                           Actual: desired       Actual: not desired
Search hit returned        a (true positive)     b (false positive)
Search hit not returned    c (false negative)    d (true negative)

Precision vs. Recall: Information Science Terminology

Precision: actual positives among the total predicted positives = true positives / (true positives + false positives)

Recall (aka sensitivity): actual positives that were predicted positive, out of total actual positives = true positives / (true positives + false negatives)

Precision = a / (a + b)
Recall = a / (a + c)
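And from the same counts (recall being the same quantity as sensitivity above), another illustrative sketch:

def precision_recall(a, b, c):
    precision = a / (a + b)      # true positives / predicted positives
    recall = a / (a + c)         # true positives / actual positives
    return precision, recall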

Page 45

Receiver Operating Characteristic (ROC)

aka Sensitivity vs. Specificity Curve; a line plot

Ideal point: (0, 1), upper left – false positive rate (FPR) = 0, true positive rate (TPR) = 1
Closer to ideal = better (2-D dominance)

Note: really a plot of TPR (sensitivity) vs. FPR (1 – specificity)
Plots are generated by thresholding the predictions of the trained inducer

Wikipedia (2008). “Receiver Operating Characteristic”. Wikimedia Foundation. Retrieved March 5, 2008 from http://en.wikipedia.org/wiki/Image:Roc.png.

The controllable parameter is the predicted positive rate (therefore, every point represents one confusion matrix).
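A minimal sketch of generating ROC points by sweeping a threshold over the inducer's scores (the score array, the threshold set, and the presence of both classes are illustrative assumptions):

import numpy as np

def roc_points(scores, actual):
    # One (FPR, TPR) point per threshold; each point corresponds to one
    # confusion matrix obtained by predicting "positive" when score >= threshold.
    scores = np.asarray(scores, dtype=float)
    actual = np.asarray(actual, dtype=bool)
    points = []
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        tpr = (predicted & actual).sum() / actual.sum()        # sensitivity
        fpr = (predicted & ~actual).sum() / (~actual).sum()    # 1 - specificity
        points.append((float(fpr), float(tpr)))
    return sorted(points)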

Page 46

Area Under Curve (ROC AUC): Sensitivity vs. Specificity/Recall

Scalar measure of "performance"
Simultaneous measure of sensitivity and specificity/recall
Range: 0.0 (worst case) to 1.0 (corresponding to the L-shaped ideal curve hugging the upper left)

ROC AUC = P(score(actual positive) > score(actual negative))
Also called the discrimination of the inducer

Wikipedia (2008). "Receiver Operating Characteristic". Wikimedia Foundation. Retrieved March 5, 2008 from http://en.wikipedia.org/wiki/Image:Roc_space.png.

Examples picked uniformly at random (according to the actual distribution) – relative rate, not absolute rate
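A minimal sketch of this probabilistic reading of AUC: the fraction of (actual positive, actual negative) pairs that the inducer scores in the right order (counting ties as one half is a common convention, assumed here rather than taken from the slides).

import numpy as np

def roc_auc(scores, actual):
    # AUC = P(score of a random actual positive > score of a random actual negative).
    scores = np.asarray(scores, dtype=float)
    actual = np.asarray(actual, dtype=bool)
    pos, neg = scores[actual], scores[~actual]
    greater = (pos[:, None] > neg[None, :]).mean()   # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).mean()     # tied pairs count half
    return float(greater + 0.5 * ties)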

Page 47

Summary: End of Chapter 6 (Classification)! What Have We Learned So Far?

Supervised: so far
Unsupervised and semi-supervised: next
Reinforcement learning: later (and perhaps in other courses)

Supervised Learning and Inductive Bias
  Representation-based vs. search-based ("H-based vs. L-based")
  Bias : generalization :: heuristic : search
  Reference: Mitchell, "Generalization as Search"

Topics Covered: "Inducers"
  H: conjunctive concepts, decision trees, frequency tables, rule sets, linear threshold gates, multi-layer perceptrons, example sets, associations
  L: candidate elimination, ID3/C4.5/J48, Naïve Bayes, CN2/OneR & GABIL, Perceptron/Winnow & SVM, backprop, nearest-neighbor & RBF, Apriori
  Meta-L: genetic algorithms and evolutionary computation, committee machines (WM, bagging, boosting, stacking, mixture models/HME)

Page 48

Terminology
  Performance Element: part of the system that applies the result of learning
  Flavors of Learning:

  Supervised: with "teacher" (often, classification from labeled examples)
  Unsupervised: from data, using similarity measure (unlabeled instances)
  Reinforcement: "by doing", with reward/penalty signal

Supervised Learning: Target Functions
  Target function – function c or f to be learned
  Target – desired value y to be predicted (sometimes "target function")
  Example / labeled instance – tuples of the form <x, f(x)>
  Classification function, classifier – nominal-valued f (enumerated return type)
Clustering: Application of Unsupervised Learning
Concepts and Hypotheses
  Concept – function c from observations to TRUE or FALSE (membership)
  Class label – output of classification function
  Hypothesis – proposed function h believed to be similar to c (or f)