CSI5388 Current Approaches to Evaluation


Transcript of CSI5388 Current Approaches to Evaluation

Page 1: CSI5388 Current Approaches to Evaluation

CSI5388 Current Approaches to Evaluation

(Based on Chapter 5 of Mitchell T., Machine Learning, 1997)

Page 2: CSI5388 Current Approaches to Evaluation

Motivation

Evaluating the performance of learning systems is important because:

• Learning systems are usually designed to predict the class of "future" unlabeled data points.

• In some cases, evaluating hypotheses is an integral part of the learning process (for example, when pruning a decision tree).

Page 3: CSI5388 Current Approaches to Evaluation

Difficulties in Evaluating Hypotheses when only limited data are available

Bias in the estimate: The observed accuracy of the learned hypothesis over the training examples is a poor estimator of its accuracy over future examples ==> we test the hypothesis on a test set chosen independently of the training set and the hypothesis.

Variance in the estimate: Even with a separate test set, the measured accuracy can vary from the true accuracy, depending on the makeup of the particular set of test examples. The smaller the test set, the greater the expected variance.

Page 4: CSI5388 Current Approaches to Evaluation

Questions Considered

Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?

Given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general?

When data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?

Page 5: CSI5388 Current Approaches to Evaluation

Estimating Hypothesis Accuracy

Two Questions of Interest:

• Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution? ==> sample vs. true error

• What is the probable error in this accuracy estimate? ==> confidence intervals

Page 6: CSI5388 Current Approaches to Evaluation

Sample Error and True Error

Definition 1: The sample error (denoted error_S(h)) of hypothesis h with respect to target function f and data sample S is:

error_S(h) = (1/n) Σ_{x∈S} δ(f(x), h(x))

where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

Definition 2: The true error (denoted error_D(h)) of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

error_D(h) = Pr_{x∈D}[f(x) ≠ h(x)]
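
These two definitions translate directly into code. Below is a minimal Python sketch of Definition 1; the target function f, hypothesis h, and sample S are hypothetical stand-ins chosen for illustration, not taken from the slides.

```python
# A minimal sketch of Definition 1 (sample error). The target f, the
# hypothesis h, and the sample S are illustrative assumptions.

def sample_error(h, f, S):
    """error_S(h): the fraction of examples x in S with f(x) != h(x)."""
    return sum(1 for x in S if f(x) != h(x)) / len(S)

# Hypothetical example: the target thresholds at 0, the hypothesis at 0.1,
# so the two disagree on instances falling in [0, 0.1).
f = lambda x: x >= 0.0
h = lambda x: x >= 0.1
S = [-0.3, -0.05, 0.02, 0.08, 0.5, 1.2]
print(sample_error(h, f, S))  # 2 of 6 examples misclassified -> 0.333...
```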

Page 7: CSI5388 Current Approaches to Evaluation

Confidence Intervals for Discrete-Valued Hypotheses

The general expression for approximate N% confidence intervals for error_D(h) is:

error_S(h) ± z_N √(error_S(h)(1 − error_S(h))/n)

where z_N is given in [Mitchell, Table 5.1].

This approximation is quite good when

n · error_S(h)(1 − error_S(h)) ≥ 5
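
As a worked illustration, the sketch below plugs the interval formula into Python. The z values are the standard two-sided normal quantiles tabulated in [Mitchell, Table 5.1]; the figures r = 12, n = 40 are just an example.

```python
import math

# Two-sided normal quantiles z_N for common confidence levels
# (as tabulated in [Mitchell, Table 5.1]).
Z = {0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def error_confidence_interval(error_s, n, confidence=0.95):
    """error_S(h) +/- z_N * sqrt(error_S(h)(1 - error_S(h))/n)."""
    se = math.sqrt(error_s * (1 - error_s) / n)
    z = Z[confidence]
    return error_s - z * se, error_s + z * se

# Example: h misclassifies r = 12 of n = 40 test examples.
r, n = 12, 40
e = r / n
assert n * e * (1 - e) >= 5             # rule of thumb for the approximation
print(error_confidence_interval(e, n))  # roughly (0.16, 0.44)
```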

Page 8: CSI5388 Current Approaches to Evaluation

Mean and Variance

Definition 1: Consider a random variable Y that takes on possible values y_1, …, y_n. The expected value (or mean value) of Y, E[Y], is:

E[Y] = Σ_{i=1}^{n} y_i Pr(Y = y_i)

Definition 2: The variance of a random variable Y, Var[Y], is:

Var[Y] = E[(Y − E[Y])²]

Definition 3: The standard deviation of a random variable Y is the square root of the variance.
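
These definitions are easy to check on a toy case. The sketch below works through all three for a hypothetical two-valued Y (think of Y = 1 when h misclassifies an instance, which ties into the next slide):

```python
# A tiny worked instance of Definitions 1-3 for a discrete random variable Y.
# The values and probabilities are hypothetical.
values = [0, 1]     # y_i: e.g. Y = 1 when h misclassifies an instance
probs = [0.7, 0.3]  # Pr(Y = y_i)

mean = sum(y * p for y, p in zip(values, probs))               # E[Y] = 0.3
var = sum((y - mean) ** 2 * p for y, p in zip(values, probs))  # Var[Y] = 0.21
std = var ** 0.5                                               # ~0.458
print(mean, var, std)
```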

Page 9: CSI5388 Current Approaches to Evaluation

Estimators, Bias and Variance

Since error_S(h) (an estimator for the true error) obeys a Binomial distribution (see [Mitchell, Section 5.3]), we have:

error_S(h) = r/n and error_D(h) = p

where n is the number of instances in the sample S, r is the number of instances from S misclassified by h, and p is the probability of misclassifying a single instance drawn from D.

Definition: The estimation bias (distinct from the inductive bias) of an estimator Y for an arbitrary parameter p is E[Y] − p.

The standard deviation for error_S(h) is given by:

√(p(1 − p)/n) ≈ √(error_S(h)(1 − error_S(h))/n)
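
A short simulation (not from the slides) makes both claims concrete: averaged over many random test sets, error_S(h) = r/n comes out close to p (zero estimation bias), and its spread matches √(p(1 − p)/n).

```python
import random
import statistics

# Simulate error_S(h) = r/n over many test sets when the true error is p.
# The values of p, n, and the number of trials are illustrative assumptions.
p, n, trials = 0.3, 50, 10_000
estimates = [sum(random.random() < p for _ in range(n)) / n
             for _ in range(trials)]

print(statistics.mean(estimates))   # ~0.30: estimation bias E[Y] - p ~ 0
print(statistics.stdev(estimates))  # ~0.065: empirical standard deviation
print((p * (1 - p) / n) ** 0.5)     # 0.0648...: theoretical sqrt(p(1-p)/n)
```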

Page 10: CSI5388 Current Approaches to Evaluation

Difference in Error of Two Hypotheses

Let h1 and h2 be two hypotheses for some discrete-valued target function. h1 has been tested on a sample S1 containing n1 randomly drawn examples, and h2 has been tested on an independent sample S2 containing n2 examples drawn from the same distribution.

Let's estimate the difference between the true errors of these two hypotheses, d, by computing the difference between the sample errors:

d̂ = error_S1(h1) − error_S2(h2)

The approximate N% confidence interval for d is:

d̂ ± z_N √(error_S1(h1)(1 − error_S1(h1))/n1 + error_S2(h2)(1 − error_S2(h2))/n2)
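
The interval is a direct translation of the formula; in the sketch below, the sample errors and test-set sizes are hypothetical numbers chosen only to show the computation.

```python
import math

# Confidence interval for d = error_D(h1) - error_D(h2), given sample errors
# e1, e2 measured on independent test sets of sizes n1, n2 (z = 1.96 for 95%).
def diff_confidence_interval(e1, n1, e2, n2, z=1.96):
    d_hat = e1 - e2
    se = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * se, d_hat + z * se

# Hypothetical: h1 errs 30% on 100 examples, h2 errs 20% on another 100.
print(diff_confidence_interval(0.30, 100, 0.20, 100))
# ~(-0.02, 0.22): the interval contains 0, so the observed difference of
# 0.10 is not significant at the 95% level.
```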

Page 11: CSI5388 Current Approaches to Evaluation

Comparing Learning Algorithms

Which of L_A and L_B is the better learning method, on average, for learning some particular target function f?

To answer this question, we wish to estimate the expected value of the difference in their errors:

E_{S⊂D}[error_D(L_A(S)) − error_D(L_B(S))]

Of course, since we have only a limited sample D0, we estimate this quantity by dividing D0 into a training set S0 and a testing set T0 and measuring:

error_T0(L_A(S0)) − error_T0(L_B(S0))

Problem: We are only measuring the difference in errors for one training set S0, rather than the expected value of this difference over all samples S drawn from D.

Solution: k-fold cross-validation

Page 12: CSI5388 Current Approaches to Evaluation

k-Fold Cross-Validation

1. Partition the available data D0 into k disjoint subsets T1, T2, …, Tk of equal size, where this size is at least 30.

2. For i from 1 to k, do:

   Use Ti as the test set, and the remaining data as the training set Si:

   Si <- {D0 − Ti}
   h_A <- L_A(Si)
   h_B <- L_B(Si)
   δ_i <- error_Ti(h_A) − error_Ti(h_B)

3. Return the value avg(δ), where

   avg(δ) = (1/k) Σ_{i=1}^{k} δ_i
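
This procedure maps directly onto a few lines of Python. In the sketch below, scikit-learn's KFold does the partitioning, and two stock classifiers stand in for L_A and L_B; the dataset and the choice of learners are assumptions for illustration, not part of the slides.

```python
# A sketch of the k-fold comparison, with assumed learners and dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # 569 examples -> folds of size >= 30

deltas = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    h_a = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    h_b = GaussianNB().fit(X[train_idx], y[train_idx])
    err_a = 1 - h_a.score(X[test_idx], y[test_idx])  # error_Ti(h_A)
    err_b = 1 - h_b.score(X[test_idx], y[test_idx])  # error_Ti(h_B)
    deltas.append(err_a - err_b)                     # delta_i

avg_delta = sum(deltas) / len(deltas)  # avg(delta), the returned estimate
print(avg_delta)
```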

Page 13: CSI5388 Current Approaches to Evaluation

Confidence of the k-Fold Estimate

The approximate N% confidence interval for estimating E_{S⊂D0}[error_D(L_A(S)) − error_D(L_B(S))] using avg(δ) is given by:

avg(δ) ± t_{N,k−1} · s_{avg(δ)}

where t_{N,k−1} is a constant similar to z_N (see [Mitchell, Table 5.6]) and s_{avg(δ)} is an estimate of the standard deviation of the distribution governing avg(δ):

s_{avg(δ)} = √( (1/(k(k−1))) Σ_{i=1}^{k} (δ_i − avg(δ))² )
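
Continuing the k-fold sketch from Page 12, the interval is straightforward to compute. The per-fold differences below are hypothetical, and t = 2.26 is the tabulated constant for 95% confidence with k − 1 = 9 degrees of freedom (see [Mitchell, Table 5.6]).

```python
import math

# Hypothetical per-fold differences delta_i = error_Ti(h_A) - error_Ti(h_B),
# e.g. as produced by the k-fold sketch on Page 12 (k = 10).
deltas = [0.02, -0.01, 0.03, 0.00, 0.04, 0.01, -0.02, 0.03, 0.02, 0.01]

k = len(deltas)
avg_delta = sum(deltas) / k
s_avg = math.sqrt(sum((d - avg_delta) ** 2 for d in deltas) / (k * (k - 1)))

t_95_9 = 2.26  # t_{N, k-1} for N = 95% and k - 1 = 9 degrees of freedom
print(avg_delta - t_95_9 * s_avg, avg_delta + t_95_9 * s_avg)
```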