
A Bootstrap Interval Estimator for Bayes' Classification Error
Chad M. Hawes a,b, Carey E. Priebe a

a The Johns Hopkins University, Dept. of Applied Mathematics & Statistics
b The Johns Hopkins University Applied Physics Laboratory

Abstract

• Given a finite-length classifier training set, we propose a new estimation approach that provides an interval estimate of the Bayes'-optimal classification error L*, by:
   • Assuming power-law decay for the unconditional error rate of the k-nearest neighbor (kNN) classifier
   • Constructing bootstrap-sampled training sets of varying size
   • Evaluating the kNN classifier on the bootstrap training sets to estimate the unconditional error rate
   • Fitting the resulting kNN error-rate decay, as a function of training set size, to the assumed power-law form
• The standard kNN rule provides an upper bound on L*
• Hellman's (k,k') nearest neighbor rule with reject option provides a lower bound on L*
• The result is an asymptotic interval estimate of L* obtained from a finite sample
• We apply this L* interval estimator to two classification datasets

Motivation

• Knowledge of the Bayes'-optimal classification error L* tells us the best any classification rule could do on a given classification problem:
   • The difference between your classifier's error rate L_n and L* indicates how much improvement is possible through changes to your classifier, for a fixed feature set
   • If L* is small and |L_n - L*| is large, then it is worth spending time and money to improve your classifier
• Knowledge of the Bayes'-optimal classification error L* indicates how good our features are for discriminating between our (two) classes:
   • If L* is large and |L_n - L*| is small, then it is better to spend time and money finding better features (changing F_XY) than improving your classifier
• An estimate of the Bayes' error L* is therefore useful for guiding where to invest time and money: in classification rule improvement or in feature development

Theory

Model & NotationWe have training data:

Conditional probability of error for kNN rule:Finite sample:Asymptotic:

Feature Vector:Class Label:

We have testing data:

We build k-nearest neighbor (kNN) classification rule:

denoted as

Unconditional probability of error for kNN rule:Finite sample:Asymptotic:

Empirical distribution puts mass 1/n on n training samples
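To make the finite-sample quantities concrete, here is a minimal sketch (Python/NumPy; not from the poster, and the function names are ours) of estimating the kNN rule's conditional error rate $\hat{L}_n^{(k)}$ on a test set $T_m$, using Euclidean distance and majority vote:

import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Majority-vote k-nearest-neighbor prediction (Euclidean distance, labels in {0,1})."""
    preds = np.empty(len(X_query), dtype=int)
    for i, x in enumerate(X_query):
        dists = np.linalg.norm(X_train - x, axis=1)        # distances to all training points
        nn_idx = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
        votes = np.bincount(y_train[nn_idx], minlength=2)   # class-0 / class-1 vote counts
        preds[i] = np.argmax(votes)                         # majority vote (ties go to class 0)
    return preds

def conditional_error(X_train, y_train, X_test, y_test, k):
    """Estimate L_n^(k) = P(g_n(X) != Y | D_n) by the misclassification rate on T_m."""
    return np.mean(knn_predict(X_train, y_train, X_test, k) != y_test)

With the empirical distribution placing mass 1/n on each training pair, this same estimator is what the bootstrap procedure below repeatedly evaluates on resampled training sets.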

• No approach to estimating the Bayes' error can work for all joint distributions $F_{XY}$:
   • Devroye 1982: For any (fixed) integer n, any $\epsilon > 0$, and any classification rule $g_n$, there exists a distribution $F_{XY}$ with Bayes' error L* = 0 such that $E[L_n] \geq 1/2 - \epsilon$
   • ⇒ there must exist conditions on $F_{XY}$ for our technique to apply

• Asymptotic kNN-rule error rates form an interval bound on L*:
   • Devijver 1979: For fixed k, $R_\infty^{(k,k')} \leq L^* \leq R_\infty^{(k)}$, where the lower bound is the asymptotic error rate of the kNN rule with reject option (Hellman 1970); a code sketch of the (k,k') rule appears at the end of this section
   • ⇒ if we estimate these asymptotic rates with a finite sample, we have an interval estimate of L*

• The kNN rule's unconditional error follows a known form for a class of distributions $F_{XY}$:
   • Snapp & Venkatesh 1998: Under regularity conditions on $F_{XY}$, the finite-sample unconditional error rate of the kNN rule, for fixed k, follows the asymptotic expansion $R_n^{(k)} = R_\infty^{(k)} + \sum_{j=2}^{N} c_j\, n^{-j/d} + O\big(n^{-(N+1)/d}\big)$, where d is the feature dimension and the coefficients $c_j$ depend on $F_{XY}$
   • ⇒ there exists a known parametric form for the kNN rule's error-rate decay
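Since the lower endpoint of the interval comes from Hellman's (k,k') rule, here is a minimal, self-contained sketch of that rule as we understand it: a point is classified by majority vote only if at least k' of its k nearest neighbors agree on one class, and is rejected otherwise. Counting errors only over the non-rejected test points is our assumption about the convention; the poster does not state it.

import numpy as np

def knn_reject_predict(X_train, y_train, X_query, k, k_prime):
    """Hellman-style (k, k') rule: classify only if at least k' of the
    k nearest neighbors agree on one class; otherwise return -1 (reject)."""
    preds = np.empty(len(X_query), dtype=int)
    for i, x in enumerate(X_query):
        dists = np.linalg.norm(X_train - x, axis=1)
        nn_labels = y_train[np.argsort(dists)[:k]]
        votes = np.bincount(nn_labels, minlength=2)
        preds[i] = np.argmax(votes) if votes.max() >= k_prime else -1
    return preds

def reject_rule_error(X_train, y_train, X_test, y_test, k, k_prime):
    """Error rate among accepted (non-rejected) test points.
    NOTE: counting errors only on accepted points is an assumed convention;
    another convention divides by all test points."""
    preds = knn_reject_predict(X_train, y_train, X_test, k, k_prime)
    accepted = preds != -1
    if not accepted.any():
        return np.nan                      # everything rejected; error undefined
    return np.mean(preds[accepted] != y_test[accepted])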

Approach: Part 1

1. Construct B bootstrap-sampled training datasets of size $n_j$ from $D_n$ by sampling from the empirical distribution $\hat{F}_n$
   • For each bootstrap-constructed training dataset $b = 1, \ldots, B$, estimate the kNN rule's conditional error rate on the test set $T_m$, yielding $\hat{L}_{n_j}^{(b)}$
2. Estimate the mean and variance of $\{\hat{L}_{n_j}^{(b)}\}_{b=1}^{B}$ for training sample size $n_j$:
   • The mean provides the estimate $\hat{R}_{n_j}$ of the unconditional error rate
   • The variance is used for weighted fitting of the error-rate decay curve
3. Repeat steps 1 and 2 for the desired training sample sizes $n_1 < n_2 < \cdots < n_J$:
   • This yields the estimates $\{(\hat{R}_{n_j}, \hat{\sigma}_{n_j}^2)\}_{j=1}^{J}$
4. Construct the estimated unconditional error-rate decay curve versus training sample size n (a code sketch of steps 1-4 follows this list)
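A minimal sketch of steps 1-4 (Python/NumPy; illustrative rather than the authors' code): for each training size $n_j$ it draws B bootstrap training sets from the empirical distribution of $D_n$, scores each on the test set with the conditional_error helper sketched earlier, and records the per-size mean and variance. The default B = 100 and the helper names are our assumptions.

import numpy as np

def bootstrap_error_curve(X_train, y_train, X_test, y_test,
                          sizes, k, B=100, seed=None):
    """For each training size n_j, draw B bootstrap training sets from the
    empirical distribution of (X_train, y_train), estimate the kNN conditional
    error on the test set, and return per-size mean and variance estimates."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    means, variances = [], []
    for n_j in sizes:
        errs = np.empty(B)
        for b in range(B):
            idx = rng.integers(0, n, size=n_j)    # sample n_j indices with replacement (mass 1/n each)
            errs[b] = conditional_error(X_train[idx], y_train[idx],
                                        X_test, y_test, k)
        means.append(errs.mean())                  # estimate of the unconditional rate R_{n_j}
        variances.append(errs.var(ddof=1))         # used later as fitting weights
    return np.array(sizes), np.array(means), np.array(variances)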

Approach: Part 2

1. Assume the kNN rule's error rates decay according to a simple power-law form: $R_n^{(k)} \approx a + b\, n^{-c}$, where the intercept a plays the role of the asymptotic rate $R_\infty^{(k)}$
2. Perform a weighted nonlinear least squares fit of this form to the constructed error-rate curve:
   • Use the variance of the bootstrapped conditional error-rate estimates as weights
3. The resulting estimate $\hat{a} = \hat{R}_\infty^{(k)}$ forms the upper bound estimate for L*:
   • The strong assumption on the form of the error-rate decay enables estimation of the asymptotic error rate using only a finite sample
4. Repeat the entire procedure using Hellman's (k,k') nearest neighbor rule with reject option to form the lower bound estimate for L*:
   • This yields the interval estimate of the Bayes' classification error as $\big[\hat{R}_\infty^{(k,k')},\ \hat{R}_\infty^{(k)}\big]$ (a fitting sketch follows this list)
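A minimal sketch of the weighted power-law fit (Python/SciPy; illustrative): scipy.optimize.curve_fit is given the bootstrap standard deviations as sigma, so noisier points carry less weight, and the fitted intercept a serves as the asymptotic-rate estimate. Run it once on the standard kNN curve for the upper endpoint and once on the (k,k') reject-option curve for the lower endpoint; the starting values and bounds below are our assumptions.

import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Assumed decay form R_n ≈ a + b * n**(-c); a is the asymptotic error rate."""
    return a + b * np.power(np.asarray(n, dtype=float), -c)

def fit_asymptotic_rate(sizes, mean_errs, var_errs):
    """Weighted nonlinear least squares fit of the power-law decay curve.
    Returns the estimated asymptotic rate a-hat and the full parameter vector."""
    sigma = np.sqrt(np.maximum(var_errs, 1e-12))     # bootstrap std devs as weights
    p0 = (float(mean_errs[-1]), 0.5, 0.5)            # rough starting values (illustrative)
    params, _ = curve_fit(power_law, sizes, mean_errs,
                          p0=p0, sigma=sigma, absolute_sigma=True,
                          bounds=([0.0, 0.0, 0.0], [1.0, np.inf, np.inf]))
    return params[0], params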

PMH Distribution

• The Priebe, Marchette, Healy (PMH) distribution has known L* = 0.0653, d = 6
• Training set size n = 2000
• Test set size m = 2000
• Symbols in the plot are bootstrap estimates of the unconditional error rate
• Interval estimate:

Pima Indians

• The UCI Pima Indian Diabetes dataset has unknown L*, d = 8
• Training set size n = 500
• Test set size m = 268
• Symbols in the plot are bootstrap estimates of the unconditional error rate
• Interval estimate:
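For concreteness, a hypothetical end-to-end driver tying the sketches above together for a Pima-like dataset; the CSV path, column layout, split, and the choices of k, k', bootstrap sizes, and training sizes are all our assumptions, not values taken from the poster.

import numpy as np

# Hypothetical: load a two-class dataset with the label in the last column.
data = np.loadtxt("pima.csv", delimiter=",")           # path/format are assumptions
X, y = data[:, :-1], data[:, -1].astype(int)

# Split into n = 500 training and m = 268 test points (as in the poster).
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
train, test = perm[:500], perm[500:768]

k, k_prime = 5, 4                                      # illustrative choices of k and k'
sizes = [50, 100, 200, 300, 400, 500]                  # illustrative training sizes

# Upper endpoint: standard kNN rule.
_, means_knn, vars_knn = bootstrap_error_curve(X[train], y[train],
                                               X[test], y[test], sizes, k)
upper, _ = fit_asymptotic_rate(np.array(sizes), means_knn, vars_knn)

# Lower endpoint: Hellman's (k, k') rule with reject option would be plugged in
# analogously, replacing the kNN error estimator inside the bootstrap loop.
print(f"Estimated upper bound on L*: {upper:.4f}")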

References

[1] Devijver, P. "New error bounds with the nearest neighbor rule," IEEE Trans. on Information Theory, 25, 1979.

[2] Devroye, L. “Any discrimination rule can have an arbitrarily bad probability of error for finite sample size,” IEEE Trans. on Pattern Analysis & Machine Intelligence, 4, 1982.

[3] Hellman, M. “The nearest neighbor classification rule with a reject option,” IEEE Trans. on Systems Science & Cybernetics, 6, 1970.

[4] Priebe, C., D. Marchette, & D. Healy. “Integrated sensing and processing decision trees,” IEEE Trans. on Pattern Analysis & Machine Intelligence, 26, 2004.

[5] Snapp, R. & S. Venkatesh. “Asymptotic expansions of the k nearest neighbor risk,” Annals of Statistics, 26, 1998.