
2.20.2006, CS 678

An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants

Eric Bauer (Computer Science Department, Stanford University)
Ron Kohavi (Data Mining and Visualization, Silicon Graphics Inc.)

Presented by: Nick Hamatake and Karan Singh


Motivation

- Methods for voting classification algorithms show much promise
- Review the algorithms and present a large empirical study comparing several variants
- Improve understanding of why and when these algorithms work
- Explore practical problems


Outline

- Introduction
- The base inducers
- The voting algorithms
- Bias/Variance decomposition
- Experimental design
- Bagging and variants
- Boosting: AdaBoost and Arc-x4
- Future work
- Conclusions



Introduction

- Two types of voting algorithms
  - Perturb and Combine
  - Adaptively Resample and Combine
- The authors describe a large empirical study to understand why and when these algorithms affect classification error



The base inducers

- Four base inducers, implemented in MLC++
- Decision trees
  - MC4 (MLC++ C4.5)
  - MC4(1) (decision stump)
  - MC4(1)-disc (entropy discretization)
- Naïve-Bayes



Bagging (Breiman 1996)
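The original slide shows Breiman's bagging procedure as a figure. A minimal sketch, assuming a base inducer `induce(sample)` that returns a classifier function (names are illustrative, not the paper's MLC++ implementation):

```python
# A minimal sketch of bagging: train voters on bootstrap samples,
# then combine them by unweighted majority vote.
import random
from collections import Counter

def bagging(data, induce, n_classifiers=25):
    """Train n_classifiers voters, each on a bootstrap sample of data."""
    classifiers = []
    for _ in range(n_classifiers):
        # Bootstrap: draw |data| instances uniformly with replacement.
        sample = [random.choice(data) for _ in data]
        classifiers.append(induce(sample))

    def vote(x):
        # Final prediction: unweighted majority vote over the ensemble.
        return Counter(c(x) for c in classifiers).most_common(1)[0][0]
    return vote
```

The experiments in the paper use 25 sub-classifiers, which is why `n_classifiers` defaults to 25 here.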


AdaBoost.M1 (Freund & Schapire 1995)
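The original slide shows AdaBoost.M1 as a figure. A minimal sketch, assuming a weak learner `induce(data, weights)` that accepts instance weights (illustrative names; the handling of the ε = 0 "perfect classifier" case is a simplification):

```python
# A minimal sketch of AdaBoost.M1: reweight instances each round so that
# misclassified ones get more attention, then take a weighted vote.
import math

def adaboost_m1(data, induce, n_rounds=25):
    n = len(data)
    w = [1.0 / n] * n                       # start with uniform weights
    classifiers, betas = [], []
    for _ in range(n_rounds):
        c = induce(data, w)
        # Weighted training error of this round's classifier.
        eps = sum(wi for wi, (x, y) in zip(w, data) if c(x) != y)
        if eps >= 0.5:                      # learner too weak: M1 stops
            break
        if eps == 0.0:                      # "perfect" classifier edge case
            classifiers.append(c)
            betas.append(1e-10)
            break
        beta = eps / (1.0 - eps)
        classifiers.append(c)
        betas.append(beta)
        # Down-weight correctly classified instances, then renormalize.
        w = [wi * (beta if c(x) == y else 1.0) for wi, (x, y) in zip(w, data)]
        z = sum(w)
        w = [wi / z for wi in w]

    def vote(x):
        # Each classifier votes with weight log(1/beta).
        scores = {}
        for c, b in zip(classifiers, betas):
            lbl = c(x)
            scores[lbl] = scores.get(lbl, 0.0) + math.log(1.0 / b)
        return max(scores, key=scores.get)
    return vote
```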


Arc-x4 (Breiman 1996)
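The original slide shows Arc-x4 as a figure. A minimal sketch of the resampling version: instance i is drawn with probability proportional to 1 + m_i^4, where m_i is the number of times i has been misclassified so far, and the final vote is unweighted (illustrative names):

```python
# A minimal sketch of Arc-x4 (resampling version): adaptively resample
# so that frequently misclassified instances are drawn more often.
import random
from collections import Counter

def arc_x4(data, induce, n_rounds=25):
    n = len(data)
    mistakes = [0] * n            # m_i: times instance i was misclassified
    classifiers = []
    for _ in range(n_rounds):
        # Sampling probability proportional to 1 + m_i^4.
        p = [1 + m ** 4 for m in mistakes]
        sample = random.choices(data, weights=p, k=n)
        classifiers.append(induce(sample))
        # Update each instance's cumulative mistake count.
        c = classifiers[-1]
        for i, (x, y) in enumerate(data):
            if c(x) != y:
                mistakes[i] += 1

    def vote(x):
        # Final decision: plain unweighted majority vote.
        return Counter(c(x) for c in classifiers).most_common(1)[0][0]
    return vote
```

The reweighting variant discussed later passes the normalized probabilities to the inducer as instance weights instead of resampling.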



Bias/Variance decomposition

- Kohavi and Wolpert (1996): a two-stage sampling procedure
  1) a test set is split off from the training set
  2) the remaining data, D, is sampled repeatedly to estimate bias/variance on the test set
- the whole process can be repeated multiple times to improve the estimates
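The two-stage procedure above can be sketched as follows — a simplified estimate in the style of the Kohavi-Wolpert decomposition, assuming deterministic targets and an inducer `induce(sample)` (names and the sample-size choices are illustrative, not the paper's exact protocol):

```python
# Stage 1: hold out a test set once.  Stage 2: repeatedly sample the
# remaining data D, train, and record predictions, then estimate the
# squared-bias and variance terms from the prediction distribution.
import random

def bias_variance(data, induce, n_samples=10):
    data = list(data)
    random.shuffle(data)
    half = len(data) // 2
    test, pool = data[:half], data[half:]   # stage 1: split off test set
    preds = [[] for _ in test]
    for _ in range(n_samples):              # stage 2: resample D and train
        sample = random.sample(pool, max(1, len(pool) // 2))
        c = induce(sample)
        for i, (x, _) in enumerate(test):
            preds[i].append(c(x))
    bias2 = var = 0.0
    for (x, y), ps in zip(test, preds):
        labels = set(ps) | {y}
        # P(Yhat = l | x), estimated over the repeated training samples.
        p_hat = {l: ps.count(l) / len(ps) for l in labels}
        # Deterministic target: P(Y = l | x) is an indicator on y.
        bias2 += 0.5 * sum((p_hat[l] - (1.0 if l == y else 0.0)) ** 2
                           for l in labels)
        var += 0.5 * (1.0 - sum(pi * pi for pi in p_hat.values()))
    n = len(test)
    return bias2 / n, var / n
```

A constant classifier that is always right has zero bias and zero variance under this estimate; a constant classifier that is always wrong has bias-squared of one and still zero variance.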



Experimental design

- all datasets have at least 1000 instances
- training-set size selected at a point where error was still decreasing
- at least half of the data used for the test set
- only 25 sub-classifiers voted for each algorithm
- the authors verified their implementations of the voting algorithms against the original papers


Experimental design: datasets and training set sizes used



Bagging Results

- Performance boost across the board
- Most gain due to reduced variance (but some bias too)
- Biggest effect on datasets with many attributes
- Little effect on large training sets
- Increased tree size (198 to 240)
- Naïve-Bayes: little to no change
  - 3% average relative decrease in error
  - already a highly stable technique


Why the increased tree size?

- Theory: less pruning due to duplicate data
- Test: bagged trees are actually smaller (by 25%) before pruning
- Idea: skip pruning
  - pruning dampens variance while increasing bias
  - bagging does well with high variance
  - Results: 11% increase in relative error for un-bagged trees; slight decrease in error for bagged trees


Bagging Probabilities (p-Bagging)

- Motivation: MC4 can generate probabilities; why not use them?
- Skipping the pruning process should help predict probabilities even better (why?)
- Worked well
  - 2%, 4%, and 8% relative reduction in relative error for MC4, MC4(1), and MC4(1)-disc
  - no significant change for Naïve-Bayes (when the algorithm is tuned to predict probabilities)


Wagging and Backfitting

- Wagging (Weight Aggregation)
  - a different method of perturbing the training set: added noise vs. re-sampling
  - tunable: higher variance in the noise leads to reduced variance (but higher bias)
  - performs on par with bagging when tuned optimally
- Backfitting: use the whole training set to calculate leaf probabilities
  - average relative variance decrease of 11% (3% in total error); improvement (or no change) in all models
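Wagging can be sketched briefly: instead of bootstrap resampling, every voter sees the full training set with noise-perturbed instance weights. This assumes a weight-accepting inducer `induce(data, weights)`; the Gaussian noise and its standard deviation are the tunable knob mentioned above, and all names are illustrative:

```python
# A minimal sketch of wagging: perturb instance weights with Gaussian
# noise (clipped at zero) instead of resampling, then majority-vote.
import random
from collections import Counter

def wagging(data, induce, n_classifiers=25, noise_std=2.0):
    classifiers = []
    for _ in range(n_classifiers):
        # Each voter trains on the whole set with noisy, non-negative weights;
        # larger noise_std effectively drops more instances (weight 0).
        w = [max(0.0, 1.0 + random.gauss(0.0, noise_std)) for _ in data]
        classifiers.append(induce(data, w))

    def vote(x):
        return Counter(c(x) for c in classifiers).most_common(1)[0][0]
    return vote
```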


Bagging: Summary

- Three new variants: disabling pruning, p-Bagging, backfitting
- Each variant improved performance; 4% relative decrease in error at the end
- Surprising reduction in bias for stumps
- Naïve-Bayes not helped much
- MSE improved dramatically with p-Bagging



Boosting: AdaBoost and Arc-x4

- but first, an example:
  - to get a better understanding of the process of boosting
  - to highlight the issue of numerical instabilities and underflows


Example

- "A training set of size 5000 was used with the shuttle dataset. The MC4 algorithm already has a relatively small test error of 0.38%."
- Update rule: for each x_j, divide weight(x_j) by 2ε_i if C_i(x_j) ≠ y_j, and by 2(1 − ε_i) otherwise
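The numerical danger in this rule can be illustrated with a hypothetical single-instance trace using the figures from the quote above (training-set size 5000, ε ≈ 0.0038): an always-correct instance is divided by 2(1 − ε) ≈ 2 every round, so its weight halves repeatedly until it underflows to 0.0 in IEEE double precision.

```python
# Illustration: how many rounds until an always-correct instance's
# weight underflows to zero under the normalization-free update rule.
import sys

eps = 0.0038
w = 1.0 / 5000                 # initial weight of one always-correct instance
rounds = 0
while w > 0.0:
    w /= 2 * (1 - eps)         # correct prediction: divide by 2(1 - eps)
    rounds += 1
# rounds is on the order of a thousand here.

# The fix described two slides later: clamp weights at a minimum
# instead of letting them vanish (the floor chosen here is illustrative).
min_weight = sys.float_info.min
w = 1.0 / 5000
for _ in range(2000):
    w = max(min_weight, w / (2 * (1 - eps)))
```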


Example: lesson learned …

- using the update rule that avoids the normalization step circumvents underflows early in the process, but they can still happen
- the authors clamp any instance whose weight falls below the minimum weight to that minimum weight


Boosting: AdaBoost results

- AdaBoosted MC4 gave an average relative error reduction of 27%
- AdaBoost beat backfit-p-Bagging (average absolute error: 9.8% vs. 10.1%)
- AdaBoost was not uniformly better for all datasets


Boosting: AdaBoost results

- MC4(1) with boosting fails
- AdaBoosted MC4(1)-disc gave an average relative error reduction of 31%
- AdaBoosted Naïve-Bayes gave an average relative error reduction of 24%


Boosting: Arc-x4 results

- two versions:
  - Arc-x4-reweight
  - Arc-x4-resample
- Arc-x4-resample MC4: 9.81% average error
- Arc-x4-reweight MC4: 10.86% average error
- Arc-x4-reweight has higher variance
- does worse on the same datasets as AdaBoost, likely due to noise
- Arc-x4 outperforms AdaBoost for MC4(1)-disc and Naïve-Bayes


Boosting: summary

- on average, boosting is better than bagging for the given datasets
- boosting is not uniformly better than bagging
- AdaBoost does not deal well with noise
- both algorithms reduced bias and variance for decision trees, but increased variance for Naïve-Bayes



Future work

- Improving robustness of boosting w.r.t. noise
- Are there better ways to handle a "perfect" classifier when boosting?
- Bagging & boosting seem to drift from Occam's razor; can we quantify this bias toward complex models?
- Are there cases when pruning helps bagging?
- Can stacking improve bagging & boosting even further?
- Applying boosting to other methods: kNN
- Other techniques of combining probabilities
- Can boosting be parallelized?



Conclusions

- Performance of bagging and boosting depends highly on the dataset
- Bagging is very helpful w.r.t. MSE
- Larger trees in AdaBoost trials correlated with greater success in reducing error
- Arc-x4 works far better with re-sampling, unlike other boosting methods