Maximizing Classifier Utility when Training Data is Costly


Transcript of Maximizing Classifier Utility when Training Data is Costly

Page 1: Maximizing Classifier Utility when Training Data is Costly

Maximizing Classifier Utility when Training Data is Costly

Gary M. Weiss, Ye Tian

Fordham University

UBDM 2006 Workshop, August 20, 2006

Page 2: Maximizing Classifier Utility when Training Data is Costly

Outline

- Introduction
  - Motivation, cost model
- Experimental Methodology
- Results
  - Adult data set
  - Progressive Sampling
- Related Work
- Future Work/Conclusion

Page 3: Maximizing Classifier Utility when Training Data is Costly

Motivation: Utility-Based Data Mining

- Concerned with the utility of the overall data mining process
  - A key cost is the cost of the training data
  - These costs are often ignored (except for active learning)
- We are the first to analyze the impact of a very simple cost model
  - In doing so we fill a hole in existing research
- Our cost model
  - A fixed cost for acquiring labeled training examples
  - No separate cost for class labels, missing features, etc.
  - Turney [1] called this the "cost of cases"
  - No control over which training examples are chosen (no active learning)

Page 4: Maximizing Classifier Utility when Training Data is Costly

Motivation (cont.)

- Efficient progressive sampling [2]
  - Determines the "optimal" training set size
  - Optimal is where the learning curve reaches a plateau
  - Assumes data acquisition costs are essentially zero
- What if the acquisition costs are significant?

Page 5: Maximizing Classifier Utility when Training Data is Costly

Motivating Examples

- Predicting customer behavior/buying potential
  - Training data from D&B and Ziff-Davis
  - These and other "information vendors" make money by selling information
- Poker playing
  - Learn about an opponent by playing him

Page 6: Maximizing Classifier Utility when Training Data is Costly

Outline

- Introduction
  - Motivation, cost model
- Experimental Methodology
- Results
  - Adult data set
  - Progressive Sampling
- Related Work
- Future Work/Conclusion

Page 7: Maximizing Classifier Utility when Training Data is Costly

Experiments

- Use C4.5 to determine the relationship between accuracy and training set size
  - 20 runs used to increase the reliability of results
  - Random sampling used to reduce the training set size (sketched below)
- For this talk we focus on the adult data set (~21,000 examples)
- We utilize a predetermined sampling schedule
- CPU times recorded, mainly for future work
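A minimal sketch of the resampling step, under the talk's stated methodology; the function and parameter names are ours, and training/scoring with C4.5 is not shown:

```python
import random

def training_samples(data, n, runs=20, seed=0):
    """Draw `runs` independent random training sets of size n from the
    full data set, matching the 20-run methodology described above."""
    rng = random.Random(seed)
    return [rng.sample(data, n) for _ in range(runs)]
```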

Page 8: Maximizing Classifier Utility when Training Data is Costly

Measuring Total Utility

Total cost = Data cost + Error cost = n·C_tr + e·|S|·C_err

where:
- n = number of training examples
- e = error rate
- |S| = number of examples in the score set
- C_tr = cost of a training example
- C_err = cost of an error

- We will know n and e for any experiment
- With domain knowledge one could estimate C_tr, C_err, and |S|, but we don't have this knowledge
  - So treat C_tr and C_err as parameters and vary them (see the sketch below)
  - Assume |S| = 100 with no loss of generality
  - If |S| is 100,000, then look at the results for C_err/1,000
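A minimal sketch of this cost model in Python (the function and parameter names are ours, not from the talk):

```python
def total_cost(n, error_rate, c_tr, c_err, score_set_size=100):
    """Total cost = n*C_tr + e*|S|*C_err, the cost model above."""
    return n * c_tr + error_rate * score_set_size * c_err

# 1,000 examples at C_tr = 1, a 17% error rate, and C_err = 1000:
# 1000*1 + 0.17*100*1000 = 18,000
print(total_cost(1000, 0.17, 1, 1000))  # -> 18000.0
```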

Page 9: Maximizing Classifier Utility when Training Data is Costly

Measuring Total Utility (cont.)

- Now we need only look at the cost ratio, C_tr:C_err
  - Typical values evaluated: 1:1, 1:1000, etc.
  - The relative cost ratio is C_err/C_tr
- Example (worked through below)
  - If the cost ratio is 1:1000, it is an even trade-off if buying 1,000 training examples eliminates 1 error
  - Alternatively: buying 1,000 examples is worth a 1% reduction in error rate (then we can ignore |S| = 100)
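Running the slide's 1:1000 example through the total_cost sketch above (the specific sizes and error rates are illustrative, not from the talk):

```python
# Cost ratio 1:1000 with |S| = 100: a 1% cut in the error rate saves
# 0.01 * 100 * 1000 = 1,000, exactly the price of 1,000 examples.
before = total_cost(n=5000, error_rate=0.20, c_tr=1, c_err=1000)  # 25,000
after  = total_cost(n=6000, error_rate=0.19, c_tr=1, c_err=1000)  # 25,000
assert before == after  # even trade-off
```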

Page 10: Maximizing Classifier Utility when Training Data is Costly

Outline

- Introduction
  - Motivation, cost model
- Experimental Methodology
- Results
  - Adult data set
  - Progressive Sampling
- Related Work
- Future Work/Conclusion

Page 11: Maximizing Classifier Utility when Training Data is Costly

Learning Curve

[Chart: accuracy (%) (75 to 87) vs. training set size (0 to 15,000) for the adult data set. Annotation: no plateau; change = 0.3%.]

Page 12: Maximizing Classifier Utility when Training Data is Costly

Utility Curves

[Chart: total cost (0 to 180,000) vs. training set size (0 to 16,000), one curve per cost ratio: 10:1, 1:1, 1:1000, 1:3000, 1:5000, 1:7500.]

Page 13: Maximizing Classifier Utility when Training Data is Costly

Utility Curves (Normalized Cost)

[Chart: normalized cost (0% to 100%) vs. training set size (0 to 16,000), one curve per cost ratio: 1:10, 1:100, 1:500, 1:1000, 1:5000, 1:50,000.]

Page 14: Maximizing Classifier Utility when Training Data is Costly

Optimal Training Set Size Curve

[Chart: optimal training set size (0 to 15,000) vs. relative cost (0 to 40,000) for the adult data set. Note: accuracy is shown near each data point; the values range from 84.8% to 85.9% (84.8%, 85.1%, 85.4%, 85.6%, 85.8%, 85.9%).]

Page 15: Maximizing Classifier Utility when Training Data is Costly

Value of Optimal Curve

- Even without specific cost information, this chart could be useful for a practitioner
  - It can put bounds on the appropriate training set size
- Analogous to Drummond and Holte's cost curves [3]
  - They looked at the cost ratio of false positives vs. false negatives
  - We look at the cost of errors vs. the cost of data
- Both types of curves allow the practitioner to understand the impact of the various costs (a sketch of how such a curve is computed follows)
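One plausible way to derive an optimal training set size curve from a measured learning curve, reusing the total_cost sketch from earlier (the function is ours, not the authors' code):

```python
def optimal_curve(learning_curve, relative_costs, score_set_size=100):
    """For each relative cost ratio RC = C_err/C_tr (taking C_tr = 1),
    return the training set size on the measured learning curve that
    minimizes total cost. learning_curve: (size, accuracy) pairs with
    accuracy as a fraction in [0, 1]."""
    return [
        (rc, min(learning_curve,
                 key=lambda p: total_cost(p[0], 1.0 - p[1], 1, rc,
                                          score_set_size))[0])
        for rc in relative_costs
    ]

# e.g. optimal_curve([(100, 0.78), (1000, 0.83), (9000, 0.86)],
#                    [1, 500, 10000])
```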

Page 16: Maximizing Classifier Utility when Training Data is Costly

Idealized Learning Curve

[Two charts. Left: an idealized learning curve, accuracy vs. training set size (0 to 5,000), following accuracy = n/(n + 1), where n is the training set size. Right: the resulting optimal training set size (in thousands) vs. relative cost (in millions), following optimal size = 10·sqrt(RC) − 1 (derived below).]
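The closed form in the right-hand chart follows directly from the cost model on Page 8; a short derivation, assuming C_tr = 1, |S| = 100, and accuracy written as a fraction:

```latex
\text{accuracy}(n) = \frac{n}{n+1}
  \;\Rightarrow\; e(n) = \frac{1}{n+1}, \qquad
T(n) = n\,C_{tr} + e(n)\,|S|\,C_{err} = n + \frac{100\,RC}{n+1}

\frac{dT}{dn} = 1 - \frac{100\,RC}{(n+1)^2} = 0
  \;\Rightarrow\; (n+1)^2 = 100\,RC
  \;\Rightarrow\; n^{*} = 10\sqrt{RC} - 1
```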

Page 17: Maximizing Classifier Utility when Training Data is Costly

Outline

- Introduction
  - Motivation, cost model
- Experimental Methodology
- Results
  - Adult data set
  - Progressive Sampling
- Related Work
- Future Work/Conclusion

Page 18: Maximizing Classifier Utility when Training Data is Costly

Progressive Sampling

- We want to find the optimal training set size
  - Need to determine when to stop acquiring data before acquiring all of it!
- Strategy: use a progressive sampling strategy
- Key issues:
  - When do we stop?
  - What sampling schedule should we use?

Page 19: Maximizing Classifier Utility when Training Data is Costly

Our Progressive Sampling Strategy

- We stop after the first increase in total cost
  - The result is therefore never optimal, but it is near-optimal if the learning curve is non-decreasing
- We evaluate 2 simple sampling schedules (sketched below)
  - S1: 10, 50, 100, 500, 1000, 2000, …, 9000, 10,000, 12,000, 14,000, …
  - S2: 50, 100, 200, 400, 800, 1600, …
  - S2 and S1 are similar given modest-sized data sets
- Could use an adaptive strategy
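A minimal sketch of this stopping rule, reusing total_cost from earlier; train_and_eval is a hypothetical callback that trains a classifier on n examples and returns its error rate:

```python
def progressive_sample(schedule, train_and_eval, c_tr, c_err,
                       score_set_size=100):
    """Buy data along the schedule; stop after the first increase in
    total cost and return the point just before it. Near-optimal when
    the learning curve is non-decreasing."""
    prev_n, prev_cost = None, float("inf")
    for n in schedule:
        cost = total_cost(n, train_and_eval(n), c_tr, c_err, score_set_size)
        if cost > prev_cost:          # first increase: stop buying data
            break
        prev_n, prev_cost = n, cost
    return prev_n, prev_cost

# The two schedules from the talk (S1 extended past 14,000 by pattern):
S1 = [10, 50, 100, 500] + list(range(1000, 10001, 1000)) \
     + list(range(12000, 20001, 2000))
S2 = [50 * 2 ** i for i in range(9)]  # 50, 100, 200, ..., 12,800
```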

Page 20: Maximizing Classifier Utility when Training Data is Costly

Adult Data Set: S1 vs. Straw Man

[Bar chart: total cost (in 100Ks, 0.0 to 1.5) for the S1 strategy vs. the straw-man strategy at cost ratios 1:1, 1:20, 1:500, 1:1000, 1:5000, and 1:10000.]

Page 21: Maximizing Classifier Utility when Training Data is Costly

Progressive Sampling Conclusions

- We can use progressive sampling to determine a near-optimal training set size
- Effectiveness depends mainly on how well behaved (i.e., non-decreasing) the learning curve is
- The sampling schedule/batch size is also important
  - Finer granularity requires more CPU time
  - But if data is costly, CPU time is most likely less expensive
  - In our experiments, cumulative CPU time was under 1 minute

Page 22: Maximizing Classifier Utility when Training Data is Costly

Related Work

- Efficient progressive sampling [2]
  - Tries to efficiently find the asymptote of the learning curve
  - That work in effect assigns data a cost of ε: stop only when added data has no benefit
- Active learning
  - Similar in that data cost is factored in, but the setting is different
  - The user has control over which examples are selected or which features are measured
  - Does not address the simple "cost of cases" scenario
- Finding the best class distribution when training data is costly [4]
  - Assumes the training set size is limited, but the size is pre-specified
  - Finds the best class distribution to maximize performance

Page 23: Maximizing Classifier Utility when Training Data is Costly

Limitations/Future Work

- Improvements:
  - Bigger data sets, where the learning curve plateaus
  - More sophisticated sampling schemes
  - Incorporate cost-sensitive learning (cost of FP ≠ cost of FN)
  - Generate better-behaved learning curves
- Include CPU time in the utility metric
- Analyze other cost models
- Study the learning curves
- Real-world motivating examples, perhaps with cost information

Page 24: Maximizing Classifier Utility when Training Data is Costly

Conclusion

- We analyze the impact of training data cost on the classification process
- We introduce new ways of visualizing the impact of data cost
  - Utility curves
  - Optimal training set size curves
- We show that progressive sampling can be used to help learn a near-optimal classifier

Page 25: Maximizing Classifier Utility when Training Data is Costly

We Want Feedback

- We are continuing this work
- Clearly many minor enhancements are possible; feel free to suggest more
- Any major new directions/extensions?
- What, if anything, is most interesting?
- Any really good motivating examples that you are familiar with?

Page 26: Maximizing Classifier Utility when Training Data is Costly

Questions?

If I have run out of time, please find me during the break!!

Page 27: Maximizing Classifier Utility when Training Data is Costly

References

1. P. Turney (2000). Types of cost in inductive concept learning. Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning.
2. F. Provost, D. Jensen & T. Oates (1999). Efficient progressive sampling. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
3. C. Drummond & R. Holte (2000). Explicitly representing expected cost: an alternative to ROC representation. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 198-207.
4. G. Weiss & F. Provost (2003). Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315-354.

Page 28: Maximizing Classifier Utility when Training Data is Costly

Learning Curves for Large Data Sets

[Chart: accuracy (%) (50 to 90) vs. training set size (0 to 15,000) for the network1, blackjack, coding, adult, and boa1 data sets.]

Page 29: Maximizing Classifier Utility when Training Data is Costly

Optimal Curves for Large Data Sets

[Chart: optimal training set size (0 to 15,000) vs. relative cost (0 to 20,000) for the coding, network1, boa1, and blackjack data sets.]

Page 30: Maximizing Classifier Utility when Training Data is Costly

Learning Curves for Small Data Sets

[Chart: accuracy (%) (50 to 100) vs. training set size (0 to 2,500) for the breast-wisc, german, move, kr-vs-kp, and crx data sets.]

Page 31: Maximizing Classifier Utility when Training Data is Costly

Optimal Curves for Small Data Sets

[Chart: optimal training set size (0 to 2,500) vs. relative cost (0 to 3,500) for the move, kr-vs-kp, german, crx, and breast-wisc data sets.]

Page 32: Maximizing Classifier Utility when Training Data is Costly

Results for Adult Data Set

Training set size, total cost, and CPU time (in seconds) for the Optimal-S1, S1, and S2 strategies at each relative cost ratio:

Relative    | Optimal-S1              | S1                      | S2
Cost Ratio  | Size    Cost     CPU    | Size   Cost     CPU     | Size    Cost     CPU
1           | 10      34       0.00   | 50     74       0.00    | 100     122      0.00
10          | 10      25       0.00   | 50     292      0.00    | 100     319      0.00
20          | 500     2,233    0.20   | 50     2,470    0.00    | 100     538      0.00
200         | 500     3,966    0.20   | 1,000  4,266    0.53    | 800     4,060    0.40
500         | 500     9,165    0.20   | 2,000  9,945    1.23    | 1,600   9,480    0.92
5,000       | 5,000   79,450   4.17   | 6,000  79,800   5.27    | 12,800  83,700   14.84
10,000      | 9,000   152,900  9.15   | 7,000  154,700  6.48    | 12,800  154,600  14.84
15,000      | 9,000   224,850  9.15   | 7,000  228,550  6.48    | 15,960  226,860  20.88
20,000      | 9,000   296,800  9.15   | 7,000  302,400  6.48    | 15,960  297,160  20.88
50,000      | 15,960  721,460  20.89  | 7,000  745,500  6.48    | 15,960  718,960  20.88

Page 33: Maximizing Classifier Utility when Training Data is Costly

Optimal vs. S1 for Large Data Sets

Increase in total cost, S1 vs. S1-optimal:

Relative Cost Ratio | Adult   | Blackjack | Boa1  | Coding | Network1
1                   | 115.7%  | 53.2%     | 70.1% | 62.8%  | 91.0%
20                  | 10.6%   | 34.6%     | 5.1%  | 2.0%   | 0.7%
500                 | 8.5%    | 1.0%      | 1.2%  | 2.1%   | 2.7%
1,000               | 3.2%    | 2.6%      | 2.3%  | 0.6%   | 3.6%
5,000               | 0.4%    | 1.4%      | 4.7%  | 0.2%   | 1.5%
10,000              | 1.2%    | 1.1%      | 5.9%  | 0.0%   | 1.3%
15,000              | 1.6%    | 1.6%      | 6.3%  | 0.0%   | 1.2%
20,000              | 1.9%    | 1.9%      | 6.5%  | 0.0%   | 1.1%
50,000              | 3.3%    | 0.7%      | 6.9%  | 0.0%   | 1.0%