Maximizing Classifier Utility when Training Data is Costly


Transcript of Maximizing Classifier Utility when Training Data is Costly

Page 1: Maximizing Classifier Utility when Training Data is Costly

Maximizing Classifier Utility when Training Data is Costly

Gary M. Weiss, Ye Tian

Fordham University

UBDM 2006 Workshop, August 20, 2006

Page 2: Maximizing Classifier Utility when Training Data is Costly

Outline

- Introduction
  - Motivation, cost model
- Experimental Methodology
- Results
  - Adult data set
  - Progressive Sampling
- Related Work
- Future Work/Conclusion

Page 3: Maximizing Classifier Utility when Training Data is Costly

Motivation: Utility-Based Data Mining

- Concerned with the utility of the overall data mining process
  - A key cost is the cost of the training data
  - These costs are often ignored (except for active learning)
- We are the first to analyze the impact of a very simple cost model
  - In doing so we fill a hole in existing research
- Our cost model
  - A fixed cost for acquiring labeled training examples
  - No separate cost for class labels, missing features, etc.
  - Turney [1] called this the "cost of cases"
  - No control over which training examples are chosen (no active learning)

Page 4: Maximizing Classifier Utility when Training Data is Costly

Motivation (cont.)

- Efficient progressive sampling [2]
  - Determines the "optimal" training set size
  - Optimal is where the learning curve reaches a plateau
  - Assumes data acquisition costs are essentially zero
- What if the acquisition costs are significant?

Page 5: Maximizing Classifier Utility when Training Data is Costly

Motivating Examples

- Predicting customer behavior/buying potential
  - Training data from D&B and Ziff-Davis
  - These and other "information vendors" make money by selling information
- Poker playing
  - Learn about an opponent by playing him

Page 6: Maximizing Classifier Utility when Training Data is Costly

Outline

- Introduction
  - Motivation, cost model
- Experimental Methodology
- Results
  - Adult data set
  - Progressive Sampling
- Related Work
- Future Work/Conclusion

Page 7: Maximizing Classifier Utility when Training Data is Costly

Experiments

- Use C4.5 to determine the relationship between accuracy and training set size
  - 20 runs used to increase the reliability of results
  - Random sampling used to reduce the training set size (sketched below)
- For this talk we focus on the adult data set (~21,000 examples)
- We utilize a predetermined sampling schedule
- CPU times recorded, mainly for future work
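A minimal sketch of the resampling step, under the talk's stated methodology; the function and parameter names are ours, and training/scoring with C4.5 is not shown:

```python
import random

def training_samples(data, n, runs=20, seed=0):
    """Draw `runs` independent random training sets of size n from the
    full data set, matching the 20-run methodology described above."""
    rng = random.Random(seed)
    return [rng.sample(data, n) for _ in range(runs)]
```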

Page 8: Maximizing Classifier Utility when Training Data is Costly

Measuring Total Utility

Total cost = Data cost + Error cost = n·C_tr + e·|S|·C_err

where:
- n = number of training examples
- e = error rate
- |S| = number of examples in the score set
- C_tr = cost of a training example
- C_err = cost of an error

- We will know n and e for any experiment
- With domain knowledge one could estimate C_tr, C_err, and |S|, but we don't have this knowledge
  - So treat C_tr and C_err as parameters and vary them (see the sketch below)
  - Assume |S| = 100 with no loss of generality
  - If |S| is 100,000, then look at the results for C_err/1,000
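A minimal sketch of this cost model in Python (the function and parameter names are ours, not from the talk):

```python
def total_cost(n, error_rate, c_tr, c_err, score_set_size=100):
    """Total cost = n*C_tr + e*|S|*C_err, the cost model above."""
    return n * c_tr + error_rate * score_set_size * c_err

# 1,000 examples at C_tr = 1, a 17% error rate, and C_err = 1000:
# 1000*1 + 0.17*100*1000 = 18,000
print(total_cost(1000, 0.17, 1, 1000))  # -> 18000.0
```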

Page 9: Maximizing Classifier Utility when Training Data is Costly

Measuring Total Utility (cont.)

- Now we need only look at the cost ratio, C_tr:C_err
  - Typical values evaluated: 1:1, 1:1000, etc.
  - The relative cost ratio is C_err/C_tr
- Example (worked through below)
  - If the cost ratio is 1:1000, it is an even trade-off if buying 1,000 training examples eliminates 1 error
  - Alternatively: buying 1,000 examples is worth a 1% reduction in error rate (then we can ignore |S| = 100)
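Running the slide's 1:1000 example through the total_cost sketch above (the specific sizes and error rates are illustrative, not from the talk):

```python
# Cost ratio 1:1000 with |S| = 100: a 1% cut in the error rate saves
# 0.01 * 100 * 1000 = 1,000, exactly the price of 1,000 examples.
before = total_cost(n=5000, error_rate=0.20, c_tr=1, c_err=1000)  # 25,000
after  = total_cost(n=6000, error_rate=0.19, c_tr=1, c_err=1000)  # 25,000
assert before == after  # even trade-off
```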

Page 10: Maximizing Classifier Utility when Training Data is Costly

Outline

- Introduction
  - Motivation, cost model
- Experimental Methodology
- Results
  - Adult data set
  - Progressive Sampling
- Related Work
- Future Work/Conclusion

Page 11: Maximizing Classifier Utility when Training Data is Costly

Learning Curve

[Chart: accuracy (%) (75 to 87) vs. training set size (0 to 15,000) for the adult data set. Annotation: no plateau; change = 0.3%.]

Page 12: Maximizing Classifier Utility when Training Data is Costly

Utility Curves

[Chart: total cost (0 to 180,000) vs. training set size (0 to 16,000), one curve per cost ratio: 10:1, 1:1, 1:1000, 1:3000, 1:5000, 1:7500.]

Page 13: Maximizing Classifier Utility when Training Data is Costly

Utility Curves (Normalized Cost)

[Chart: normalized cost (0% to 100%) vs. training set size (0 to 16,000), one curve per cost ratio: 1:10, 1:100, 1:500, 1:1000, 1:5000, 1:50,000.]

Page 14: Maximizing Classifier Utility when Training Data is Costly

Optimal Training Set Size Curve

[Chart: optimal training set size (0 to 15,000) vs. relative cost (0 to 40,000) for the adult data set. Note: accuracy is shown near each data point; the values range from 84.8% to 85.9% (84.8%, 85.1%, 85.4%, 85.6%, 85.8%, 85.9%).]

Page 15: Maximizing Classifier Utility when Training Data is Costly

Value of Optimal Curve

- Even without specific cost information, this chart could be useful for a practitioner
  - It can put bounds on the appropriate training set size
- Analogous to Drummond and Holte's cost curves [3]
  - They looked at the cost ratio of false positives vs. false negatives
  - We look at the cost of errors vs. the cost of data
- Both types of curves allow the practitioner to understand the impact of the various costs (a sketch of how such a curve is computed follows)
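One plausible way to derive an optimal training set size curve from a measured learning curve, reusing the total_cost sketch from earlier (the function is ours, not the authors' code):

```python
def optimal_curve(learning_curve, relative_costs, score_set_size=100):
    """For each relative cost ratio RC = C_err/C_tr (taking C_tr = 1),
    return the training set size on the measured learning curve that
    minimizes total cost. learning_curve: (size, accuracy) pairs with
    accuracy as a fraction in [0, 1]."""
    return [
        (rc, min(learning_curve,
                 key=lambda p: total_cost(p[0], 1.0 - p[1], 1, rc,
                                          score_set_size))[0])
        for rc in relative_costs
    ]

# e.g. optimal_curve([(100, 0.78), (1000, 0.83), (9000, 0.86)],
#                    [1, 500, 10000])
```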

Page 16: Maximizing Classifier Utility when Training Data is Costly

Idealized Learning Curve

[Two charts. Left: an idealized learning curve, accuracy vs. training set size (0 to 5,000), following accuracy = n/(n + 1), where n is the training set size. Right: the resulting optimal training set size (in thousands) vs. relative cost (in millions), following optimal size = 10·sqrt(RC) − 1 (derived below).]
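The closed form in the right-hand chart follows directly from the cost model on Page 8; a short derivation, assuming C_tr = 1, |S| = 100, and accuracy written as a fraction:

```latex
\text{accuracy}(n) = \frac{n}{n+1}
  \;\Rightarrow\; e(n) = \frac{1}{n+1}, \qquad
T(n) = n\,C_{tr} + e(n)\,|S|\,C_{err} = n + \frac{100\,RC}{n+1}

\frac{dT}{dn} = 1 - \frac{100\,RC}{(n+1)^2} = 0
  \;\Rightarrow\; (n+1)^2 = 100\,RC
  \;\Rightarrow\; n^{*} = 10\sqrt{RC} - 1
```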

Page 17: Maximizing Classifier Utility when Training Data is Costly

Outline

- Introduction
  - Motivation, cost model
- Experimental Methodology
- Results
  - Adult data set
  - Progressive Sampling
- Related Work
- Future Work/Conclusion

Page 18: Maximizing Classifier Utility when Training Data is Costly

Progressive Sampling

- We want to find the optimal training set size
  - Need to determine when to stop acquiring data before acquiring all of it!
- Strategy: use a progressive sampling strategy
- Key issues:
  - When do we stop?
  - What sampling schedule should we use?

Page 19: Maximizing Classifier Utility when Training Data is Costly

Our Progressive Sampling Strategy

- We stop after the first increase in total cost
  - The result is therefore never optimal, but it is near-optimal if the learning curve is non-decreasing
- We evaluate 2 simple sampling schedules (sketched below)
  - S1: 10, 50, 100, 500, 1000, 2000, …, 9000, 10,000, 12,000, 14,000, …
  - S2: 50, 100, 200, 400, 800, 1600, …
  - S2 and S1 are similar given modest-sized data sets
- Could use an adaptive strategy
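A minimal sketch of this stopping rule, reusing total_cost from earlier; train_and_eval is a hypothetical callback that trains a classifier on n examples and returns its error rate:

```python
def progressive_sample(schedule, train_and_eval, c_tr, c_err,
                       score_set_size=100):
    """Buy data along the schedule; stop after the first increase in
    total cost and return the point just before it. Near-optimal when
    the learning curve is non-decreasing."""
    prev_n, prev_cost = None, float("inf")
    for n in schedule:
        cost = total_cost(n, train_and_eval(n), c_tr, c_err, score_set_size)
        if cost > prev_cost:          # first increase: stop buying data
            break
        prev_n, prev_cost = n, cost
    return prev_n, prev_cost

# The two schedules from the talk (S1 extended past 14,000 by pattern):
S1 = [10, 50, 100, 500] + list(range(1000, 10001, 1000)) \
     + list(range(12000, 20001, 2000))
S2 = [50 * 2 ** i for i in range(9)]  # 50, 100, 200, ..., 12,800
```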

Page 20: Maximizing Classifier Utility when Training Data is Costly

Adult Data Set: S1 vs. Straw Man

[Bar chart: total cost (in 100Ks, 0.0 to 1.5) for the S1 strategy vs. the straw-man strategy at cost ratios 1:1, 1:20, 1:500, 1:1000, 1:5000, and 1:10000.]

Page 21: Maximizing Classifier Utility when Training Data is Costly

Progressive Sampling Conclusions

- We can use progressive sampling to determine a near-optimal training set size
- Effectiveness depends mainly on how well behaved (i.e., non-decreasing) the learning curve is
- The sampling schedule/batch size is also important
  - Finer granularity requires more CPU time
  - But if data is costly, CPU time is most likely less expensive
  - In our experiments, cumulative CPU time was under 1 minute

Page 22: Maximizing Classifier Utility when Training Data is Costly

Related Work

- Efficient progressive sampling [2]
  - Tries to efficiently find the asymptote of the learning curve
  - That work in effect assigns data a cost of ε: stop only when added data has no benefit
- Active learning
  - Similar in that data cost is factored in, but the setting is different
  - The user has control over which examples are selected or which features are measured
  - Does not address the simple "cost of cases" scenario
- Finding the best class distribution when training data is costly [4]
  - Assumes the training set size is limited, but the size is pre-specified
  - Finds the best class distribution to maximize performance

Page 23: Maximizing Classifier Utility when Training Data is Costly

Limitations/Future Work

- Improvements:
  - Bigger data sets, where the learning curve plateaus
  - More sophisticated sampling schemes
  - Incorporate cost-sensitive learning (cost of FP ≠ cost of FN)
  - Generate better-behaved learning curves
- Include CPU time in the utility metric
- Analyze other cost models
- Study the learning curves
- Real-world motivating examples, perhaps with cost information

Page 24: Maximizing Classifier Utility when Training Data is Costly

Conclusion

- We analyze the impact of training data cost on the classification process
- We introduce new ways of visualizing the impact of data cost
  - Utility curves
  - Optimal training set size curves
- We show that progressive sampling can be used to help learn a near-optimal classifier

Page 25: Maximizing Classifier Utility when Training Data is Costly

We Want Feedback

- We are continuing this work
- Clearly many minor enhancements are possible; feel free to suggest more
- Any major new directions/extensions?
- What, if anything, is most interesting?
- Any really good motivating examples that you are familiar with?

Page 26: Maximizing Classifier Utility when Training Data is Costly

Questions?

If I have run out of time, please find me during the break!!

Page 27: Maximizing Classifier Utility when Training Data is Costly

References

1. P. Turney (2000). Types of cost in inductive concept learning. Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning.
2. F. Provost, D. Jensen & T. Oates (1999). Efficient progressive sampling. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
3. C. Drummond & R. Holte (2000). Explicitly representing expected cost: an alternative to ROC representation. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 198-207.
4. G. Weiss & F. Provost (2003). Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315-354.

Page 28: Maximizing Classifier Utility when Training Data is Costly

Learning Curves for Large Data Sets

[Chart: accuracy (%) (50 to 90) vs. training set size (0 to 15,000) for the network1, blackjack, coding, adult, and boa1 data sets.]

Page 29: Maximizing Classifier Utility when Training Data is Costly

Optimal Curves for Large Data Sets

[Chart: optimal training set size (0 to 15,000) vs. relative cost (0 to 20,000) for the coding, network1, boa1, and blackjack data sets.]

Page 30: Maximizing Classifier Utility when Training Data is Costly

Learning Curves for Small Data Sets

[Chart: accuracy (%) (50 to 100) vs. training set size (0 to 2,500) for the breast-wisc, german, move, kr-vs-kp, and crx data sets.]

Page 31: Maximizing Classifier Utility when Training Data is Costly

Optimal Curves for Small Data Sets

[Chart: optimal training set size (0 to 2,500) vs. relative cost (0 to 3,500) for the move, kr-vs-kp, german, crx, and breast-wisc data sets.]

Page 32: Maximizing Classifier Utility when Training Data is Costly

Results for Adult Data Set

Training set size, total cost, and CPU time (in seconds) for the Optimal-S1, S1, and S2 strategies at each relative cost ratio:

Relative    | Optimal-S1              | S1                      | S2
Cost Ratio  | Size    Cost     CPU    | Size   Cost     CPU     | Size    Cost     CPU
1           | 10      34       0.00   | 50     74       0.00    | 100     122      0.00
10          | 10      25       0.00   | 50     292      0.00    | 100     319      0.00
20          | 500     2,233    0.20   | 50     2,470    0.00    | 100     538      0.00
200         | 500     3,966    0.20   | 1,000  4,266    0.53    | 800     4,060    0.40
500         | 500     9,165    0.20   | 2,000  9,945    1.23    | 1,600   9,480    0.92
5,000       | 5,000   79,450   4.17   | 6,000  79,800   5.27    | 12,800  83,700   14.84
10,000      | 9,000   152,900  9.15   | 7,000  154,700  6.48    | 12,800  154,600  14.84
15,000      | 9,000   224,850  9.15   | 7,000  228,550  6.48    | 15,960  226,860  20.88
20,000      | 9,000   296,800  9.15   | 7,000  302,400  6.48    | 15,960  297,160  20.88
50,000      | 15,960  721,460  20.89  | 7,000  745,500  6.48    | 15,960  718,960  20.88

Page 33: Maximizing Classifier Utility when Training Data is Costly

Optimal vs. S1 for Large Data Sets

Increase in total cost, S1 vs. S1-optimal:

Relative Cost Ratio | Adult   | Blackjack | Boa1  | Coding | Network1
1                   | 115.7%  | 53.2%     | 70.1% | 62.8%  | 91.0%
20                  | 10.6%   | 34.6%     | 5.1%  | 2.0%   | 0.7%
500                 | 8.5%    | 1.0%      | 1.2%  | 2.1%   | 2.7%
1,000               | 3.2%    | 2.6%      | 2.3%  | 0.6%   | 3.6%
5,000               | 0.4%    | 1.4%      | 4.7%  | 0.2%   | 1.5%
10,000              | 1.2%    | 1.1%      | 5.9%  | 0.0%   | 1.3%
15,000              | 1.6%    | 1.6%      | 6.3%  | 0.0%   | 1.2%
20,000              | 1.9%    | 1.9%      | 6.5%  | 0.0%   | 1.1%
50,000              | 3.3%    | 0.7%      | 6.9%  | 0.0%   | 1.0%