Maximizing Classifier Utility when Training Data is Costly
Gary M. Weiss, Ye Tian
Fordham University
UBDM 2006 Workshop, August 20, 2006

Outline
Introduction
  Motivation, cost model
Experimental Methodology
Results
  Adult data set
Progressive Sampling
Related Work
Future Work/Conclusion

Motivation
Utility-Based Data Mining
  Concerned with utility of overall data mining process
  A key cost is the cost of training data
    These costs are often ignored (except for active learning)
  We are the first to analyze the impact of a very simple cost model
    In doing so we fill a hole in existing research
Our cost model
  A fixed cost for acquiring labeled training examples
    No separate cost for class labels, missing features, etc.
    Turney [1] called this the “cost of cases”
  No control over which training examples are chosen
    No active learning

Motivation (cont.)
Efficient progressive sampling [2]
  Determines “optimal” training set size
  Optimal is where the learning curve reaches a plateau
  Assumes data acquisition costs are essentially zero
What if the acquisition costs are significant?

Motivating Examples
Predicting customer behavior/buying potential
  Training data from D&B and Ziff-Davis
  These and other “information vendors” make money by selling information
Poker playing
  Learn about an opponent by playing him

Experiments
Use C4.5 to determine relationship between accuracy and training set size
  20 runs used to increase reliability of results
  Random sampling to reduce training set size
For this talk we focus on adult data set
  ~21,000 examples
We utilize a predetermined sampling schedule
CPU times recorded, mainly for future work

Measuring Total Utility
Total cost = Data Cost + Error Cost
  $\text{Total cost} = n \cdot C_{tr} + e \cdot |S| \cdot C_{err}$
  $n$ = number of training examples
  $e$ = error rate
  $|S|$ = number of examples in score set
  $C_{tr}$ = cost of a training example
  $C_{err}$ = cost of an error
Will know $n$ and $e$ for any experiment
With domain knowledge can estimate $C_{tr}$, $C_{err}$, $|S|$
  But we don’t have this knowledge
  Treat $C_{tr}$ and $C_{err}$ as parameters and vary them
  Assume $|S| = 100$ with no loss of generality
    If $|S|$ is 100,000, the results shown for a given $C_{err}$ correspond to an actual error cost of $C_{err}/1{,}000$, since only the product $|S| \cdot C_{err}$ matters
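
The cost model above is small enough to write down directly. The following is a minimal sketch, not the authors' code; the function name `total_cost` and the example inputs (1,000 training examples, 15% error rate) are assumptions chosen only for illustration.

```python
def total_cost(n, error_rate, c_tr, c_err, score_size=100):
    """Total cost = data cost + error cost = n*C_tr + e*|S|*C_err."""
    data_cost = n * c_tr
    error_cost = error_rate * score_size * c_err
    return data_cost + error_cost


# Hypothetical inputs: 1,000 training examples, 15% error, cost ratio 1:1000.
print(total_cost(1000, 0.15, c_tr=1, c_err=1000))

# Only the product |S| * C_err matters: |S| = 100,000 with C_err = 1
# costs the same as |S| = 100 with C_err = 1,000.
print(total_cost(1000, 0.15, c_tr=1, c_err=1, score_size=100_000))
```

Because only the product of score set size and error cost enters the formula, fixing the score set size at 100 and varying the error cost covers every other score set size as well.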

Measuring Total Utility (cont.)
Now only look at cost ratio, $C_{tr}$:$C_{err}$
  Typical values evaluated: 1:1, 1:1000, etc.
Relative cost ratio is $C_{err}/C_{tr}$
Example
  If cost ratio is 1:1000 then even trade-off if buying 1000 training examples eliminates 1 error
  Alternatively: buying 1000 examples is worth a 1% reduction in error rate (then can ignore $|S| = 100$)
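
The break-even point in this example can be checked against the total cost formula. The worked version below assumes $C_{tr} = 1$, $C_{err} = 1000$ and $|S| = 100$; the symbols $\Delta n$ (examples bought) and $\Delta e$ (reduction in error rate) are introduced here for illustration and do not appear on the slides.

```latex
\begin{align*}
\Delta n \cdot C_{tr} &= \Delta e \cdot |S| \cdot C_{err} \\
1000 \cdot 1 &= \Delta e \cdot 100 \cdot 1000 \\
\Delta e &= 0.01 = 1\%
\end{align*}
```

With $|S| = 100$, a 1% reduction in error rate is exactly one error eliminated, which matches the first statement of the example.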

Learning Curve
[Figure: accuracy (%) vs. training set size (0 to 15,000). Annotations: no plateau; change = 0.3%.]

Utility Curves
[Figure: total cost vs. training set size (0 to 16,000), one curve per cost ratio: 10:1, 1:1, 1:1000, 1:3000, 1:5000, 1:7500.]

Utility Curves (Normalized Cost)
[Figure: normalized cost (0% to 100%) vs. training set size (0 to 16,000), one curve per cost ratio: 1:10, 1:100, 1:500, 1:1000, 1:5000, 1:50,000.]

Optimal Training Set Size Curve
[Figure: optimal training set size (0 to 15,000) vs. relative cost (0 to 40,000). Note: accuracy is shown near each data point (84.8%, 85.1%, 85.4%, 85.6%, 85.8%, 85.9%).]

Value of Optimal Curve
Even without specific cost information, this chart could be useful for a practitioner
  Can put bounds on appropriate training set size
Analogous to Drummond and Holte’s cost curves [3]
  They looked at the cost ratio of false positives to false negatives
  We look at the ratio of error cost to data cost
  Both types of curves allow the practitioner to understand the impact of the various costs
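
An optimal training set size curve can be produced from any measured learning curve by minimizing total cost at each relative cost. The sketch below is one way to do that under stated assumptions: the function name `optimal_size` is hypothetical, costs are expressed in units of $C_{tr}$, and the learning curve is a synthetic stand-in (error = 1/(n + 1)), not data from the experiments.

```python
def optimal_size(learning_curve, relative_cost, score_size=100):
    """Return the training set size with the lowest total cost.

    learning_curve maps training set size -> error rate.
    Costs are in units of C_tr, so C_tr = 1 and C_err = relative_cost.
    """
    def cost(n):
        return n + learning_curve[n] * score_size * relative_cost

    return min(learning_curve, key=cost)


# Synthetic learning curve (an assumption, not measured data).
sizes = [10, 50, 100, 500, 1000, 2000, 5000, 10000]
curve = {n: 1.0 / (n + 1) for n in sizes}

# One point of the optimal-size curve per relative cost.
for rc in [1, 100, 10_000, 1_000_000]:
    print(rc, optimal_size(curve, rc))
```

Reading the output as a table of (relative cost, optimal size) pairs gives the bounds mentioned above: a practitioner who only knows a range for the relative cost can read off a range of sensible training set sizes.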

Idealized learning curve
[Figure, left: accuracy (80 to 100) vs. training set size $n$ (0 to 5,000) for the idealized curve $\text{accuracy} = n/(n+1)$.]
[Figure, right: optimal training set size (in thousands, 0 to 100) vs. relative cost $RC$ (in millions, 0 to 100), where $\text{optimal} = 10\sqrt{RC} - 1$.]
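
The closed form for the optimal size follows from the total cost formula. The derivation below is a sketch that assumes $|S| = 100$ as elsewhere in the talk, writes $RC = C_{err}/C_{tr}$, and treats $n$ as continuous; the notation $TC(n)$ and $n^{*}$ is introduced here for illustration.

```latex
\begin{align*}
e(n) &= 1 - \frac{n}{n+1} = \frac{1}{n+1} \\
TC(n) &= n \cdot C_{tr} + \frac{|S| \cdot C_{err}}{n+1} \\
\frac{d\,TC}{dn} &= C_{tr} - \frac{|S| \cdot C_{err}}{(n+1)^2} = 0
  \quad\Longrightarrow\quad (n+1)^2 = |S| \cdot RC \\
n^{*} &= \sqrt{|S| \cdot RC} - 1 = 10\sqrt{RC} - 1 \qquad (|S| = 100)
\end{align*}
```

The second derivative, $2\,|S|\,C_{err}/(n+1)^3$, is positive, so this stationary point is a minimum of total cost. For example, $RC = 100$ million gives $n^{*} = 99{,}999$, consistent with the right-hand chart topping out near 100K.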

Progressive Sampling
We want to find the optimal training set size
  Need to determine when to stop acquiring data before acquiring all of it!
Strategy: use a progressive sampling strategy
Key issues:
  When do we stop?
  What sampling schedule should we use?

Our Progressive Sampling Strategy
We stop after first increase in total cost
  Results therefore never optimal, but near-optimal if learning curve is non-decreasing
We evaluate 2 simple sampling schedules
  S1: 10, 50, 100, 500, 1000, 2000, …, 9000, 10,000, 12,000, 14,000, …
  S2: 50, 100, 200, 400, 800, 1600, …
  S2 & S1 are similar given modest sized data sets
Could use an adaptive strategy
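
The stopping rule and the two schedules can be sketched in a few lines. This is an illustration under assumptions rather than the authors' implementation: `progressive_sample`, `schedule_s2` and `synthetic_error` are hypothetical names, S1 is written out only as far as the slide lists it (up to 14,000), and the error function is a synthetic stand-in for training a real classifier such as C4.5.

```python
# Schedule S1 as listed on the slide, written out up to 14,000.
S1 = [10, 50, 100, 500] + list(range(1000, 10000, 1000)) + [10000, 12000, 14000]


def schedule_s2(max_size, start=50):
    """Schedule S2: 50, 100, 200, 400, ... (doubling) up to max_size."""
    sizes = []
    n = start
    while n <= max_size:
        sizes.append(n)
        n *= 2
    return sizes


def progressive_sample(schedule, error_rate_at, relative_cost, score_size=100):
    """Walk the schedule and stop after the first increase in total cost.

    error_rate_at(n) is a hypothetical callback: acquire data up to size n,
    train a classifier, and return its error rate. Costs are in units of
    C_tr, so C_tr = 1 and C_err = relative_cost.
    Returns (size, cost) at the point where sampling stopped.
    """
    prev_cost = None
    size = cost = None
    for n in schedule:
        size = n
        cost = n + error_rate_at(n) * score_size * relative_cost
        if prev_cost is not None and cost > prev_cost:
            break  # data for size n is already bought, so this cost is paid
        prev_cost = cost
    return size, cost


def synthetic_error(n):
    """Stand-in learning curve (an assumption): error = 1 / (n + 1)."""
    return 1.0 / (n + 1)


print(progressive_sample(S1, synthetic_error, relative_cost=100))
print(progressive_sample(schedule_s2(16000), synthetic_error, relative_cost=100))
```

Because the rule only stops after the cost has gone up, the returned size is always one schedule step past the cheapest size seen, which is why the result is near-optimal rather than optimal and why the step size of the schedule matters.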

Adult Data Set: S1 vs. Straw Man
[Figure: total cost (in units of 100K, 0.0 to 1.5) vs. cost ratio (1:1, 1:20, 1:500, 1:1000, 1:5000, 1:10000) for the S1 strategy and the straw man strategy.]

Progressive Sampling Conclusions
We can use progressive sampling to determine a near optimal training set size
  Effectiveness mainly based on how well behaved the learning curve is (i.e., non-decreasing)
Sampling schedule/batch size is also important
  Finer granularity requires more CPU time
  But if data costly, CPU time most likely less expensive
  In our experiments, cumulative CPU time < 1 minute

Related Work
Efficient progressive sampling [2]
  It tries to efficiently find the asymptote
  That work has a data cost of ε
    Stop only when added data has no benefit
Active Learning
  Similar in that data cost is factored in but setting different
    User has control over which examples are selected or features measured
  Does not address simple “cost of cases” scenario
Find best class distribution when training data costly [4]
  Assumes training set size limited but size pre-specified
  Finds the best class distribution to maximize performance

Limitations/Future Work
Improvements:
  Bigger data sets where learning curve plateaus
  More sophisticated sampling schemes
  Incorporate cost-sensitive learning (cost FP ≠ FN)
  Generate better behaved learning curves
  Include CPU time in utility metric
  Analyze other cost models
  Study the learning curves
Real world motivating examples
  Perhaps with cost information

Conclusion
We analyze impact of training data cost on classification process
Introduce new ways of visualizing the impact of data cost
  Utility curves
  Optimal training set size curves
Show that we can use progressive sampling to help learn a near-optimal classifier

We Want Feedback
We are continuing this work
  Clearly many minor enhancements possible
    Feel free to suggest some more
  Any major new directions/extensions?
What, if anything, is most interesting?
Any really good motivating examples that you are familiar with?

Questions?
If I have run out of time, please find me during the break!!

References
1. P. Turney (2000). Types of cost in inductive concept learning. Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning.
2. F. Provost, D. Jensen & T. Oates (1999). Efficient progressive sampling. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
3. C. Drummond & R. Holte (2000). Explicitly Representing Expected Cost: An Alternative to ROC Representation. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 198-207.
4. G. Weiss & F. Provost (2003). Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19:315-354.

Learning Curves for Large Data Sets
[Figure: accuracy (%) (50 to 90) vs. training set size (0 to 15,000) for the network1, blackjack, coding, adult, and boa1 data sets.]

Optimal Curves for Large Data Sets
[Figure: optimal training set size (0 to 15,000) vs. relative cost (0 to 20,000) for the coding, network1, boa1, and blackjack data sets.]

Learning Curves for Small Data Sets
[Figure: accuracy (%) (50 to 100) vs. training set size (0 to 2,500) for the breast-wisc, german, move, kr-vs-kp, and crx data sets.]

Optimal Curves for Small Data Sets
[Figure: optimum training set size (0 to 2,500) vs. relative cost (0 to 3,500) for the move, kr-vs-kp, german, crx, and breast-wisc data sets.]

Results for Adult Data Set
Relative      Optimal-S1                 S1                         S2
Cost Ratio    Size     Cost      CPU     Size     Cost      CPU     Size     Cost      CPU
1             10       34        0.00    50       74        0.00    100      122       0.00
10            10       25        0.00    50       292       0.00    100      319       0.00
20            500      2,233     0.20    50       2,470     0.00    100      538       0.00
200           500      3,966     0.20    1,000    4,266     0.53    800      4,060     0.40
500           500      9,165     0.20    2,000    9,945     1.23    1,600    9,480     0.92
5,000         5,000    79,450    4.17    6,000    79,800    5.27    12,800   83,700    14.84
10,000        9,000    152,900   9.15    7,000    154,700   6.48    12,800   154,600   14.84
15,000        9,000    224,850   9.15    7,000    228,550   6.48    15,960   226,860   20.88
20,000        9,000    296,800   9.15    7,000    302,400   6.48    15,960   297,160   20.88
50,000        15,960   721,460   20.89   7,000    745,500   6.48    15,960   718,960   20.88

Optimal vs. S1 for Large Data Sets
Increase in total cost: S1 vs. S1-optimal
Relative
Cost Ratio    Adult     Blackjack   Boa1     Coding   Network1
1             115.7%    53.2%       70.1%    62.8%    91.0%
20            10.6%     34.6%       5.1%     2.0%     0.7%
500           8.5%      1.0%        1.2%     2.1%     2.7%
1,000         3.2%      2.6%        2.3%     0.6%     3.6%
5,000         0.4%      1.4%        4.7%     0.2%     1.5%
10,000        1.2%      1.1%        5.9%     0.0%     1.3%
15,000        1.6%      1.6%        6.3%     0.0%     1.2%
20,000        1.9%      1.9%        6.5%     0.0%     1.1%
50,000        3.3%      0.7%        6.9%     0.0%     1.0%
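
The percentages in the Adult column appear to be the relative increase of the S1 cost over the Optimal-S1 cost from the adult results table. The sketch below recomputes a few of them under that assumption; the helper name `pct_increase` is hypothetical, and the cost values are copied from the adult results table.

```python
def pct_increase(cost, optimal_cost):
    """Percentage by which `cost` exceeds `optimal_cost`."""
    return 100.0 * (cost - optimal_cost) / optimal_cost


# (relative cost, Optimal-S1 cost, S1 cost) from the adult results table.
adult_rows = [
    (500, 9165, 9945),
    (5000, 79450, 79800),
    (10000, 152900, 154700),
    (50000, 721460, 745500),
]

for rc, optimal, s1 in adult_rows:
    print(f"{rc:>6}: {pct_increase(s1, optimal):.1f}%")
```

For these four rows the recomputed values round to 8.5%, 0.4%, 1.2% and 3.3%, matching the Adult column.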