CPSC 340: Machine Learning and Data Mining
Decision Trees (Fall 2020)
Admin
• Assignment 1 is due Friday: start early.
• Waiting list people: you should be registered soon-ish.
  – Start on the assignment now, everybody currently on the waiting list will get in.
• Gradescope:
  – Use access code 9PX4B9.
  – You must use your [email protected] email alias (as listed in https://www.cs.ubc.ca/getacct) to identify yourself on Gradescope.
  – Failure to properly identify yourself will result in a zero for all homework and exam submissions made under a different identity.
• Course webpage: https://www.cs.ubc.ca/~fwood/CS340/
  – Sign up for Piazza.
• Tutorials and office hours have already started (see webpage for calendar).
Last Time: Data Representation and Exploration
• We discussed the example-feature representation:
  – Samples: another name we'll use for examples.
• We discussed summary statistics and visualizing data.
  Age  Job?  City  Rating  Income
  23   Yes   Van   A       22,000.00
  23   Yes   Bur   BBB     21,000.00
  22   No    Van   CC      0.00
  25   Yes   Sur   AAA     57,000.00
Last Time: Supervised Learning
• We discussed supervised learning:
  – Input for an example (a day of the week) is a set of features (quantities of food).
  – Output is a desired class label (whether or not we got sick).
• Goal of supervised learning:
  – Use data to find a model that outputs the right label based on the features.
  – Above, the model predicts whether foods will make you sick (even with new combinations).
  – This framework can be applied to any problem where we have input/output examples.
  Egg  Milk  Fish  Wheat  Shellfish  Peanuts  …  Sick?
  0    0.7   0     0.3    0          0           1
  0.3  0.7   0     0.6    0          0.01        1
  0    0     0     0.8    0          0           0
  0.3  0.7   1.2   0      0.10       0.01        1
  0.3  0     1.2   0.3    0.10       0.01        1
Decision Trees
• Decision trees are simple programs consisting of:
  – A nested sequence of "if-else" decisions based on the features (splitting rules).
  – A class label as a return value at the end of each sequence.
• Example decision tree:

  if (milk > 0.5) {
      return "sick"
  } else {
      if (egg > 1) return "sick"
      else return "not sick"
  }

Can draw sequences of decisions as a tree:
Supervised Learning as Writing a Program
• There are many possible decision trees.
  – We're going to search for one that is good at our supervised learning problem.
• So our input is data and the output will be a program.
  – This is called "training" the supervised learning model.
  – Different than the usual input/output specification for writing a program.
• Supervised learning is useful when you have lots of labeled data BUT:
  1. Problem is too complicated to write a program ourselves.
  2. Human expert can't explain why you assign certain labels.
  OR
  2. We don't have a human expert for the problem.
Learning a Decision Stump: "Search and Score"
• We'll start with "decision stumps":
  – Simple decision tree with 1 splitting rule based on thresholding 1 feature.
• How do we find the best "rule" (feature, threshold, and leaf labels)?
  1. Define a 'score' for the rule.
  2. Search for the rule with the best score.
Learning a Decision Stump: Accuracy Score
• Most intuitive score: classification accuracy.
  – "If we use this rule, how many examples do we label correctly?"
• Computing classification accuracy for (egg > 1) on the data below:
  – Find the most common labels if we use this rule:
    • When (egg > 1), we were "sick" 2 times out of 2.
    • When (egg ≤ 1), we were "not sick" 3 times out of 4.
  – Compute accuracy:
    • The accuracy ("score") of the rule (egg > 1) is 5 times out of 6.
• This "score" evaluates the quality of a rule.
  – We "learn" a decision stump by finding the rule with the best score.
  Milk  Fish  Egg  Sick?
  0.7   0     1    1
  0.7   0     2    1
  0     0     0    0
  0.7   1.2   0    0
  0     1.2   2    1
  0     0     0    0
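• A quick way to check this score, sketched in Julia (illustrative variable names; labels coded 1 = "sick", 0 = "not sick"):

  # Score the rule (egg > 1) on the 6 examples above.
  egg = [1, 2, 0, 0, 2, 0]   # egg column
  y   = [1, 1, 0, 0, 1, 0]   # labels
  satisfied = egg .> 1       # which examples satisfy the rule
  # Most common label on each side of the split:
  yes_label = sum(y[satisfied]) >= sum(satisfied) / 2 ? 1 : 0
  no_label  = sum(y[.!satisfied]) >= sum(.!satisfied) / 2 ? 1 : 0
  # Accuracy: number of examples the two leaves label correctly.
  correct = sum(y[satisfied] .== yes_label) + sum(y[.!satisfied] .== no_label)
  println(correct, "/", length(y))   # prints 5/6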
Learning a Decision Stump: By Hand
• Let's search for the decision stump maximizing classification score on this data:

  Milk  Fish  Egg  Sick?
  0.7   0     1    1
  0.7   0     2    1
  0     1.2   0    0
  0.7   1.2   0    0
  0     1.3   2    1
  0     0     0    0

• First we check the "baseline rule" of predicting the mode (no split): this gets 3/6 accuracy.
• If (milk > 0) predict "sick" (2/3) else predict "not sick" (2/3): 4/6 accuracy.
• If (fish > 0) predict "not sick" (2/3) else predict "sick" (2/3): 4/6 accuracy.
• If (fish > 1.2) predict "sick" (1/1) else predict "not sick" (3/5): 4/6 accuracy.
• If (egg > 0) predict "sick" (3/3) else predict "not sick" (3/3): 6/6 accuracy.
• If (egg > 1) predict "sick" (2/2) else predict "not sick" (3/4): 5/6 accuracy.
• Highest-scoring rule: (egg > 0) with leaves "sick" and "not sick".
• Notice we only need to test feature thresholds that happen in the data:
  – There is no point in testing the rule (egg > 3), it gets the "baseline" score.
  – There is no point in testing the rule (egg > 0.5), it gets the (egg > 0) score.
  – Also note that we don't need to test "<", since it would give equivalent rules.
Supervised Learning Notation (MEMORIZE THIS)
• Feature matrix 'X' has rows as examples, columns as features.
  – xij is feature 'j' for example 'i' (quantity of food 'j' on day 'i').
  – xi is the list of all features for example 'i' (all the quantities on day 'i').
  – xj is column 'j' of the matrix (the value of feature 'j' across all examples).
• Label vector 'y' contains the labels of the examples.
  – yi is the label of example 'i' (1 for "sick", 0 for "not sick").

  Egg  Milk  Fish  Wheat  Shellfish  Peanuts   Sick?
  0    0.7   0     0.3    0          0         1
  0.3  0.7   0     0.6    0          0.01      1
  0    0     0     0.8    0          0         0
  0.3  0.7   1.2   0      0.10       0.01      1
  0.3  0     1.2   0.3    0.10       0.01      1
Supervised Learning Notation (MEMORIZE THIS)
• Training phase:
  – Use 'X' and 'y' to find a 'model' (like a decision stump).
• Prediction phase:
  – Given an example xi, use the 'model' to predict a label 'ŷi' ("sick" or "not sick").
• Training error:
  – Fraction of times our prediction ŷi does not equal the true label yi.
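• The notation, written out in Julia for the table above (a toy illustration):

  # Feature matrix 'X': rows are examples (days), columns are features (foods).
  X = [0    0.7  0    0.3  0    0
       0.3  0.7  0    0.6  0    0.01
       0    0    0    0.8  0    0
       0.3  0.7  1.2  0    0.10 0.01
       0.3  0    1.2  0.3  0.10 0.01]
  y = [1, 1, 0, 1, 1]   # label vector: 1 = "sick", 0 = "not sick"

  X[3, 4]   # xij with i = 3, j = 4: quantity of wheat on day 3 (0.8)
  X[2, :]   # xi with i = 2: all features for example 2
  X[:, 1]   # column 1: the egg feature across all examples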
Decision Stump Learning Pseudo-Code
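• A minimal Julia sketch of the search-and-score loop (an illustration under the assumptions above, not the actual course code):

  # Most common label in 'y' (ties broken arbitrarily).
  mode_label(y) = argmax(c -> count(==(c), y), unique(y))

  # Fit a decision stump by "search and score" with the accuracy score.
  # X: n-by-d feature matrix, y: length-n label vector. Cost is O(ndk).
  function fit_stump(X, y)
      n, d = size(X)
      baseline = mode_label(y)                 # the "do nothing" baseline rule
      best = (score = count(==(baseline), y), feature = 0, threshold = -Inf,
              yes = baseline, no = baseline)
      for j in 1:d
          for t in unique(X[:, j])             # only thresholds that occur in the data
              sat = X[:, j] .> t               # examples satisfying (feature j > t)
              (any(sat) && any(.!sat)) || continue   # skip rules that don't split
              yes, no = mode_label(y[sat]), mode_label(y[.!sat])
              score = count(==(yes), y[sat]) + count(==(no), y[.!sat])
              if score > best.score
                  best = (score = score, feature = j, threshold = t, yes = yes, no = no)
              end
          end
      end
      return best   # leaf labels 'yes'/'no' for the rule (X[:, feature] > threshold)
  end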
Cost of Decision Stumps
• How much does this cost? Assume we have:
  – 'n' examples (days that we measured).
  – 'd' features (foods that we measured).
  – 'k' thresholds (>0, >1, >2, …) for each feature.
• Computing the score of one rule costs O(n):
  – We need to go through all 'n' examples to find the most common labels.
  – We need to go through all 'n' examples again to compute the accuracy.
  – See notes on the webpage for a review of "O(n)" notation.
• We compute the score for up to k*d rules ('k' thresholds for each of 'd' features):
  – So we need to do an O(n) operation k*d times, giving a total cost of O(ndk).
Cost of Decision Stumps
• Is a cost of O(ndk) good?
• The size of the input data is O(nd):
  – If 'k' is small then the cost is roughly the same as the cost of loading the data.
    • We should be happy about this: you can learn on any dataset you can load!
  – If 'k' is large then this could be too slow for large datasets.
• Example: if all our features are binary then k = 1, just test (feature > 0):
  – Cost of fitting a decision stump is O(nd), so we can fit huge datasets.
• Example: if all our features are numerical with unique values then k = n:
  – Cost of fitting a decision stump is O(n²d).
  – We don't like having n² because we want to fit datasets where 'n' is large!
• Bonus slides: how to reduce the cost in this case down to O(nd log n).
  – Basic idea: sort features and track labels. Allows us to fit decision stumps to huge datasets.
Decision Tree Learning
• Decision stumps have only 1 rule based on only 1 feature.
  – Very limited class of models: usually not very accurate for most tasks.
• Decision trees allow sequences of splits based on multiple features.
  – Very general class of models: can get very high accuracy.
  – However, it's computationally infeasible to find the best decision tree.
• Most common decision tree learning algorithm in practice:
  – Greedy recursive splitting.
Example of Greedy Recursive Splitting
• Start with the full dataset:

  Egg  Milk  …  Sick?
  0    0.7      1
  1    0.7      1
  0    0        0
  1    0.6      1
  1    0        0
  2    0.6      1
  0    1        1
  2    0        1
  0    0.3      0
  1    0.6      0
  2    0        1

• Find the decision stump with the best score: here, a split on (milk > 0.3).
• Split into two smaller datasets based on the stump:

  Examples with (milk ≤ 0.3):

  Egg  Milk  …  Sick?
  0    0        0
  1    0        0
  2    0        1
  0    0.3      0
  2    0        1

  Examples with (milk > 0.3):

  Egg  Milk  …  Sick?
  0    0.7      1
  1    0.7      1
  1    0.6      1
  2    0.6      1
  0    1        1
  1    0.6      0
Greedy Recursive Splitting
• We now have a decision stump and two datasets (shown above).
• Fit a decision stump to each leaf's data, then add these stumps to the tree.
Greedy Recursive Splitting
• This gives a "depth 2" decision tree: it splits the two datasets into four datasets:

  Egg  Milk  …  Sick?
  0    0        0
  1    0        0
  0    0.3      0

  Egg  Milk  …  Sick?
  2    0        1
  2    0        1

  Egg  Milk  …  Sick?
  0    0.7      1
  1    0.7      1
  1    0.6      1
  2    0.6      1

  Egg  Milk  …  Sick?
  1    0.6      0
Greedy Recursive Splitting
• We could try to split the four leaves to make a "depth 3" decision tree.
• We might continue splitting until:
  – The leaves each have only one label.
  – We reach a user-defined maximum depth.
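• Greedy recursive splitting as a Julia sketch (illustrative; 'fit_stump' is the stump-fitting sketch from earlier, and both stopping rules above are included):

  # Recursively fit stumps, splitting each leaf's data, up to 'max_depth'.
  function fit_tree(X, y; depth = 1, max_depth = 3)
      stump = fit_stump(X, y)
      # Stop if no split beats the baseline (e.g., the leaf has only one
      # label) or the maximum depth is reached: the stump itself is the leaf.
      if stump.feature == 0 || depth == max_depth
          return stump
      end
      sat = X[:, stump.feature] .> stump.threshold
      # Fit a (sub)tree on each of the two smaller datasets.
      no  = fit_tree(X[.!sat, :], y[.!sat]; depth = depth + 1, max_depth = max_depth)
      yes = fit_tree(X[sat, :], y[sat]; depth = depth + 1, max_depth = max_depth)
      return (stump = stump, yes = yes, no = no)
  end

• Prediction walks the tree: at each internal node, follow the 'yes' branch if the example satisfies the node's rule, otherwise the 'no' branch, until reaching a leaf.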
Which score function should a decision tree use?
• Shouldn't we just use the accuracy score?
  – For leaves: yes, just maximize accuracy.
  – For internal nodes: not necessarily.
• Maybe no simple rule like (egg > 0.5) improves accuracy.
  – But this doesn't necessarily mean we should stop!
Example Where Accuracy Fails
• Consider a dataset with 2 features and 2 classes ('x' and 'o').
  – Because there are 2 features, we can draw 'X' as a scatterplot.
  – Colours and shapes denote the class labels 'y'.
• A decision stump would divide space by a horizontal or vertical line.
  – Testing whether xi1 > t or whether xi2 > t.
• On this dataset no horizontal/vertical line improves accuracy.
  – The baseline is 'o', but we need to get many 'o' wrong to get one 'x' right.
Which score function should a decision tree use?
• The most common score in practice is "information gain".
  – "Choose the split that decreases the entropy of the labels the most."
• Information gain for the baseline rule ("do nothing") is 0.
  – Info gain is large if labels are "more predictable" ("less random") in the next layer.
• Even if it does not increase classification accuracy at one depth, we hope that it makes classification easier at the next depth.
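• In symbols (the standard definitions): for a rule splitting the 'n' examples into groups "yes" and "no",

  \mathrm{entropy}(y) = -\sum_{c} p_c \log p_c, \qquad p_c = \frac{\#\{i : y_i = c\}}{n}

  \mathrm{infogain} = \mathrm{entropy}(y) - \frac{n_{\mathrm{yes}}}{n}\,\mathrm{entropy}(y_{\mathrm{yes}}) - \frac{n_{\mathrm{no}}}{n}\,\mathrm{entropy}(y_{\mathrm{no}})

  so the baseline rule (everything in one group) gets info gain 0, and splits that make the groups purer get larger info gain.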
Discussion of Decision Tree Learning
• Advantages:
  – Easy to implement.
  – Interpretable.
  – Learning is fast, prediction is very fast.
  – Can elegantly handle a small number of missing values during training.
• Disadvantages:
  – Hard to find the optimal set of rules.
  – Greedy splitting is often not accurate, requires very deep trees.
• Issues:
  – Can you revisit a feature?
    • Yes, knowing other information could make a feature relevant again.
  – More complicated rules?
    • Yes, but searching for the best rule gets much more expensive.
  – What is the best score?
    • Info gain is the most popular and often works well, but is not always the best.
  – What if you get new data?
    • You could consider splitting if there is enough data at the leaves, but occasionally you might want to re-learn the whole tree or sub-trees.
  – What depth?
Summary
• Supervised learning:
  – Using data to write a program based on input/output examples.
• Decision trees: predict a label using a sequence of simple rules.
• Decision stumps: simple decision trees that are very fast to fit.
• Greedy recursive splitting: uses a sequence of stumps to fit a tree.
  – Very fast and interpretable, but not always the most accurate.
• Information gain: splitting score based on decreasing entropy.
• Next time: the most important ideas in machine learning.
Entropy Function
Other Considerations for the Food Allergy Example
• What types of preprocessing might we do?
  – Data cleaning: check for and fix missing/unreasonable values.
  – Summary statistics:
    • Can help identify "unclean" data.
    • Correlation might reveal an obvious dependence ("sick" ↔ "peanuts").
  – Data transformations:
    • Convert everything to the same scale? (e.g., grams)
    • Add foods from the day before? (maybe "sick" depends on multiple days)
    • Add the date? (maybe what makes you "sick" changes over time)
  – Data visualization: look at a scatterplot of each feature and the label.
    • Maybe the visualization will show something weird in the features.
    • Maybe the pattern is really obvious!
• What you do might depend on how much data you have:
  – Very little data:
    • Represent food by common allergenic ingredients (lactose, gluten, etc.)?
  – Lots of data:
    • Use more fine-grained features (bread from bakery vs. hamburger bun)?
Julia Decision Stump Code (not O(n log n) yet)
Going from O(n²d) to O(nd log n) for Numerical Features
• Do we have to compute the score from scratch for every rule?
  – As an example, assume we eat an integer number of eggs:
    • Then the rules (egg > 1) and (egg > 2) have the same decisions, except when (egg == 2).
• We can actually compute the best rule involving 'egg' in O(n log n):
  – Sort the examples based on 'egg', and use these positions to re-arrange 'y'.
  – Go through the sorted values in order, updating the counts of #sick and #not-sick that both satisfy and don't satisfy the rule.
  – With these counts, it's easy to compute the classification accuracy (see bonus slide).
• Sorting costs O(n log n) per feature.
• The total cost of updating counts is O(n) per feature.
• The total cost is reduced from O(n²d) to O(nd log n).
• This is a good runtime:
  – O(nd) is the size of the data, so this matches the cost of loading the data up to a log factor.
  – We can apply this algorithm to huge datasets.
How do we fit stumps in O(nd log n)?
• Let's say we're trying to find the best rule involving milk:

  Egg  Milk  …  Sick?
  0    0.7      1
  1    0.7      1
  0    0        0
  1    0.6      1
  1    0        0
  2    0.6      1
  0    1        1
  2    0        1
  0    0.3      0
  1    0.6      0
  2    0        1

• First grab the milk column and sort it (using the sort positions to re-arrange the sick column). This step costs O(n log n) due to sorting:

  Milk  Sick?
  0     0
  0     0
  0     0
  0     0
  0.3   0
  0.6   1
  0.6   1
  0.6   0
  0.7   1
  0.7   1
  1     1

• Now we'll go through the milk values in order, keeping track of #sick and #not-sick that are above/below the current value. E.g., #sick above 0.3 is 5.
• With these counts, the accuracy score is (sum of most common label above and below)/n.
How do we fit stumps in O(nd log n)?
• Start with the baseline rule (), which is always "satisfied":
  – If satisfied, #sick = 5 and #not-sick = 6. If not satisfied, #sick = 0 and #not-sick = 0.
  – This gives accuracy of (6+0)/n = 6/11.
• Next try the rule (milk > 0), and update the counts based on the 4 rows with milk = 0:
  – If satisfied, #sick = 5 and #not-sick = 2. If not satisfied, #sick = 0 and #not-sick = 4.
  – This gives accuracy of (5+4)/n = 9/11, which is better.
• Next try the rule (milk > 0.3), and update the counts based on the 1 row with milk = 0.3:
  – If satisfied, #sick = 5 and #not-sick = 1. If not satisfied, #sick = 0 and #not-sick = 5.
  – This gives accuracy of (5+5)/n = 10/11, which is better.
• (and keep going until you get to the end…)
How do we fit stumps in O(nd log n)?
• Notice that for each row, updating the counts only costs O(1).
  – Since there are O(n) rows, the total cost of updating counts is O(n).
• Instead of 2 labels (sick vs. not-sick), consider the case of 'k' labels:
  – Updating the counts still costs O(n), since each row has one label.
  – But computing the 'max' across the labels costs O(k), so the cost is O(kn).
• With 'k' labels, you can decrease the cost using a "max-heap" data structure:
  – Cost of getting the max is O(1), cost of updating the heap for a row is O(log k).
  – But k <= n (each row has only one label).
  – So the cost is in O(log n) for one row.
• Since the above shows we can find the best rule in one column in O(n log n), the total cost to find the best rule across all 'd' columns is O(nd log n).
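• A Julia sketch of this sorted counting scan for one feature (illustrative names; assumes labels coded 0/1 as in the example):

  # Best accuracy score for rules (x > t) on one feature column, in O(n log n).
  function best_rule_sorted(x, y)
      order = sortperm(x)                      # O(n log n) sort
      xs, ys = x[order], y[order]
      n = length(y)
      above = [count(==(c), ys) for c in 0:1]  # label counts satisfying the rule
      below = [0, 0]                           # label counts not satisfying it
      best_score, best_threshold = maximum(above), -Inf   # baseline rule
      for i in 1:n-1
          below[ys[i] + 1] += 1                # O(1): move row i below the threshold
          above[ys[i] + 1] -= 1
          xs[i] == xs[i + 1] && continue       # only score distinct thresholds
          score = maximum(below) + maximum(above)
          if score > best_score
              best_score, best_threshold = score, xs[i]
          end
      end
      return best_score, best_threshold
  end

• On the sorted milk column above, this reproduces the running scores 6/11, 9/11, 10/11 from the example.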
Can decision trees re-visit a feature?
• Yes.
  – Knowing (ice cream > 0.3) makes small milk quantities relevant.
Can decision trees have more complicated rules?
• Yes! Rules that depend on more than one feature:
• But now searching for the best rule can get expensive.
Can decision trees have more complicated rules?
• Yes! Rules that depend on more than one threshold:
• "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets":
  – Considered decision stumps based on multiple splits of 1 attribute.
  – Showed that this gives comparable performance to fancier methods on many datasets.
Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
  – Yes, but you need to "re-discover" rules with less data.
• Consider that you are allergic to milk (and drink it often), and also get sick when you (rarely) combine diet coke with mentos.
• Greedy method would first split on milk (helps accuracy the most).
• Non-greedy method could get a simpler tree (split on milk later).
Decision Trees with Probabilistic Predictions
• Often, we'll have multiple 'y' values at each leaf node.
• In these cases, we might return probabilities instead of a label.
• E.g., if in the leaf node we have 5 "sick" examples and 1 "not sick":
  – Return p(y = "sick" | xi) = 5/6 and p(y = "not sick" | xi) = 1/6.
• In general, a natural estimate of the probabilities at the leaf nodes:
  – Let 'nk' be the number of examples that arrive at leaf node 'k'.
  – Let 'nkc' be the number of times (y == c) among the examples at leaf node 'k'.
  – The maximum likelihood estimate for this leaf is p(y = c | xi) = nkc/nk.
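• A tiny Julia sketch of this estimate at one leaf (illustrative):

  # Maximum likelihood estimate of p(y = c | leaf) from the labels at a leaf.
  leaf_labels = [1, 1, 1, 1, 1, 0]   # e.g., 5 "sick" and 1 "not sick"
  p = Dict(c => count(==(c), leaf_labels) / length(leaf_labels)
           for c in unique(leaf_labels))
  # p[1] = 5/6 (sick), p[0] = 1/6 (not sick)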
Alternative Stopping Rules
• There are more complicated rules for deciding when *not* to split.
• Rules based on minimum sample size:
  – Don't split any nodes where the number of examples is less than some 'm'.
  – Don't split any nodes that create children with fewer than 'm' examples.
• These types of rules try to make sure that you have enough data to justify decisions.
• Alternatively, you can use a validation set (see next lecture):
  – Don't split the node if it decreases an approximation of the test accuracy.