CPSC 340: Machine Learning and Data Mining
Decision Trees (Fall 2020)
Admin
• Assignment 1 is due Friday: start early.
• Waiting list people: you should be registered soon-ish.
  – Start on the assignment now, everybody currently on the waiting list will get in.
• Gradescope:
  – Use access code 9PX4B9.
  – You must use your [email protected] email alias (as listed in https://www.cs.ubc.ca/getacct) to identify yourself on Gradescope.
  – Failure to properly identify yourself will result in a zero for all homework and exam submissions made under a different identity.
• Course webpage: https://www.cs.ubc.ca/~fwood/CS340/
  – Sign up for Piazza.
• Tutorials and office hours have already started (see webpage for calendar).
Last Time: Data Representation and Exploration
• We discussed the example-feature representation:
  – Samples: another name we'll use for examples.
• We discussed summary statistics and visualizing data.
  Age  Job?  City  Rating  Income
  23   Yes   Van   A       22,000.00
  23   Yes   Bur   BBB     21,000.00
  22   No    Van   CC      0.00
  25   Yes   Sur   AAA     57,000.00
Last Time: Supervised Learning
• We discussed supervised learning:
  – Input for an example (a day of the week) is a set of features (quantities of food).
  – Output is a desired class label (whether or not we got sick).
• Goal of supervised learning:
  – Use data to find a model that outputs the right label based on the features.
  – Above, the model predicts whether foods will make you sick (even with new combinations).
  – This framework can be applied to any problem where we have input/output examples.
  Egg  Milk  Fish  Wheat  Shellfish  Peanuts  …  Sick?
  0    0.7   0     0.3    0          0           1
  0.3  0.7   0     0.6    0          0.01        1
  0    0     0     0.8    0          0           0
  0.3  0.7   1.2   0      0.10       0.01        1
  0.3  0     1.2   0.3    0.10       0.01        1
Decision Trees
• Decision trees are simple programs consisting of:
  – A nested sequence of "if-else" decisions based on the features (splitting rules).
  – A class label as a return value at the end of each sequence.
• Example decision tree:

  if (milk > 0.5) {
      return "sick"
  } else {
      if (egg > 1) return "sick"
      else return "not sick"
  }

Can draw sequences of decisions as a tree:
Supervised Learning as Writing a Program
• There are many possible decision trees.
  – We're going to search for one that is good at our supervised learning problem.
• So our input is data and the output will be a program.
  – This is called "training" the supervised learning model.
  – Different than the usual input/output specification for writing a program.
• Supervised learning is useful when you have lots of labeled data BUT:
  1. Problem is too complicated to write a program ourselves.
  2. Human expert can't explain why you assign certain labels.
  OR
  2. We don't have a human expert for the problem.
Learning a Decision Stump: "Search and Score"
• We'll start with "decision stumps":
  – Simple decision tree with 1 splitting rule based on thresholding 1 feature.
• How do we find the best "rule" (feature, threshold, and leaf labels)?
  1. Define a 'score' for the rule.
  2. Search for the rule with the best score.
Learning a Decision Stump: Accuracy Score
• Most intuitive score: classification accuracy.
  – "If we use this rule, how many examples do we label correctly?"
• Computing classification accuracy for (egg > 1) on the data below:
  – Find the most common labels if we use this rule:
    • When (egg > 1), we were "sick" 2 times out of 2.
    • When (egg ≤ 1), we were "not sick" 3 times out of 4.
  – Compute accuracy:
    • The accuracy ("score") of the rule (egg > 1) is 5 times out of 6.
• This "score" evaluates the quality of a rule.
  – We "learn" a decision stump by finding the rule with the best score.
  Milk  Fish  Egg  Sick?
  0.7   0     1    1
  0.7   0     2    1
  0     0     0    0
  0.7   1.2   0    0
  0     1.2   2    1
  0     0     0    0
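• A quick way to check this score, sketched in Julia (illustrative variable names; labels coded 1 = "sick", 0 = "not sick"):

  # Score the rule (egg > 1) on the 6 examples above.
  egg = [1, 2, 0, 0, 2, 0]   # egg column
  y   = [1, 1, 0, 0, 1, 0]   # labels
  satisfied = egg .> 1       # which examples satisfy the rule
  # Most common label on each side of the split:
  yes_label = sum(y[satisfied]) >= sum(satisfied) / 2 ? 1 : 0
  no_label  = sum(y[.!satisfied]) >= sum(.!satisfied) / 2 ? 1 : 0
  # Accuracy: number of examples the two leaves label correctly.
  correct = sum(y[satisfied] .== yes_label) + sum(y[.!satisfied] .== no_label)
  println(correct, "/", length(y))   # prints 5/6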
Learning a Decision Stump: By Hand
• Let's search for the decision stump maximizing classification score on this data:

  Milk  Fish  Egg  Sick?
  0.7   0     1    1
  0.7   0     2    1
  0     1.2   0    0
  0.7   1.2   0    0
  0     1.3   2    1
  0     0     0    0

• First we check the "baseline rule" of predicting the mode (no split): this gets 3/6 accuracy.
• If (milk > 0) predict "sick" (2/3) else predict "not sick" (2/3): 4/6 accuracy.
• If (fish > 0) predict "not sick" (2/3) else predict "sick" (2/3): 4/6 accuracy.
• If (fish > 1.2) predict "sick" (1/1) else predict "not sick" (3/5): 4/6 accuracy.
• If (egg > 0) predict "sick" (3/3) else predict "not sick" (3/3): 6/6 accuracy.
• If (egg > 1) predict "sick" (2/2) else predict "not sick" (3/4): 5/6 accuracy.
• Highest-scoring rule: (egg > 0) with leaves "sick" and "not sick".
• Notice we only need to test feature thresholds that happen in the data:
  – There is no point in testing the rule (egg > 3), it gets the "baseline" score.
  – There is no point in testing the rule (egg > 0.5), it gets the (egg > 0) score.
  – Also note that we don't need to test "<", since it would give equivalent rules.
Supervised Learning Notation (MEMORIZE THIS)
• Feature matrix 'X' has rows as examples, columns as features.
  – xij is feature 'j' for example 'i' (quantity of food 'j' on day 'i').
  – xi is the list of all features for example 'i' (all the quantities on day 'i').
  – xj is column 'j' of the matrix (the value of feature 'j' across all examples).
• Label vector 'y' contains the labels of the examples.
  – yi is the label of example 'i' (1 for "sick", 0 for "not sick").

  Egg  Milk  Fish  Wheat  Shellfish  Peanuts   Sick?
  0    0.7   0     0.3    0          0         1
  0.3  0.7   0     0.6    0          0.01      1
  0    0     0     0.8    0          0         0
  0.3  0.7   1.2   0      0.10       0.01      1
  0.3  0     1.2   0.3    0.10       0.01      1
Supervised Learning Notation (MEMORIZE THIS)
• Training phase:
  – Use 'X' and 'y' to find a 'model' (like a decision stump).
• Prediction phase:
  – Given an example xi, use the 'model' to predict a label 'ŷi' ("sick" or "not sick").
• Training error:
  – Fraction of times our prediction ŷi does not equal the true label yi.
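• The notation, written out in Julia for the table above (a toy illustration):

  # Feature matrix 'X': rows are examples (days), columns are features (foods).
  X = [0    0.7  0    0.3  0    0
       0.3  0.7  0    0.6  0    0.01
       0    0    0    0.8  0    0
       0.3  0.7  1.2  0    0.10 0.01
       0.3  0    1.2  0.3  0.10 0.01]
  y = [1, 1, 0, 1, 1]   # label vector: 1 = "sick", 0 = "not sick"

  X[3, 4]   # xij with i = 3, j = 4: quantity of wheat on day 3 (0.8)
  X[2, :]   # xi with i = 2: all features for example 2
  X[:, 1]   # column 1: the egg feature across all examples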
Decision Stump Learning Pseudo-Code
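• A minimal Julia sketch of the search-and-score loop (an illustration under the assumptions above, not the actual course code):

  # Most common label in 'y' (ties broken arbitrarily).
  mode_label(y) = argmax(c -> count(==(c), y), unique(y))

  # Fit a decision stump by "search and score" with the accuracy score.
  # X: n-by-d feature matrix, y: length-n label vector. Cost is O(ndk).
  function fit_stump(X, y)
      n, d = size(X)
      baseline = mode_label(y)                 # the "do nothing" baseline rule
      best = (score = count(==(baseline), y), feature = 0, threshold = -Inf,
              yes = baseline, no = baseline)
      for j in 1:d
          for t in unique(X[:, j])             # only thresholds that occur in the data
              sat = X[:, j] .> t               # examples satisfying (feature j > t)
              (any(sat) && any(.!sat)) || continue   # skip rules that don't split
              yes, no = mode_label(y[sat]), mode_label(y[.!sat])
              score = count(==(yes), y[sat]) + count(==(no), y[.!sat])
              if score > best.score
                  best = (score = score, feature = j, threshold = t, yes = yes, no = no)
              end
          end
      end
      return best   # leaf labels 'yes'/'no' for the rule (X[:, feature] > threshold)
  end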
Cost of Decision Stumps
• How much does this cost? Assume we have:
  – 'n' examples (days that we measured).
  – 'd' features (foods that we measured).
  – 'k' thresholds (>0, >1, >2, …) for each feature.
• Computing the score of one rule costs O(n):
  – We need to go through all 'n' examples to find the most common labels.
  – We need to go through all 'n' examples again to compute the accuracy.
  – See notes on the webpage for a review of "O(n)" notation.
• We compute the score for up to k*d rules ('k' thresholds for each of 'd' features):
  – So we need to do an O(n) operation k*d times, giving a total cost of O(ndk).
Cost of Decision Stumps
• Is a cost of O(ndk) good?
• The size of the input data is O(nd):
  – If 'k' is small then the cost is roughly the same as the cost of loading the data.
    • We should be happy about this: you can learn on any dataset you can load!
  – If 'k' is large then this could be too slow for large datasets.
• Example: if all our features are binary then k = 1, just test (feature > 0):
  – Cost of fitting a decision stump is O(nd), so we can fit huge datasets.
• Example: if all our features are numerical with unique values then k = n:
  – Cost of fitting a decision stump is O(n²d).
  – We don't like having n² because we want to fit datasets where 'n' is large!
• Bonus slides: how to reduce the cost in this case down to O(nd log n).
  – Basic idea: sort features and track labels. Allows us to fit decision stumps to huge datasets.
Decision Tree Learning
• Decision stumps have only 1 rule based on only 1 feature.
  – Very limited class of models: usually not very accurate for most tasks.
• Decision trees allow sequences of splits based on multiple features.
  – Very general class of models: can get very high accuracy.
  – However, it's computationally infeasible to find the best decision tree.
• Most common decision tree learning algorithm in practice:
  – Greedy recursive splitting.
Example of Greedy Recursive Splitting
• Start with the full dataset:

  Egg  Milk  …  Sick?
  0    0.7      1
  1    0.7      1
  0    0        0
  1    0.6      1
  1    0        0
  2    0.6      1
  0    1        1
  2    0        1
  0    0.3      0
  1    0.6      0
  2    0        1

• Find the decision stump with the best score: here, a split on (milk > 0.3).
• Split into two smaller datasets based on the stump:

  Examples with (milk ≤ 0.3):

  Egg  Milk  …  Sick?
  0    0        0
  1    0        0
  2    0        1
  0    0.3      0
  2    0        1

  Examples with (milk > 0.3):

  Egg  Milk  …  Sick?
  0    0.7      1
  1    0.7      1
  1    0.6      1
  2    0.6      1
  0    1        1
  1    0.6      0
Greedy Recursive Splitting
• We now have a decision stump and two datasets (shown above).
• Fit a decision stump to each leaf's data, then add these stumps to the tree.
Greedy Recursive Splitting
• This gives a "depth 2" decision tree: it splits the two datasets into four datasets:

  Egg  Milk  …  Sick?
  0    0        0
  1    0        0
  0    0.3      0

  Egg  Milk  …  Sick?
  2    0        1
  2    0        1

  Egg  Milk  …  Sick?
  0    0.7      1
  1    0.7      1
  1    0.6      1
  2    0.6      1

  Egg  Milk  …  Sick?
  1    0.6      0
Greedy Recursive Splitting
• We could try to split the four leaves to make a "depth 3" decision tree.
• We might continue splitting until:
  – The leaves each have only one label.
  – We reach a user-defined maximum depth.
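• Greedy recursive splitting as a Julia sketch (illustrative; 'fit_stump' is the stump-fitting sketch from earlier, and both stopping rules above are included):

  # Recursively fit stumps, splitting each leaf's data, up to 'max_depth'.
  function fit_tree(X, y; depth = 1, max_depth = 3)
      stump = fit_stump(X, y)
      # Stop if no split beats the baseline (e.g., the leaf has only one
      # label) or the maximum depth is reached: the stump itself is the leaf.
      if stump.feature == 0 || depth == max_depth
          return stump
      end
      sat = X[:, stump.feature] .> stump.threshold
      # Fit a (sub)tree on each of the two smaller datasets.
      no  = fit_tree(X[.!sat, :], y[.!sat]; depth = depth + 1, max_depth = max_depth)
      yes = fit_tree(X[sat, :], y[sat]; depth = depth + 1, max_depth = max_depth)
      return (stump = stump, yes = yes, no = no)
  end

• Prediction walks the tree: at each internal node, follow the 'yes' branch if the example satisfies the node's rule, otherwise the 'no' branch, until reaching a leaf.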
Which score function should a decision tree use?
• Shouldn't we just use the accuracy score?
  – For leaves: yes, just maximize accuracy.
  – For internal nodes: not necessarily.
• Maybe no simple rule like (egg > 0.5) improves accuracy.
  – But this doesn't necessarily mean we should stop!
Example Where Accuracy Fails
• Consider a dataset with 2 features and 2 classes ('x' and 'o').
  – Because there are 2 features, we can draw 'X' as a scatterplot.
  – Colours and shapes denote the class labels 'y'.
• A decision stump would divide space by a horizontal or vertical line.
  – Testing whether xi1 > t or whether xi2 > t.
• On this dataset no horizontal/vertical line improves accuracy.
  – The baseline is 'o', but we need to get many 'o' wrong to get one 'x' right.
Which score function should a decision tree use?
• The most common score in practice is "information gain".
  – "Choose the split that decreases the entropy of the labels the most."
• Information gain for the baseline rule ("do nothing") is 0.
  – Info gain is large if labels are "more predictable" ("less random") in the next layer.
• Even if it does not increase classification accuracy at one depth, we hope that it makes classification easier at the next depth.
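• In symbols (the standard definitions): for a rule splitting the 'n' examples into groups "yes" and "no",

  \mathrm{entropy}(y) = -\sum_{c} p_c \log p_c, \qquad p_c = \frac{\#\{i : y_i = c\}}{n}

  \mathrm{infogain} = \mathrm{entropy}(y) - \frac{n_{\mathrm{yes}}}{n}\,\mathrm{entropy}(y_{\mathrm{yes}}) - \frac{n_{\mathrm{no}}}{n}\,\mathrm{entropy}(y_{\mathrm{no}})

  so the baseline rule (everything in one group) gets info gain 0, and splits that make the groups purer get larger info gain.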
Discussion of Decision Tree Learning
• Advantages:
  – Easy to implement.
  – Interpretable.
  – Learning is fast, prediction is very fast.
  – Can elegantly handle a small number of missing values during training.
• Disadvantages:
  – Hard to find the optimal set of rules.
  – Greedy splitting is often not accurate, requires very deep trees.
• Issues:
  – Can you revisit a feature?
    • Yes, knowing other information could make a feature relevant again.
  – More complicated rules?
    • Yes, but searching for the best rule gets much more expensive.
  – What is the best score?
    • Info gain is the most popular and often works well, but is not always the best.
  – What if you get new data?
    • You could consider splitting if there is enough data at the leaves, but occasionally you might want to re-learn the whole tree or sub-trees.
  – What depth?
Summary
• Supervised learning:
  – Using data to write a program based on input/output examples.
• Decision trees: predict a label using a sequence of simple rules.
• Decision stumps: simple decision trees that are very fast to fit.
• Greedy recursive splitting: uses a sequence of stumps to fit a tree.
  – Very fast and interpretable, but not always the most accurate.
• Information gain: splitting score based on decreasing entropy.
• Next time: the most important ideas in machine learning.
Entropy Function
Other Considerations for the Food Allergy Example
• What types of preprocessing might we do?
  – Data cleaning: check for and fix missing/unreasonable values.
  – Summary statistics:
    • Can help identify "unclean" data.
    • Correlation might reveal an obvious dependence ("sick" ↔ "peanuts").
  – Data transformations:
    • Convert everything to the same scale? (e.g., grams)
    • Add foods from the day before? (maybe "sick" depends on multiple days)
    • Add the date? (maybe what makes you "sick" changes over time)
  – Data visualization: look at a scatterplot of each feature and the label.
    • Maybe the visualization will show something weird in the features.
    • Maybe the pattern is really obvious!
• What you do might depend on how much data you have:
  – Very little data:
    • Represent food by common allergenic ingredients (lactose, gluten, etc.)?
  – Lots of data:
    • Use more fine-grained features (bread from bakery vs. hamburger bun)?
Julia Decision Stump Code (not O(n log n) yet)
Going from O(n²d) to O(nd log n) for Numerical Features
• Do we have to compute the score from scratch for every rule?
  – As an example, assume we eat an integer number of eggs:
    • Then the rules (egg > 1) and (egg > 2) have the same decisions, except when (egg == 2).
• We can actually compute the best rule involving 'egg' in O(n log n):
  – Sort the examples based on 'egg', and use these positions to re-arrange 'y'.
  – Go through the sorted values in order, updating the counts of #sick and #not-sick that both satisfy and don't satisfy the rule.
  – With these counts, it's easy to compute the classification accuracy (see bonus slide).
• Sorting costs O(n log n) per feature.
• The total cost of updating counts is O(n) per feature.
• The total cost is reduced from O(n²d) to O(nd log n).
• This is a good runtime:
  – O(nd) is the size of the data, so this matches the cost of loading the data up to a log factor.
  – We can apply this algorithm to huge datasets.
How do we fit stumps in O(nd log n)?
• Let's say we're trying to find the best rule involving milk:

  Egg  Milk  …  Sick?
  0    0.7      1
  1    0.7      1
  0    0        0
  1    0.6      1
  1    0        0
  2    0.6      1
  0    1        1
  2    0        1
  0    0.3      0
  1    0.6      0
  2    0        1

• First grab the milk column and sort it (using the sort positions to re-arrange the sick column). This step costs O(n log n) due to sorting:

  Milk  Sick?
  0     0
  0     0
  0     0
  0     0
  0.3   0
  0.6   1
  0.6   1
  0.6   0
  0.7   1
  0.7   1
  1     1

• Now we'll go through the milk values in order, keeping track of #sick and #not-sick that are above/below the current value. E.g., #sick above 0.3 is 5.
• With these counts, the accuracy score is (sum of most common label above and below)/n.
How do we fit stumps in O(nd log n)?
• Start with the baseline rule (), which is always "satisfied":
  – If satisfied, #sick = 5 and #not-sick = 6. If not satisfied, #sick = 0 and #not-sick = 0.
  – This gives accuracy of (6+0)/n = 6/11.
• Next try the rule (milk > 0), and update the counts based on the 4 rows with milk = 0:
  – If satisfied, #sick = 5 and #not-sick = 2. If not satisfied, #sick = 0 and #not-sick = 4.
  – This gives accuracy of (5+4)/n = 9/11, which is better.
• Next try the rule (milk > 0.3), and update the counts based on the 1 row with milk = 0.3:
  – If satisfied, #sick = 5 and #not-sick = 1. If not satisfied, #sick = 0 and #not-sick = 5.
  – This gives accuracy of (5+5)/n = 10/11, which is better.
• (and keep going until you get to the end…)
How do we fit stumps in O(nd log n)?
• Notice that for each row, updating the counts only costs O(1).
  – Since there are O(n) rows, the total cost of updating counts is O(n).
• Instead of 2 labels (sick vs. not-sick), consider the case of 'k' labels:
  – Updating the counts still costs O(n), since each row has one label.
  – But computing the 'max' across the labels costs O(k), so the cost is O(kn).
• With 'k' labels, you can decrease the cost using a "max-heap" data structure:
  – Cost of getting the max is O(1), cost of updating the heap for a row is O(log k).
  – But k <= n (each row has only one label).
  – So the cost is in O(log n) for one row.
• Since the above shows we can find the best rule in one column in O(n log n), the total cost to find the best rule across all 'd' columns is O(nd log n).
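• A Julia sketch of this sorted counting scan for one feature (illustrative names; assumes labels coded 0/1 as in the example):

  # Best accuracy score for rules (x > t) on one feature column, in O(n log n).
  function best_rule_sorted(x, y)
      order = sortperm(x)                      # O(n log n) sort
      xs, ys = x[order], y[order]
      n = length(y)
      above = [count(==(c), ys) for c in 0:1]  # label counts satisfying the rule
      below = [0, 0]                           # label counts not satisfying it
      best_score, best_threshold = maximum(above), -Inf   # baseline rule
      for i in 1:n-1
          below[ys[i] + 1] += 1                # O(1): move row i below the threshold
          above[ys[i] + 1] -= 1
          xs[i] == xs[i + 1] && continue       # only score distinct thresholds
          score = maximum(below) + maximum(above)
          if score > best_score
              best_score, best_threshold = score, xs[i]
          end
      end
      return best_score, best_threshold
  end

• On the sorted milk column above, this reproduces the running scores 6/11, 9/11, 10/11 from the example.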
Can decision trees re-visit a feature?
• Yes.
  – Knowing (ice cream > 0.3) makes small milk quantities relevant.
Can decision trees have more complicated rules?
• Yes! Rules that depend on more than one feature:
• But now searching for the best rule can get expensive.
Can decision trees have more complicated rules?
• Yes! Rules that depend on more than one threshold:
• "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets":
  – Considered decision stumps based on multiple splits of 1 attribute.
  – Showed that this gives comparable performance to fancier methods on many datasets.
Does being greedy actually hurt?
• Can't you just go deeper to correct greedy decisions?
  – Yes, but you need to "re-discover" rules with less data.
• Consider that you are allergic to milk (and drink it often), and also get sick when you (rarely) combine diet coke with mentos.
• Greedy method would first split on milk (helps accuracy the most).
• Non-greedy method could get a simpler tree (split on milk later).
Decision Trees with Probabilistic Predictions
• Often, we'll have multiple 'y' values at each leaf node.
• In these cases, we might return probabilities instead of a label.
• E.g., if in the leaf node we have 5 "sick" examples and 1 "not sick":
  – Return p(y = "sick" | xi) = 5/6 and p(y = "not sick" | xi) = 1/6.
• In general, a natural estimate of the probabilities at the leaf nodes:
  – Let 'nk' be the number of examples that arrive at leaf node 'k'.
  – Let 'nkc' be the number of times (y == c) among the examples at leaf node 'k'.
  – The maximum likelihood estimate for this leaf is p(y = c | xi) = nkc/nk.
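• A tiny Julia sketch of this estimate at one leaf (illustrative):

  # Maximum likelihood estimate of p(y = c | leaf) from the labels at a leaf.
  leaf_labels = [1, 1, 1, 1, 1, 0]   # e.g., 5 "sick" and 1 "not sick"
  p = Dict(c => count(==(c), leaf_labels) / length(leaf_labels)
           for c in unique(leaf_labels))
  # p[1] = 5/6 (sick), p[0] = 1/6 (not sick)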
Alternative Stopping Rules
• There are more complicated rules for deciding when *not* to split.
• Rules based on minimum sample size:
  – Don't split any nodes where the number of examples is less than some 'm'.
  – Don't split any nodes that create children with fewer than 'm' examples.
• These types of rules try to make sure that you have enough data to justify decisions.
• Alternatively, you can use a validation set (see next lecture):
  – Don't split the node if it decreases an approximation of the test accuracy.