Decision Tree Learning


ACM Student Chapter, Heritage Institute of Technology
3rd February, 2012
SIGKDD presentation by Satarupa Guha, Sudipto Banerjee and Ashish Baheti

Machine Learning
A computer program is said to learn from experience E with respect to a class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

An Example: the Checkers Learning Problem
Task T: playing checkers
Performance P: percent of games won against opponents
Experience E: gained by playing against itself

Concept Learning
Concept learning can be formulated as a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples.
Much of learning involves acquiring general concepts from specific training examples.

Representing Hypotheses
Let H be a hypothesis space. Each h belonging to H is a conjunction of literals.
Let X be a set of possible instances, each described by a set of attributes. Example: < ?, A2, A3, ?, ?, A6 >
Target function c: X -> {0, 1}
Training examples D: positive and negative examples of the target function, <x1, c(x1)>, ..., <xm, c(xm)>.

Types of Training Examples
Positive examples: training examples that satisfy the target function, i.e. for which c(x) = 1 (TRUE).
Negative examples: training examples that do not satisfy the target function, i.e. for which c(x) = 0 (FALSE).

Attribute Types
Nominal / Categorical

Ordinal

Continuous

Inductive Learning Hypothesis
Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other, unobserved examples.
A hypothesis h is said to be consistent with a set of training examples D of target concept c iff h(x) = c(x) for each training example <x, c(x)> in D.

Classification Techniques
Decision Tree based methods
Rule-based methods
Memory based reasoning
Neural Networks
Naive Bayes and Bayesian Belief Networks
Support Vector Machines

Decision Tree
The goal is to create a model that predicts the value of a target variable based on several input variables.

Decision Tree Representation
Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.

A quick recap
CNF = Conjunctive Normal Form
DNF = Disjunctive Normal Form

Disjunctive Normal Form
In Boolean algebra, a formula is in DNF if it is a disjunction of clauses, where a clause is a conjunction of literals. Also known as Sum of Products.
Example: (A ^ B ^ C) V (B ^ C)

Conjunctive Normal Form
In Boolean algebra, a formula is in CNF if it is a conjunction of clauses, where a clause is a disjunction of literals. Also known as Product of Sums.
Example: (A V B V C) ^ (B V C)

Decision Tree: contd.
Decision trees represent a disjunction (OR) of conjunctions (AND) of constraints on the attribute values of instances.

Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.

Hence, a decision tree represents a DNF.

Attribute Splitting
2-way split
Multi-way split

Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as there are distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
Binary split: divide the values into two subsets and find the optimal partitioning, e.g. CarType -> {Family, Luxury} vs {Sports}, or {Sports, Luxury} vs {Family}.

Splitting Based on Ordinal Attributes
Multi-way split: Size -> {Small}, {Medium}, {Large}.
Binary split: Size -> {Medium, Large} vs {Small}, or {Small, Medium} vs {Large}; the grouping must respect the ordering.

Splitting Based on Continuous Attributes
Binary split: a threshold test such as Taxable Income > 80K? (Yes / No).
Multi-way split: discretize into ranges, e.g. < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.

Example of a Decision Tree
Training data: records with the categorical attributes Refund and Marital Status, the continuous attribute Taxable Income, and the class label Cheat (see the table at the end of this transcript).
Model (decision tree): Refund = Yes -> NO; Refund = No -> test MarSt; MarSt = Married -> NO; MarSt = Single or Divorced -> test TaxInc; TaxInc < 80K -> NO; TaxInc > 80K -> YES.
The tree, once induced from the training data, is applied to unseen records (the DT classification task).
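As a side note, the model above can be written down directly in code. The sketch below is not part of the original slides: the attribute names Refund, MarSt and the TaxInc>80K test follow the slide, while the nested-dict layout and the classify helper are illustrative assumptions.

```python
# Illustrative sketch only: the Refund / MarSt / TaxInc tree from the slide,
# written as a nested dict.  Internal nodes test an attribute; leaves are labels.
tree = {
    "attribute": "Refund",
    "branches": {
        "Yes": "NO",
        "No": {
            "attribute": "MarSt",
            "branches": {
                "Married": "NO",
                "Single":   {"attribute": "TaxInc>80K",
                             "branches": {"Yes": "YES", "No": "NO"}},
                "Divorced": {"attribute": "TaxInc>80K",
                             "branches": {"Yes": "YES", "No": "NO"}},
            },
        },
    },
}

def classify(record, node):
    """Follow one root-to-leaf path, taking the branch for each attribute value."""
    while isinstance(node, dict):          # leaves are plain strings
        node = node["branches"][record[node["attribute"]]]
    return node

# A record with Refund = No, MarSt = Single and taxable income above 80K
print(classify({"Refund": "No", "MarSt": "Single", "TaxInc>80K": "Yes"}, tree))  # YES
```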

Measures of Node Impurity
Entropy
GINI Index
Misclassification Error

Entropy
Entropy characterizes the impurity of an arbitrary collection of examples; it is a measure of randomness.
Entropy(S) = -p+ log2 p+ - p- log2 p-
where S is a collection containing positive and negative examples of some target concept, p+ is the proportion of positive examples in S, and p- is the proportion of negative examples in S.

An Example of Entropy
Let S be a collection of 14 examples, with 9 positive and 5 negative examples, denoted [9+, 5-].
Then Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940.

More on Entropy
In general, Entropy = 0 if all members belong to the same class; Entropy = 1 if the collection contains equal numbers of positive and negative examples; otherwise Entropy lies between 0 and 1.
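The entropy formula is easy to check numerically. The short sketch below is my addition (not from the slides); it computes the entropy of a two-class collection and reproduces the [9+, 5-] example as well as the boundary cases.

```python
from math import log2

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p- for a collection with
    `pos` positive and `neg` negative examples (0 * log2 0 is taken as 0)."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)

print(round(entropy(9, 5), 3))   # 0.94  -- the [9+, 5-] example
print(entropy(7, 7))             # 1.0   -- equal numbers of positives and negatives
print(entropy(14, 0))            # 0.0   -- all members in the same class
```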

GINI Index
The GINI index for a given node t is
GINI(t) = 1 - sum over classes j of [p(j | t)]^2
where p(j | t) is the relative frequency of class j at node t.
It is maximum (1 - 1/nc, for nc classes) when records are equally distributed among all classes, implying the least interesting information, and minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for Computing GINI
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1. Gini = 1 - 0^2 - 1^2 = 0.
P(C1) = 1/6, P(C2) = 5/6. Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278.
P(C1) = 2/6, P(C2) = 4/6. Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444.
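The three GINI examples can be verified with a few lines of code. This is a small check I have added (not from the slides); gini takes the list of class counts at a node.

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, given the class counts at node t."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([0, 6]))              # 0.0   -- all records in one class (minimum)
print(round(gini([1, 5]), 3))    # 0.278
print(round(gini([2, 4]), 3))    # 0.444
print(gini([3, 3]))              # 0.5   -- equally distributed, 1 - 1/nc for nc = 2
```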

Splitting Based on GINI
When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = sum over i = 1..k of (ni / n) * GINI(i)
where ni is the number of records at child i and n is the number of records at node p.

Binary Attributes: Computing the GINI Index
A binary attribute test B? splits the node into two partitions, N1 and N2. The effect of weighting the partitions is that larger and purer partitions are sought.
With class counts N1 = [5, 2] and N2 = [1, 4]:
GINI(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
GINI(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
GINI(Children) = (7/12) * 0.408 + (5/12) * 0.320 = 0.371
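The weighted child GINI for this binary split can be recomputed as below. The class counts [5, 2] for N1 and [1, 4] for N2 are read off the fractions used on the slide; the helper names are mine.

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(child i) over the partitions."""
    n = sum(sum(counts) for counts in children)
    return sum(sum(counts) / n * gini(counts) for counts in children)

print(round(gini([5, 2]), 3))                   # 0.408  -- GINI(N1)
print(round(gini([1, 4]), 3))                   # 0.32   -- GINI(N2)
print(round(gini_split([[5, 2], [1, 4]]), 3))   # 0.371  -- GINI(Children)
```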

Categorical Attributes: Computing the Gini Index
For each distinct value, gather the counts for each class in the dataset, then use the count matrix to decide on the split (the CarType count matrices at the end of this transcript illustrate this).

Multi-way split
Two-way split (find the best partition of values)

A Set of Training Examples
Day   Outlook    Humidity   Wind     Play tennis
D1    sunny      high       weak     no
D2    sunny      high       strong   no
D3    overcast   high       weak     yes
D4    rain       high       weak     yes
D5    rain       normal     weak     yes
D6    rain       normal     strong   no
D7    overcast   normal     strong   yes
D8    sunny      high       weak     no
D9    sunny      normal     weak     yes
D10   rain       normal     weak     yes

Decision Tree Learning Algorithms
Variations of a core algorithm that employs a top-down, greedy search through the space of possible decision trees.

Examples are Hunt's Algorithm, CART, ID3, C4.5, SLIQ, SPRINT, and MARS.

Algorithm ID3
ID3 is a greedy algorithm that grows the tree top-down.
It begins with the question "Which attribute should be tested at the root of the tree?"
A statistical property called information gain is used to answer it.

Information Gain
Information gain is the expected reduction in entropy caused by partitioning the examples according to a particular attribute. The gain of an attribute A relative to a collection of examples S is defined as

Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)
where Values(A) is the set of all possible values of attribute A, and Sv is the subset of S for which attribute A has value v.

Information Gain: contd.
Gain(S, A) is the information provided about the target function value, given the value of some other attribute A.
Example: S is a collection described by attributes including Wind, which can have the values Weak or Strong. Assume S has 14 examples.

Then S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-].

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_weak) - (6/14) Entropy(S_strong)
             = 0.940 - (8/14)(0.811) - (6/14)(1.000)
             = 0.048
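The same Gain(S, Wind) value can be checked numerically; the sketch below is my addition and just encodes the formula and the [9+, 5-] / [6+, 2-] / [3+, 3-] counts from the example.

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)

def gain(parent, subsets):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    n = sum(parent)
    return entropy(*parent) - sum(sum(s) / n * entropy(*s) for s in subsets)

# S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-]
print(round(gain((9, 5), [(6, 2), (3, 3)]), 3))   # 0.048
```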

Play Tennis Example: Revisited
Day   Outlook    Humidity   Wind     Play tennis
D1    sunny      high       weak     no
D2    sunny      high       strong   no
D3    overcast   high       weak     yes
D4    rain       high       weak     yes
D5    rain       normal     weak     yes
D6    rain       normal     strong   no
D7    overcast   normal     strong   yes
D8    sunny      high       weak     no
D9    sunny      normal     weak     yes
D10   rain       normal     weak     yes

Application of ID3 to Play Tennis
There are 3 attributes: Outlook, Humidity and Wind.
We need to choose one of them as the root of the tree. We make this choice based on the information gain (IG) of each attribute; the one with the highest IG becomes the root.

The calculations are shown below.

Quick Recap of Formulae
Entropy: p+ log2(1/p+) + p- log2(1/p-)
Information Gain: Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)
where S is the collection, A is a particular attribute, Values(A) is the set of all possible values of A, and Sv is the subset of S for which A has value v.

Calculations: For Outlook
The training set has 6 positive and 4 negative examples, hence Entropy(S) = (4/10) lg(10/4) + (6/10) lg(10/6) = 0.970.
Outlook can take 3 values: sunny [1+, 3-], rain [3+, 1-], overcast [2+, 0-].
Entropy(sunny) = (1/4) lg 4 + (3/4) lg(4/3) = 0.811
Entropy(rain) = (3/4) lg(4/3) + (1/4) lg 4 = 0.811
Entropy(overcast) = (2/2) lg(2/2) = 0
Sv/S for each value: sunny = 4/10 (4 out of 10 examples have Outlook = sunny), rain = 4/10, overcast = 2/10.
Hence IG(Outlook) = 0.970 - (4/10 * 0.811 + 4/10 * 0.811 + 2/10 * 0) ≈ 0.322.

Calculations: For Humidity
The training set has 6 positive and 4 negative examples, hence Entropy(S) = 0.970.
Humidity can take 2 values: High [2+, 3-] and Normal [4+, 1-].
Entropy(High) = (3/5) lg(5/3) + (2/5) lg(5/2) = 0.970
Entropy(Normal) = (1/5) lg 5 + (4/5) lg(5/4) = 0.722
Sv/S for High = 5/10, for Normal = 5/10.
Hence IG(Humidity) = 0.970 - (5/10 * 0.970 + 5/10 * 0.722) ≈ 0.125.
Similarly, for Wind the IG is 0.091.
Hence IG(Outlook) ≈ 0.322, IG(Humidity) ≈ 0.125, IG(Wind) ≈ 0.091.
Comparing the IGs of the 3 attributes, we find that Outlook has the highest IG.
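These hand calculations can be double-checked directly against the 10-example table. The sketch below is my own check (not part of the slides); it recomputes the information gain of each attribute and confirms that Outlook has the highest gain.

```python
from math import log2

# The 10 training examples from the table above: (outlook, humidity, wind, play)
data = [
    ("sunny", "high", "weak", "no"),     ("sunny", "high", "strong", "no"),
    ("overcast", "high", "weak", "yes"), ("rain", "high", "weak", "yes"),
    ("rain", "normal", "weak", "yes"),   ("rain", "normal", "strong", "no"),
    ("overcast", "normal", "strong", "yes"), ("sunny", "high", "weak", "no"),
    ("sunny", "normal", "weak", "yes"),  ("rain", "normal", "weak", "yes"),
]
ATTRS = {"outlook": 0, "humidity": 1, "wind": 2}

def entropy(rows):
    total = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return sum(-c / total * log2(c / total) for c in counts.values())

def info_gain(rows, attr):
    i = ATTRS[attr]
    remainder = 0.0
    for value in set(r[i] for r in rows):
        subset = [r for r in rows if r[i] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for attr in ATTRS:
    print(attr, round(info_gain(data, attr), 3))
# outlook 0.322, humidity 0.125, wind 0.091  -> Outlook has the highest gain
```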

Partially Formed Tree
Hence Outlook is chosen as the root of the decision tree. The partially formed tree is:
Outlook -> Sunny [1+, 3-], Overcast [2+] (leaf: yes), Rain [3+, 1-].

Further Calculations
Since the sunny and rain branches contain both positive and negative examples, they still have a fair degree of randomness and need to be classified further.
For sunny: as computed earlier, Entropy(sunny) = 0.811. We now look at the Humidity and Wind values of the training examples that have Outlook = sunny:

Day   Outlook   Humidity   Wind     Play tennis
D1    sunny     high       weak     no
D2    sunny     high       strong   no
D8    sunny     high       weak     no
D9    sunny     normal     weak     yes

For Humidity:
(Sv/S) * Entropy(high) = (3/4) * 0 = 0
(Sv/S) * Entropy(normal) = (1/4) * 0 = 0
These are zero because there is no randomness: all the examples with Humidity = high have Play tennis = no, and those with Humidity = normal have Play tennis = yes.
IG(S_sunny, Humidity) = 0.811 - 0 = 0.811

For Wind:
(Sv/S) * Entropy(weak) = (3/4) * ((2/3) lg(3/2) + (1/3) lg 3) = 0.689
(Sv/S) * Entropy(strong) = (1/4) * 0 = 0
IG(S_sunny, Wind) = 0.811 - 0.689 = 0.122
Clearly, Humidity has the higher IG, so Humidity is tested under the sunny branch:
Outlook -> Sunny -> Humidity (High -> No, Normal -> Yes); Overcast -> Yes; Rain [3+, 1-] still to be split.

Further Calculations
Now for Rain [3+, 1-]:

Day   Outlook   Humidity   Wind     Play tennis
D4    rain      high       weak     yes
D5    rain      normal     weak     yes
D6    rain      normal     strong   no
D10   rain      normal     weak     yes

Entropy(rain) = (3/4) lg(4/3) + (1/4) lg 4 = 0.811
Checking the IG of Wind:
Entropy(weak) = (3/3) lg(3/3) = 0
Entropy(strong) = lg 1 = 0
Hence IG(S_rain, Wind) = 0.811 - 0 - 0 = 0.811
Checking the IG of Humidity:
Entropy(High) = 1 * lg 1 = 0
Entropy(Normal) = (1/3) lg 3 + (2/3) lg(3/2) = 0.918
Hence IG(S_rain, Humidity) = 0.811 - (1/4) * 0 - (3/4) * 0.918 = 0.122
So Wind is tested under the rain branch.

Final Decision Tree
Outlook = Sunny -> test Humidity (High -> NO, Normal -> YES); Outlook = Overcast -> YES; Outlook = Rain -> test Wind (Weak -> YES, Strong -> NO).

Play Tennis: contd.
This decision tree corresponds to the following expression:
(Outlook = Sunny ^ Humidity = Normal) V (Outlook = Overcast) V (Outlook = Rain ^ Wind = Weak)
As we can see, this is a disjunction of conjunctions (DNF).
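For completeness, the whole ID3 run on the Play Tennis table can be written as a short recursive function. This is a sketch of my own (the slides give no code); on this data it reproduces exactly the tree above: Outlook at the root, Humidity under sunny, Wind under rain.

```python
from collections import Counter
from math import log2

data = [  # (outlook, humidity, wind, play) -- the 10-example table above
    ("sunny", "high", "weak", "no"),     ("sunny", "high", "strong", "no"),
    ("overcast", "high", "weak", "yes"), ("rain", "high", "weak", "yes"),
    ("rain", "normal", "weak", "yes"),   ("rain", "normal", "strong", "no"),
    ("overcast", "normal", "strong", "yes"), ("sunny", "high", "weak", "no"),
    ("sunny", "normal", "weak", "yes"),  ("rain", "normal", "weak", "yes"),
]

def entropy(rows):
    total = len(rows)
    return sum(-c / total * log2(c / total)
               for c in Counter(r[-1] for r in rows).values())

def info_gain(rows, index):
    remainder = 0.0
    for value in set(r[index] for r in rows):
        subset = [r for r in rows if r[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attributes):
    """Greedy top-down growth: pick the attribute with the highest gain,
    split, and recurse until the node is pure or no attributes remain."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:                       # pure node -> leaf
        return labels[0]
    if not attributes:                              # no attribute left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, attributes[a]))
    i = attributes[best]
    rest = {a: j for a, j in attributes.items() if a != best}
    return {best: {v: id3([r for r in rows if r[i] == v], rest)
                   for v in set(r[i] for r in rows)}}

print(id3(data, {"outlook": 0, "humidity": 1, "wind": 2}))
# {'outlook': {'sunny': {'humidity': {'high': 'no', 'normal': 'yes'}},
#              'overcast': 'yes',
#              'rain': {'wind': {'weak': 'yes', 'strong': 'no'}}}}  (key order may vary)
```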

Features of ID3
Maintains only a single current hypothesis as it searches the space of decision trees; no backtracking at any step.
Uses all training examples at each step in the search.

Inductive Bias in Decision Tree Learning
Inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.

Roughly, the ID3 search strategy:
Selects in favor of shorter trees over longer ones.

Selects trees that place attributes with highest information gain closest to the root.

ID3 employs preference bias.

Occam's Razor
Prefer the simplest hypothesis that fits the data.
Justification: there are fewer short (hence simple) hypotheses than long ones, so it is less likely that a short hypothesis will coincidentally fit the training data.
So we might believe a 5-node tree is less likely to be a statistical coincidence, and prefer it over a 500-node hypothesis.

DT Correctness Evaluation
A few terms:
ACCURACY
ERROR RATE
PRECISION
RECALL

An Example:

SUPPOSE THERE ARE 2 CLASSES:

                          PREDICTED CLASS
                          CLASS 1   CLASS 0
ACTUAL CLASS   CLASS 1      f11       f10
               CLASS 0      f01       f00

Let us assume that class 1 is +ve and class 0 is -ve.
f11 means a +ve example predicted as +ve.
f10 means a +ve example predicted as -ve.
f01 means a -ve example predicted as +ve.
f00 means a -ve example predicted as -ve.

Meanings of the Terms
Hence f11 and f00 are the cases that have been accurately predicted, and f10 and f01 are the cases predicted with an error.

ACCURACY = (f11 + f00) / (f11 + f10 + f01 + f00)
ERROR RATE = (f01 + f10) / (f11 + f10 + f01 + f00)
Clearly, accuracy + error rate = 1.
PRECISION = f11 / (f11 + f01)
RECALL = f11 / (f11 + f10)
Example: let there be 8 batsmen present. A prediction is made that there are 7 batsmen. It is found that of these 7, 5 are actually batsmen and 2 are bowlers. So precision = 5/7 and recall = 5/8.
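A quick numeric check of the batsmen example (my addition): from 8 actual batsmen, 7 are predicted as batsmen and 5 of those are correct, so f11 = 5, f10 = 3, f01 = 2. The number of correctly predicted bowlers (f00) is not given in the example, so it is set to 0 below just to make the call runnable; it does not affect precision or recall.

```python
def metrics(f11, f10, f01, f00):
    """Accuracy, error rate, precision and recall from the confusion counts."""
    total = f11 + f10 + f01 + f00
    return {
        "accuracy":   (f11 + f00) / total,
        "error rate": (f01 + f10) / total,
        "precision":  f11 / (f11 + f01),
        "recall":     f11 / (f11 + f10),
    }

m = metrics(f11=5, f10=3, f01=2, f00=0)   # f00 assumed 0 (not given in the example)
print(round(m["precision"], 3))   # 0.714 = 5/7
print(round(m["recall"], 3))      # 0.625 = 5/8
```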

Over-fitting
A hypothesis over-fits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances.
Formally: given a hypothesis space H, a hypothesis h in H is said to over-fit the training data if there exists some alternative hypothesis h' in H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.

(Figure: solid line shows accuracy over the training data; broken line shows accuracy over an independent set of test examples not included in the training data.)

Causes of Over-fitting
The training examples contain random errors / noise.
The training data consists of a small number of examples.

Ways to Avoid Over-fitting
An approach that stops growing the tree earlier, before it reaches the point where it perfectly classifies the training data.

An approach that first over-fits the data and then post-prunes the tree.

How to Determine the Correct Final Size of the Tree
Use a separate set of examples, called the validation set, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.
Alternatively, use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set.

Pruning the Decision Tree
There are several methods by which a decision tree can be pruned:
Reduced Error Pruning
Rule Post Pruning

Reduced Error Pruning
Consider each of the decision nodes in the tree to be a candidate for pruning.
Pruning a node consists of removing the sub-tree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree's accuracy over the validation set; a sketch of this loop is given below.
Pruning continues until further pruning would reduce the accuracy of the decision tree over the validation set.

(Figure: Reduced Error Pruning performance.)
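A minimal sketch of the reduced-error-pruning loop described above (my own illustration, not code from the presentation). It assumes a simple Node class in which internal nodes store the tested attribute, a branch dictionary and the training examples that reached them, and a validation set given as (record, label) pairs; all names are hypothetical.

```python
from collections import Counter

class Node:
    def __init__(self, attribute=None, branches=None, label=None, examples=()):
        self.attribute = attribute        # attribute tested at an internal node
        self.branches = branches or {}    # attribute value -> child Node
        self.label = label                # class label if this node is a leaf
        self.examples = list(examples)    # (record, label) pairs that reached this node

def predict(tree, record):
    node = tree
    while node.label is None:
        node = node.branches[record[node.attribute]]
    return node.label

def accuracy(tree, validation):
    return sum(predict(tree, x) == y for x, y in validation) / len(validation)

def internal_nodes(node):
    if node.label is None:
        yield node
        for child in node.branches.values():
            yield from internal_nodes(child)

def make_leaf(node):
    """Replace the sub-tree rooted at node by a leaf labelled with the most
    common classification of the training examples affiliated with it."""
    node.label = Counter(y for _, y in node.examples).most_common(1)[0][0]
    node.attribute, node.branches = None, {}

def reduced_error_prune(tree, validation):
    """Repeatedly prune the node whose removal most improves validation
    accuracy; stop when every remaining candidate would reduce it."""
    while True:
        base = accuracy(tree, validation)
        best_node, best_acc = None, base
        for node in list(internal_nodes(tree)):
            saved = (node.attribute, node.branches, node.label)
            make_leaf(node)                                     # trial prune
            acc = accuracy(tree, validation)
            node.attribute, node.branches, node.label = saved   # undo the trial
            if acc >= best_acc:
                best_node, best_acc = node, acc
        if best_node is None:
            return tree
        make_leaf(best_node)
```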

Rule Post Pruning
The steps of rule post pruning are:

Infer the decision tree from the training set, and allow over-fitting to occur.

Convert the learned tree into an equivalent set of rules by creating one rule for each path.

Prune each rule by removing any preconditions whose removal improves its estimated accuracy.

Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

Post Pruning (contd.)
In rule post-pruning, one rule is generated for each leaf node in the tree.

Each attribute test along the path from the root to the leaf becomes a rule antecedent (pre-condition) and the classification at the leaf node becomes the rule consequent (post-condition).

Let us consider an example.
Step 1: Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over-fitting to occur.

Step 2: Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.

Step 3: Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.


Step 4: Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

Post Pruning for the Binary Case
For any node S that is not a leaf, with children S1, S2, ..., Sm, we can calculate
BackUpError(S) = sum over i of Pi * Error(Si)
Error(S) = MIN{ E(S), BackUpError(S) }
where Pi = (number of examples in Si) / (number of examples in S), E(S) is the estimated error of S treated as a leaf, and for leaf nodes Si, Error(Si) = E(Si).
Decision: prune at S if BackUpError(S) >= E(S), i.e. if backing up the children's errors does not improve on treating S as a leaf.
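As a worked check of these formulas (my addition): the error values quoted in the example below are consistent with E(S) being the Laplace error estimate (N - n + 1) / (N + 2) for a node with N examples, n of them in the majority class -- that choice is an inference from the numbers, not something stated in the transcript. The sketch evaluates the pruning decision at node b = [4, 2], whose children here are the leaves [3, 2] and [1, 0].

```python
def laplace_error(yes, no):
    """E(S), assumed to be the Laplace estimate (N - n + 1) / (N + 2),
    with N = total examples at the node and n = size of the majority class."""
    total, majority = yes + no, max(yes, no)
    return (total - majority + 1) / (total + 2)

def backup_error(children):
    """BackUpError(S) = sum_i P_i * Error(S_i); children are given as (yes, no)
    leaf counts, so Error(S_i) = E(S_i)."""
    n = sum(yes + no for yes, no in children)
    return sum((yes + no) / n * laplace_error(yes, no) for yes, no in children)

e_b = laplace_error(4, 2)                    # 0.375
bu_b = backup_error([(3, 2), (1, 0)])        # ~0.413
print(e_b, round(bu_b, 3), bu_b >= e_b)      # BackUpError >= E(b), so prune at b
```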

Example of Post Pruning
[x, y] means x YES cases and y NO cases. Before pruning, the tree is: root a [6, 4] with children b [4, 2] and c [2, 2]; b has leaf children [3, 2] and [1, 0]; c has children d [1, 2] and the leaf [1, 0]; d has leaf children [1, 1] and [0, 1].
Working bottom-up: BackUpError(d) = 0.444 >= E(d) = 0.4, so the sub-tree below d is pruned; BackUpError(b) = 0.413 >= E(b) = 0.375, so the sub-tree below b is pruned; BackUpError(c) = 0.383 < E(c) = 0.5 and BackUpError(a) = 0.378 < E(a) = 0.417, so c and a are kept. (PRUNE means cut the sub-tree below that point.)

Result of Pruning
After pruning, the tree is: a [6, 4] with children [4, 2] (now a leaf) and c [2, 2]; c has leaf children [1, 2] and [1, 0].

Advantages of DT
Simple to understand and interpret.

Requires little data preparation.
Able to handle both numerical and categorical data.
Possible to validate a model using statistical tests.

Robust

Performs well on large data in a short time.

Limitations of DT
The problem of learning an optimal decision tree is known to be NP-complete.

Prone to over-fitting.

There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.

For data that includes categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels.

Training data for the decision-tree example (Refund / Marital Status / Taxable Income / Cheat):
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Tree Induction
(Diagram: the tree-induction algorithm builds a model from the training set (induction); the model is then applied to the test set (deduction).)

Count tables for the GINI examples:
C1 = 3, C2 = 3: Gini = 0.500
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
Parent node of the binary-split example: C1 = 6, C2 = 6, Gini = 0.500

CarType count matrices for the categorical-attribute Gini example:
Two-way split {Sports, Luxury} vs {Family}: C1 = 3 / 1, C2 = 2 / 4, Gini = 0.400
Two-way split {Sports} vs {Family, Luxury}: C1 = 2 / 2, C2 = 1 / 5, Gini = 0.419
Multi-way split Family / Sports / Luxury: C1 = 1 / 2 / 1, C2 = 4 / 1 / 1, Gini = 0.393