Machine Learning for Data Mining
Graphic Models: Decision Trees
Andres Mendez-Vazquez
July 13, 2015
Outline
1 Introduction
    Examples
2 Decision Trees
    Decision Trees
    How they work
    Geometry
    Types of Decision Trees
3 Ordinary Binary Classification Trees
    Definition
    Training
    The Sought Criteria
    Probabilistic Impurity
    Final Algorithm
    Conclusions
An Example

We have the following tree for the question "Are we going out?":

Outlook?
    Sunny -> Humidity?
        High -> N
        Normal -> P
    Overcast -> P
    Rain -> Windy?
        True -> N
        False -> P
Another Example - Grades

Deciding the grades:

Percent >= 90%?
    Yes -> Grade = A
    No (<= 89%) -> Percent >= 80%?
        Yes -> Grade = B
        No (<= 79%) -> Percent >= 70%?
            Yes -> Grade = C
            No -> Etc...
Yet Another Example

Decision about needing glasses:

Tear Production Rate?
    Reduced -> None
    Normal -> Astigmatism?
        No -> Soft
        Yes -> Spectacle Prescription?
            Myope -> Hard
            Hypermetrope -> None
Decision Trees

Powerful/popular
For classification and prediction.

Represent rules
Rules can be expressed in English:
    IF Age <= 43 AND Sex = Male AND Credit Card Insurance = No
    THEN Life Insurance Promotion = No
Rules can also be expressed as SQL queries.

Useful
To explore data and gain insight into the relationships of a large number of
candidate input variables to a target (output) variable.
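A minimal sketch of how such a rule could be checked in code; the field names
and the default "yes" branch are assumptions made here for illustration, not
part of any particular dataset:

    # Sketch: the promotion rule above as a simple predicate.
    # Field names and the default branch are hypothetical.
    def life_insurance_promotion(record):
        if (record["age"] <= 43 and record["sex"] == "male"
                and record["credit_card_insurance"] == "no"):
            return "no"
        return "yes"  # assumed default for every other case

    print(life_insurance_promotion(
        {"age": 30, "sex": "male", "credit_card_insurance": "no"}))  # -> no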
Decision Trees - What is this?

Decision Tree
A structure that can be used to divide up a large collection of records into
successively smaller sets of records by applying a sequence of simple decision
rules.

A decision tree model
Consists of a set of rules for dividing a large heterogeneous population into
smaller, more homogeneous groups with respect to a particular target variable.
Decision Tree Types

Binary trees
Only two choices in each split. Can be non-uniform (uneven) in depth.

N-way trees or ternary trees
Three or more choices in at least one of its splits (3-way, 4-way, etc.).
Definition - Decision Trees

Decision Trees
They work like a flow chart.

Structure
Nodes
    Appear as rectangles or circles.
    Represent a test or decision.
Lines or branches - represent the outcome of a test.
Circles - terminal (leaf) nodes.

Nodes
The top or starting node is the root node.
Internal nodes are used for decisions.
Terminal nodes or leaves are the final results.
How they work

1 Decision rules - partition sample of data.
2 Terminal node (leaf) indicates the class assignment.
3 Tree partitions samples into mutually exclusive groups.
4 One group for each terminal node.
5 All paths:
    1 Start at the root node.
    2 End at a leaf.
How they work (cont...)

Each path represents a decision rule
Joining (AND) of all the tests along that path.
Separate paths that result in the same class are disjunctions (ORs).

All paths - mutually exclusive
For any one case - only one path will be followed.
False decisions on the left branch.
True decisions on the right branch.
Geometry

Something Notable
Fits shapes of decision boundaries between classes.
Classes formed by lines parallel to axes.
Result - rectangular shaped class regions.

See also: "Induction of Oblique Decision Trees".
Types of Decision Trees

Classification Trees
The predicted outcome is the class to which the data belongs.

Regression Trees
The predicted outcome can be considered a number.
Classification and Regression Trees (CART)

CART
The term CART is an umbrella term used to refer to both of the above
procedures.

Introduced by
It was introduced by Breiman et al. in the book
    "Classification and Regression Trees"

Similarities
Regression and classification trees have some similarities; nevertheless, they
differ in the way the splitting at each node is done.
Ordinary Binary Classification Trees (OBCTs)

We will concentrate on
OBCT classification trees.
If you want to look at regression trees in more detail, see "Classification
and Regression Trees" by Breiman et al.
Important

Most of the work
It focuses on deciding which property test or query should be performed at the
node!!!

If the data test is numerical in nature
There is a way to visualize the decision boundaries produced by the decision
trees.
Definition OBCT

Definition
They are binary decision trees where the basic question is "Is $x_i \leq a_i$?"

Example
(The example figure showing an OBCT and its splits is not reproduced here.)
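As a small illustration (not taken from the slides), an OBCT node can be
represented by a feature index, a threshold $a_i$, and two children, and a
point is routed by repeatedly answering "Is $x_i \leq a_i$?"; the dict-based
node representation and the example tree below are assumptions made here:

    # Minimal sketch of routing a point through an OBCT.
    def route(node, x):
        while "feature" in node:               # internal node: ask x_i <= a_i ?
            if x[node["feature"]] <= node["threshold"]:
                node = node["yes"]
            else:
                node = node["no"]
        return node["class"]                   # leaf: class assignment

    tree = {"feature": 0, "threshold": 2.5,
            "yes": {"class": "omega_1"},
            "no":  {"feature": 1, "threshold": 1.0,
                    "yes": {"class": "omega_2"}, "no": {"class": "omega_3"}}}
    print(route(tree, [3.0, 0.5]))  # -> omega_2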
Training of an OBCT

We need first
At each node, the set of candidate questions to be asked has to be decided.
Each question corresponds to a specific binary split into two descendant
nodes.
Each node, $t$, is associated with a specific subset $X_t$ of the training
set $X$.
Training of a OBCT
We need firstAt each node, the set of candidate questions to be asked has to bedecided.Each question corresponds to a specific binary split into twodescendant nodes.Each node, t, is associated with a specific subset Xt of the trainingset X .
26 / 47
Training of a OBCT
We need firstAt each node, the set of candidate questions to be asked has to bedecided.Each question corresponds to a specific binary split into twodescendant nodes.Each node, t, is associated with a specific subset Xt of the trainingset X .
26 / 47
Splitting the Node $X_t$

Basically, we want to split the node into two groups with questions
$t_Y$ = "YES" and $t_N$ = "NO".

With Properties
$X_{t_Y} \cap X_{t_N} = \emptyset$
$X_{t_Y} \cup X_{t_N} = X_t$
Important

Given the question for each feature $k$, "Is $x_k \leq \alpha$?"
For each feature, every possible value of the threshold $\alpha$ defines a
specific split of the subset $X_t$.

Thus, in theory
An infinite set of questions has to be asked if $\alpha$ takes values in an
interval $Y_\alpha \subseteq \mathbb{R}$.

In practice
Only a finite set of questions can be considered.
For example

Since the number, $N$, of training points in $X$ is finite
Any of the features $x_k$ with $k = 1, \ldots, l$ can take at most $N_t \leq N$
different values.

Where
$N_t = |X_t|$ with $X_t \subset X$.

Then
For feature $x_k$, one can use $\alpha_{kn}$ with $n = 1, 2, \ldots, N_{tk}$ and
$N_{tk} \leq N_t$, where the $\alpha_{kn}$ are taken halfway between consecutive
distinct values of $x_k$ in the training subset $X_t$.
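A minimal sketch of generating these candidate thresholds for one numerical
feature, assuming the values are stored in a NumPy array (the function name is
a choice made here):

    import numpy as np

    def candidate_thresholds(x_k):
        """Candidate values alpha_kn for one feature at a node: midpoints
        between consecutive distinct values of x_k observed in X_t."""
        values = np.unique(x_k)                  # sorted distinct values
        return (values[:-1] + values[1:]) / 2.0  # at most N_t - 1 midpoints

    print(candidate_thresholds(np.array([2.0, 5.0, 2.0, 8.0])))  # [3.5 6.5]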
Then

We repeat this with all features
In such a case, the total number of candidate questions is

$\sum_{k=1}^{l} N_{tk}$    (1)

However
Only one of them has to be chosen to provide the binary split at the current
node, $t$, of the tree.

Thus
This is selected to be the one that leads to the best split of the associated
subset $X_t$.
The best split is decided according to a splitting criterion.
Criteria to be Found

Splitting criterion
A splitting criterion must be adopted according to which the best split from
the set of candidate ones is chosen.

Stop-splitting rule
A stop-splitting rule is required that controls the growth of the tree and
decides when a node is declared as a terminal one (leaf).

Rule
A rule is required that assigns each leaf to a specific class.
Looking for Homogeneity!!!

In order for the tree growing methodology to work
From the root node down to the leaves, every split must generate subsets that
are more homogeneous compared to the ancestor's subset $X_t$.

Meaning
The training feature vectors in each one of the new subsets show a stronger
preference for specific classes, whereas data in $X_t$ are more equally
distributed among the classes.

For example
Consider the task of classifying four classes and assume that the vectors in
subset $X_t$ are distributed among the classes with equal probability.
Thus

If we split the node so
$\omega_1$ and $\omega_2$ form $X_{t_Y}$.
$\omega_3$ and $\omega_4$ form $X_{t_N}$.

Then
$X_{t_Y}$ and $X_{t_N}$ are more homogeneous compared to $X_t$.

In other words
"Purer" in the decision tree terminology.
Our Goal

We need
To define a measure that quantifies node impurity.

Thus
The overall impurity of the descendant nodes is optimally decreased with
respect to the ancestor node's impurity.
Probabilistic Impurity

Assume the following probability that a vector in $X_t$ belongs to class $\omega_i$:

$P(\omega_i \mid t)$ for $i = 1, \cdots, M$    (2)
A Common Impurity

We define one of the most common impurities

$I(t) = -\sum_{i=1}^{M} P(\omega_i \mid t) \log_2 P(\omega_i \mid t)$

This is nothing more than Shannon's entropy!!!
Facts:
    $I(t)$ reaches its maximum when $P(\omega_i \mid t) = \frac{1}{M}$ for all $i$.
    $I(t) = 0$ if all data belong to a single class, i.e., $P(\omega_i \mid t) = 1$
    for only one class and $P(\omega_j \mid t) = 0$ for every $j \neq i$.
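A quick numerical check of these two facts, estimating $P(\omega_i \mid t)$
from the per-class counts at a node (a sketch; the function name is chosen
here and is reused in the later sketches):

    import numpy as np

    def entropy_impurity(counts):
        """I(t) = -sum_i P(w_i|t) log2 P(w_i|t), with P estimated from the
        per-class counts at node t; empty classes contribute 0."""
        counts = np.asarray(counts, dtype=float)
        p = counts / counts.sum()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    print(entropy_impurity([5, 5, 5, 5]))   # maximum for M = 4: log2(4) = 2.0
    print(entropy_impurity([20, 0, 0, 0]))  # single class: 0.0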
In reality...

We estimate

$P(\omega_i \mid t) = \frac{N_t^i}{N_t}$

where $N_t^i$ is the number of points in $X_t$ that belong to class $\omega_i$.

Assume now
If we perform a split, $N_{t_Y}$ points are sent into the "YES" node $X_{t_Y}$
and $N_{t_N}$ into the "NO" node $X_{t_N}$.
Decrease in node impurity

Then
In a recursive way, we define the term decrease in node impurity as:

$\Delta I(t) = I(t) - \frac{N_{t_Y}}{N_t} I(t_Y) - \frac{N_{t_N}}{N_t} I(t_N)$    (3)

where $I(t_Y)$ and $I(t_N)$ are the impurities of the $t_Y$ and $t_N$ nodes.
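Continuing the sketch above, the decrease in node impurity for a candidate
split can be computed from the class counts of the parent node and of the two
candidate children (again an illustration, not code from the slides):

    def impurity_decrease(counts_t, counts_yes, counts_no):
        """Delta I(t) = I(t) - (N_tY/N_t) I(t_Y) - (N_tN/N_t) I(t_N),
        using the entropy impurity sketched earlier."""
        n_t = float(sum(counts_t))
        return (entropy_impurity(counts_t)
                - (sum(counts_yes) / n_t) * entropy_impurity(counts_yes)
                - (sum(counts_no) / n_t) * entropy_impurity(counts_no))

    # The four-class example: sending omega_1, omega_2 to "YES" and omega_3,
    # omega_4 to "NO" removes exactly one bit of uncertainty.
    print(impurity_decrease([5, 5, 5, 5], [5, 5, 0, 0], [0, 0, 5, 5]))  # 1.0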
The Final Goal

To adopt, from the set of candidate questions, the one that performs the split
with the highest decrease of impurity.
Stop-Splitting Rule

Now
The natural question that now arises is when one decides to stop splitting a
node and declares it as a leaf of the tree.

For example, you can adopt
A threshold $T$ and stop splitting if the maximum value of $\Delta I(t)$ over
all possible splits is less than $T$.

Other possibilities
Stop if the subset $X_t$ is small enough.
Stop if the subset $X_t$ is pure, in the sense that all points in it belong to
a single class.
Once a node is declared to be a leaf

Class Assignment Rule
Once a node is declared a leaf, we assign the leaf to a class using the rule:

$j = \arg\max_{i} P(\omega_i \mid t)$
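In code, with $P(\omega_i \mid t)$ estimated from the leaf's class counts,
this rule is a single argmax (a sketch using 0-based class indices):

    import numpy as np

    def leaf_class(counts):
        """j = argmax_i P(w_i|t); P estimated from class counts at the leaf."""
        return int(np.argmax(counts))

    print(leaf_class([2, 7, 1]))  # -> 1, i.e. the second class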
Final Algorithm

Algorithm
1 Begin with the root node, that is, $X_t = X$.
2 For each new node $t$:
3     For every feature $x_k$, $k = 1, 2, \ldots, l$:
4         For every value $\alpha_{kn}$, $n = 1, 2, \ldots, N_{tk}$:
5             Generate $X_{t_Y}$ and $X_{t_N}$ according to the answer to the question
6             "Is $x_k(i) \leq \alpha_{kn}$?", $i = 1, 2, \ldots, N_t$.
7             Compute the impurity decrease.
8         Choose $\alpha_{kn_0}$ leading to the maximum decrease with respect to $x_k$.
9     Choose $x_{k_0}$ and the associated $\alpha_{k_0 n_0}$ leading to the overall maximum decrease of impurity.
10    If the stop-splitting rule is met, declare node $t$ as a leaf and designate it with a class label.
11    If not, generate two descendant nodes $t_Y$ and $t_N$ with associated subsets $X_{t_Y}$ and $X_{t_N}$,
12    depending on the answer to the question: "Is $x_{k_0} \leq \alpha_{k_0 n_0}$?"
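Putting the pieces together, the following is a minimal sketch of the
algorithm above, reusing candidate_thresholds, entropy_impurity and
impurity_decrease from the earlier sketches; the dict-based tree, the
parameter names and the default stopping values (a threshold T = min_decrease
and a minimum node size) are choices made here for illustration, not part of
the original slides:

    import numpy as np

    def grow_obct(X, y, min_decrease=1e-3, min_samples=2):
        """Recursively grow an OBCT on data X (N x l) with labels y."""
        classes = np.unique(y)
        counts = [int(np.sum(y == c)) for c in classes]
        node = {"class": classes[int(np.argmax(counts))]}   # majority class of X_t
        if len(y) < min_samples or len(classes) == 1:       # X_t small or pure: leaf
            return node
        best_gain, best_k, best_alpha = 0.0, None, None
        for k in range(X.shape[1]):                          # every feature x_k
            for alpha in candidate_thresholds(X[:, k]):      # every alpha_kn
                mask = X[:, k] <= alpha                      # "Is x_k(i) <= alpha_kn?"
                gain = impurity_decrease(
                    counts,
                    [int(np.sum(y[mask] == c)) for c in classes],
                    [int(np.sum(y[~mask] == c)) for c in classes])
                if gain > best_gain:
                    best_gain, best_k, best_alpha = gain, k, alpha
        if best_gain < min_decrease:                         # stop-splitting rule
            return node
        mask = X[:, best_k] <= best_alpha
        node.update(feature=best_k, threshold=best_alpha,
                    yes=grow_obct(X[mask], y[mask], min_decrease, min_samples),
                    no=grow_obct(X[~mask], y[~mask], min_decrease, min_samples))
        return node

    X = np.array([[2.0, 1.0], [3.0, 1.5], [8.0, 1.2], [9.0, 0.8]])
    y = np.array([0, 0, 1, 1])
    print(grow_obct(X, y))  # splits on feature 0 at 5.5, then two pure leaves

The resulting dict uses the same node representation as the routing sketch
given after the OBCT definition, so route(tree, x) can be used to classify new
points.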
Conclusions

Remark
Decision trees have emerged as one of the most popular methods of
classification.

Remark
A variety of node impurity measures can be defined.

Remark
The size of the tree needs to be controlled; the threshold $T$ can lead to
incorrect sizes.

Remark
A drawback associated with tree classifiers is their high variance: a
different training data set often results in a very different tree.