COMP 551 – Applied Machine Learning
Lecture 10: Decision trees · 2018-02-06

Page 1

COMP 551 – Applied Machine Learning – Lecture 10: Decision trees

Unless otherwise noted, all material posted for this course is copyright of the instructors and cannot be reused or reposted without the instructors’ written permission.

Instructor: Herke van Hoof ([email protected])

Slides mostly by: Joelle Pineau

Class web page: www.cs.mcgill.ca/~hvanho2/comp551

Page 2

Homework deadline

• The deadline is moved to the 12th, since the homework was released late.

• Make sure you understand the problem before the weekend, as you’ll still be able to ask for help until Friday:

– TA & instructor office hours

– MyCourses

– To make sure all information is available to everyone, I won’t answer project questions by e-mail (exception: personal issues)

Page 3

Back to machine learning: A quick recap

• Linear regression: Fit a linear function from input data to output.

• Linear classification: Find a linear hyperplane separating classes.

• Many problems require more sophisticated models!

Page 4

Today’s lecture

• Wrap-up SVMs

• New model for classification and regression:

Decision trees

Page 5

Kernelizing other ML algorithms

• Many other machine learning algorithms have a “dual formulation”, in which dot-products of features can be replaced by kernels.

• Examples:

– Perceptron

– Logistic regression

– Linear regression (We’ll do this later in the course!)
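
As a sketch of what this looks like in practice, here is a minimal dual-form perceptron in Python (my own illustrative code, not from the course; the RBF kernel and the fixed epoch count are stand-ins for whichever kernel and stopping rule you actually use):

    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        # Gaussian kernel: an implicit dot-product in feature space.
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def train_kernel_perceptron(X, y, kernel=rbf_kernel, epochs=10):
        # Dual perceptron: the weight vector is never formed explicitly;
        # it is an implicit sum over training points, so the algorithm
        # only ever needs kernel evaluations k(x_j, x_i).
        n = len(X)
        alpha = np.zeros(n)  # mistake counts, one per training point
        for _ in range(epochs):
            for i in range(n):
                score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
                if y[i] * score <= 0:  # mistake (or undecided) -> update
                    alpha[i] += 1
        return alpha

    # Prediction on a new x: sign( sum_j alpha[j] * y[j] * kernel(X[j], x) )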

Page 6

Multiple classes

• One-vs-All: Learn K separate binary classifiers.

– Can lead to inconsistent results.

– Training sets are imbalanced: assuming n examples per class, each binary classifier is trained with a positive class of n examples and a negative class of (K−1)·n examples.

Page 7

Multiple classes

• Multi-class SVM: Define the margin to be the gap between the correct class and the nearest other class.
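
A minimal sketch of the One-vs-All scheme described above (my own illustrative code; I assume K linear scorers w_k, and resolve the inconsistencies by taking the class with the highest score):

    import numpy as np

    def one_vs_all_predict(W, x):
        # W: list of K weight vectors, one "class k vs. rest" scorer each.
        # Taking the argmax of the raw scores resolves cases where zero
        # or several binary classifiers claim the example.
        scores = [w @ x for w in W]
        return int(np.argmax(scores))

    # Example with K = 3 classes on 2-d inputs (weights are made up):
    W = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
    print(one_vs_all_predict(W, np.array([2.0, 1.0])))  # -> 0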

Page 8

SVMs for regression

• Minimize a regularized error function:

ŵ = argmin_w C ∑_{i=1:n} (y_i − wᵀx_i)² + ½ ||w||²

• Introduce slack variables to optimize “tube” around the regression function.

Page 9

SVMs for regression

• Typically, relax to an ε-insensitive error on the linear target to ensure a sparse solution (i.e., few support vectors):

ŵ = argmin_w C ∑_{i=1:n} E_ε(y_i − wᵀx_i) + ½ ||w||²

where E_ε(z) = 0 if |z| < ε, and |z| − ε otherwise.
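
A one-function sketch of E_ε (my reading of the slide, with the usual absolute value on the residual):

    def eps_insensitive_error(residual, eps):
        # Zero inside the eps-"tube" around the regression function,
        # growing linearly outside it; this is what makes most training
        # points drop out of the solution (sparsity).
        return max(0.0, abs(residual) - eps)

    print(eps_insensitive_error(0.05, 0.1))  # 0.0 (inside the tube)
    print(eps_insensitive_error(0.30, 0.1))  # ≈ 0.2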

Page 10

What you should know

From last class and from today:

• Perceptron algorithm.

• Margin definition for linear SVMs.

• Use of Lagrange multipliers to transform optimization problems.

• Primal and dual optimization problems for SVMs.

• Feature space version of SVMs.

• The kernel trick and examples of common kernels.

Page 11

Main types of machine learning problems

• Supervised learning

– Classification

– Regression

• Unsupervised learning

• Reinforcement learning


Decision trees

Page 12

Richer partitioning of the input space


Image from http://www.euclideanspace.com

Page 13

Example: Cancer Outcome Prediction

• Researchers computed 30 different features (or attributes) of the cancer cells’ nuclei.

• Vectors in the form ((x1, x2, x3, …), y) = (X, y)

• Data set D in the form of a data table:

o Features relate to radius, texture, perimeter, smoothness, concavity, etc. of the nuclei
o For each image: mean, standard error, and max of these properties across nuclei

Page 14

Decision tree example

R = recurrence, N = no recurrence

• What does a node represent?

– A partitioning of the input space.

• Internal nodes are tests on the values of different features.
• Tests don’t need to be binary.

Page 15

Decision tree example

• Leaf nodes include the set of training examples that satisfy the tests along the branch.

• Each training example falls in precisely one leaf.
• Each leaf typically contains more than one example.

Page 16

Using decision trees for classification

• Suppose we get a new instance:

radius=18, texture=12, …

How do we classify it?

Page 17

Using decision trees for classification

• Simple procedure:

– At every node, test the corresponding attribute.

– Follow the appropriate branch of the tree.

– At a leaf, either predict the class of the majority of the examples for that leaf, or sample from the probabilities of the two classes.
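
This procedure is a few lines of Python once a tree representation is fixed. The nested-tuple encoding below is my own, not anything prescribed by the course: ("node", feature, {outcome: subtree}) for internal nodes, ("leaf", label) for leaves:

    def classify(tree, instance):
        # Walk from the root to a leaf, at each internal node following
        # the branch that matches the instance's value for the tested feature.
        while tree[0] == "node":
            _, feature, children = tree
            tree = children[instance[feature]]
        return tree[1]  # the (majority) class stored at the leaf

    # Example with a toy (hypothetical) tree on one discrete feature:
    tree = ("node", "radius_large", {"yes": ("leaf", "R"), "no": ("leaf", "N")})
    print(classify(tree, {"radius_large": "yes"}))  # -> "R"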

Page 18

Interpreting decision trees

Can always convert a decision tree into an equivalent set of if-then rules.

Page 19

Interpreting decision trees

Can also calculate an estimated probability of recurrence.

Page 20

More formally: Tests

• Each internal node contains a test.

– Test typically depends on only one feature.

– For discrete features, we typically branch on all possibilities

– For real features, we typically branch on a threshold value

Page 21

More formally: Tests

• A test returns a discrete outcome (e.g., true/false for a threshold test x_j > τ).

• Learning = choosing the tests at every node and the shape of the tree.

– A finite set of candidate tests is chosen before learning the tree.

Page 22

Example

COMP-551: Applied Machine Learning


14.4. Tree-based Models

There are various simple, but widely used, models that work by partitioning the input space into cuboid regions, whose edges are aligned with the axes, and then assigning a simple model (for example, a constant) to each region. They can be viewed as a model combination method in which only one model is responsible for making predictions at any given point in input space. The process of selecting a specific model, given a new input x, can be described by a sequential decision making process corresponding to the traversal of a binary tree (one that splits into two branches at each node). Here we focus on a particular tree-based framework called classification and regression trees, or CART (Breiman et al., 1984), although there are many other variants going by such names as ID3 and C4.5 (Quinlan, 1986; Quinlan, 1993).

Figure 14.5 shows an illustration of a recursive binary partitioning of the input space, along with the corresponding tree structure. In this example, the first step

Figure 14.5: Illustration of a two-dimensional input space that has been partitioned into five regions using axis-aligned boundaries. [Figure: regions A–E, thresholds θ1–θ4, axes x1 and x2.]

Page 23

Example (continued)

Figure 14.6: Binary tree corresponding to the partitioning of input space shown in Figure 14.5.

[Figure: internal nodes test x1 > θ1, x2 ≤ θ2, x2 > θ3, x1 ≤ θ4; the leaves are the regions A–E.]

divides the whole of the input space into two regions according to whether x1 ≤ θ1 or x1 > θ1, where θ1 is a parameter of the model. This creates two subregions, each of which can then be subdivided independently. For instance, the region x1 ≤ θ1 is further subdivided according to whether x2 ≤ θ2 or x2 > θ2, giving rise to the regions denoted A and B. The recursive subdivision can be described by the traversal of the binary tree shown in Figure 14.6. For any new input x, we determine which region it falls into by starting at the top of the tree at the root node and following a path down to a specific leaf node according to the decision criteria at each node. Note that such decision trees are not probabilistic graphical models.

Within each region, there is a separate model to predict the target variable. For instance, in regression we might simply predict a constant over each region, or in classification we might assign each region to a specific class. A key property of tree-based models, which makes them popular in fields such as medical diagnosis, for example, is that they are readily interpretable by humans because they correspond to a sequence of binary decisions applied to the individual input variables. For instance, to predict a patient’s disease, we might first ask “is their temperature greater than some threshold?”. If the answer is yes, then we might next ask “is their blood pressure less than some threshold?”. Each leaf of the tree is then associated with a specific diagnosis.

In order to learn such a model from a training set, we have to determine the structure of the tree, including which input variable is chosen at each node to form the split criterion as well as the value of the threshold parameter θi for the split. We also have to determine the values of the predictive variable within each region.

Consider first a regression problem in which the goal is to predict a single target variable t from a D-dimensional vector x = (x1, . . . , xD)ᵀ of input variables. The training data consists of input vectors {x1, . . . , xN} along with the corresponding continuous labels {t1, . . . , tN}. If the partitioning of the input space is given, and we minimize the sum-of-squares error function, then the optimal value of the predictive variable within any given region is just given by the average of the values of tn for those data points that fall in that region.

Now consider how to determine the structure of the decision tree. Even for a fixed number of nodes in the tree, the problem of determining the optimal structure (including choice of input variable for each split as well as the corresponding threshold…

Page 24

Expressivity of decision trees

What kind of function can be encoded via a decision tree?

Page 25

Expressivity of decision trees

• Every Boolean function can be fully expressed

– Each entry in the truth table could be one path (very inefficient!)

– Most Boolean functions can be encoded more compactly.

Page 26

Expressivity of decision trees

[Figure: a full tree for a two-variable Boolean function: the root tests x1, each branch tests x2, with four leaves Y=t, Y=t, Y=t, Y=f.]

Page 27

Expressivity of decision trees

[Figure: a compact tree for the same function: the root tests x1; one branch is the leaf Y=t, the other tests x2 with leaves Y=t and Y=f.]

Page 28

Expressivity of decision trees

• Some functions are harder to encode

– Parity function: returns 1 iff an even number of inputs are 1
• An exponentially big decision tree, O(2^M) nodes, would be needed

– Majority function: returns 1 if more than half the inputs are 1

Page 29

Expressivity of decision trees

• Many other functions can be approximated by a Boolean function.

• With real-valued features, decision trees are good at problems in which the class label is constant in large connected axis-orthogonal regions of the input space.

Page 30

Learning decision trees

• We could enumerate all possible trees (assuming the number of possible tests is finite)

– Each tree could be evaluated using the training set or, better yet, a test set

– There are many possible trees! We’d probably overfit the data.

Page 31

Learning decision trees

• Usually, decision trees are constructed in two phases:

1. A recursive, top-down procedure “grows” a tree (possibly until the training data is completely fit).

2. The tree is “pruned” back to avoid overfitting.

Pages 32–33

Top-down (recursive) induction of decision trees

Given a set of labeled training instances:

1. If all the training instances have the same class, create a leaf with that class label and exit.

Else

2. Pick the best test to split the data on.

3. Split the training set according to the value of the outcome of the test.

4. Recursively repeat steps 1 - 3 on each subset of the training data.

How do we pick the best test?

Page 34

What is a good test?

• The test should provide information about the class label.

E.g., you are given 40 examples: 30 positive, 10 negative. Consider two tests, T1 and T2, that would split the examples as follows. Which is best?

Page 35

What is a good test?

• Intuitively, we prefer an attribute that separates the training instances as well as possible. How do we quantify this (mathematically)?

Page 36

What is information?

• Consider three cases:

– You are about to observe the outcome of a dice roll

– You are about to observe the outcome of a coin flip

– You are about to observe the outcome of a biased coin flip

• Intuitively, in each situation, you have a different amount of uncertainty as to what outcome / message you will observe.

Page 37

Information content

• Let E be an event that occurs with probability P(E). If we are told that E has occurred with certainty, then we have received I(E) = log2(1/P(E)) bits of information.

Page 38

Information content

• You can also think of information as the amount of “surprise” in the outcome (e.g., consider P(E) = 1; then I(E) = 0). E.g.:

• A fair coin flip provides log2 2 = 1 bit of information.

• A fair die roll provides log2 6 ≈ 2.58 bits of information.

Page 39

Entropy

• Given an information source S which emits symbols from an alphabet {s1, . . . , sk} with probabilities {p1, . . . , pk}.

• Each emission is independent of the others. What is the average amount of information we expect from the output of S?

H(S) = ∑_{i=1:k} p_i log2(1/p_i) = −∑_{i=1:k} p_i log2 p_i

H(S) is the entropy of S.

• Note that this depends only on the probability distribution, and not on the actual alphabet. So we can write H(P).
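
A small sketch of this quantity in Python (illustrative only; base-2 logs throughout, so the unit is bits):

    import math

    def entropy(probs):
        # H(P) = sum_i p_i * log2(1 / p_i); terms with p_i = 0 contribute 0.
        return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
    print(entropy([1/6] * 6))    # fair die: ~2.585 bits
    print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits
    print(entropy([1.0]))        # certain outcome: 0 bits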

Page 40

Entropy

• Several ways to think about entropy:

– Average amount of information per symbol.

– Average amount of surprise when observing the symbol.

– Uncertainty the observer has before seeing the symbol.

– Average number of bits needed to communicate the symbol.

Page 41

Binary Classification

• We try to classify sample data using a decision tree.

• Suppose we have p positive samples and n negative samples.

• What is the entropy of this data set?

H(D) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))

Page 42

What is a good test?

H(D) = −(3/4) log2(3/4) − (1/4) log2(1/4) = 0.811

Page 43

Conditional entropy

• If we know that x takes value v, we can compute the specific conditional entropy H(y|x = v).

• The conditional entropy, H(y|x), is the average specific conditional entropy of y given the values of x:

H(y|x) = ∑_v P(x = v) H(y|x = v)

• Interpretation: the expected number of bits needed to transmit y if both the emitter and receiver know the possible values of x (but before they are told x’s specific value).


Example: Conditional entropy

x = HasKids    y = OwnsDumboVideo

Yes Yes

Yes Yes

Yes Yes

Yes Yes

No No

No No

Yes No

Yes No

Specific conditional entropy is the uncertainty in y given a particular x value. E.g.,

• P(y = YES | x = YES) = 2/3, P(y = NO | x = YES) = 1/3

• H(y|x = YES) = (2/3) log2(1/(2/3)) + (1/3) log2(1/(1/3)) ≈ 0.9183


Conditional entropy

• The conditional entropy, H(y|x), is the average specific conditional entropy of y given the values for x:

H(y|x) = ∑_v P(x = v) H(y|x = v)

• E.g. (from previous slide),

– H(y|x = YES) = (2/3) log2(1/(2/3)) + (1/3) log2(1/(1/3)) ≈ 0.9183

– H(y|x = NO) = 1 · log2(1/1) = 0 (all examples with x = NO have y = NO)

– H(y|x) = H(y|x = YES) P(x = YES) + H(y|x = NO) P(x = NO) = 0.9183 × 3/4 + 0 × 1/4 ≈ 0.6887


H(y|x = v) = −∑_s p(y = s|x = v) log2 p(y = s|x = v)
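
A quick check of the HasKids/OwnsDumboVideo numbers in Python (my own helper functions; the data lists mirror the eight rows of the table above):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

    def conditional_entropy(x, y):
        # H(y|x) = sum_v P(x = v) * H(y | x = v)
        n, h = len(x), 0.0
        for v in set(x):
            sub = [yi for xi, yi in zip(x, y) if xi == v]
            h += (len(sub) / n) * entropy(sub)
        return h

    x = ["Yes"] * 4 + ["No"] * 2 + ["Yes"] * 2   # HasKids
    y = ["Yes"] * 4 + ["No"] * 2 + ["No"] * 2    # OwnsDumboVideo
    print(entropy(["Yes"] * 4 + ["No"] * 2))     # H(y|x=YES) ≈ 0.9183
    print(conditional_entropy(x, y))             # H(y|x)     ≈ 0.6887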

Page 44

What is a good test?

[Figure: T1 splits the 40 examples into {20+, 10−} and {10+, 0−}; T2 splits them into {15+, 7−} and {15+, 3−}.]

Page 45

What is a good test?

H(D) = −(3/4) log2(3/4) − (1/4) log2(1/4) = 0.811

H(D|T1) = (30/40)[−(20/30) log2(20/30) − (10/30) log2(10/30)] + (10/40)[0] = 0.688

H(D|T2) = (22/40)[−(15/22) log2(15/22) − (7/22) log2(7/22)] + (18/40)[−(15/18) log2(15/18) − (3/18) log2(3/18)] = 0.788


Page 46

Information gain

• The reduction in entropy that would be obtained by knowing x: IG(x) = H(y) − H(y|x)

• Equivalently, suppose one has to transmit y. How many bits (on average) would it save if both the transmitter and receiver knew x?
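
Using the 40-example splits from the previous slides, the gains follow directly (a quick check in Python; here H takes class counts rather than probabilities):

    import math

    def H(pos, neg):
        # Entropy (in bits) of a node with `pos` positive and `neg` negative examples.
        total = pos + neg
        return sum((c / total) * math.log2(total / c) for c in (pos, neg) if c > 0)

    h_d  = H(30, 10)                                   # 0.811
    h_t1 = (30/40) * H(20, 10) + (10/40) * H(10, 0)    # ≈ 0.689 (0.688 on the slide)
    h_t2 = (22/40) * H(15, 7) + (18/40) * H(15, 3)     # ≈ 0.788
    print(h_d - h_t1)  # IG(T1) ≈ 0.12 -> T1 is the better test
    print(h_d - h_t2)  # IG(T2) ≈ 0.02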

Page 47

Recursive learning of decision trees

Given a set of labeled training instances:

1. If all training instances have the same class, create a leaf with that class label and exit.

Else:

2. Pick the best test to split the data on.

3. Split the training set according to the value of the test outcome.

4. Recursively repeat steps 1 to 3 on each subset of training data.

How do we pick the best test?

Page 48

Recursive learning of decision trees

How do we pick the best test?

– For classification: choose the test with the highest information gain.

– For regression: choose the test with the lowest mean-squared error.
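
A compact sketch of the whole recursion for classification with discrete features (illustrative only, using the nested-tuple tree representation from the classification sketch earlier; examples are (feature-dict, label) pairs):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

    def info_gain(examples, feature):
        # IG = H(labels) - H(labels | feature)
        labels = [y for _, y in examples]
        gain, n = entropy(labels), len(examples)
        for value in {x[feature] for x, _ in examples}:
            sub = [y for x, y in examples if x[feature] == value]
            gain -= (len(sub) / n) * entropy(sub)
        return gain

    def grow_tree(examples, features):
        labels = [y for _, y in examples]
        if len(set(labels)) == 1 or not features:
            # Step 1: pure node (or no tests left) -> leaf with majority class.
            return ("leaf", Counter(labels).most_common(1)[0][0])
        best = max(features, key=lambda f: info_gain(examples, f))      # step 2
        children = {}
        for value in {x[best] for x, _ in examples}:                    # step 3
            subset = [(x, y) for x, y in examples if x[best] == value]
            rest = [f for f in features if f != best]
            children[value] = grow_tree(subset, rest)                   # step 4
        return ("node", best, children)

    # Usage: tree = grow_tree(examples, feature_names), then classify(tree, x).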

Page 49

Caveats on tests with multiple values

• If the outcome of a test is not binary, the number of possible values influences the information gain.

– The more possible values, the higher the gain!

– Nonetheless, the attribute could be irrelevant.

• Could transform the attribute into one (or many) binary attributes.

Page 50

Caveats on tests with multiple values

• C4.5 (the most popular decision tree construction algorithm in the ML community) uses only binary tests:

– Attribute = Value (discrete) or Attribute < Value (continuous)

• Other approaches consider smarter metrics which account for the number of possible outcomes.

Page 51

Tests for real-valued features

• Suppose feature j is real-valued

• How do we choose a finite set of possible thresholds, for tests of the form x_j > τ?

Page 52

Tests for real-valued features

• Choose midpoints of the observed data values, x_{1,j}, . . . , x_{m,j}

• Choose midpoints of data values with different y values (sketched below)
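
A small sketch of the second strategy (my own helper; `points` is assumed to be a list of (feature value, label) pairs for one feature j):

    def candidate_thresholds(points):
        # Midpoints between consecutive (sorted) examples whose labels
        # differ; thresholds between same-label neighbours cannot be optimal.
        points = sorted(points)
        return [(v1 + v2) / 2
                for (v1, y1), (v2, y2) in zip(points, points[1:])
                if y1 != y2]

    print(candidate_thresholds([(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]))  # [2.5]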

Page 53

A complete (artificial) example

• An artificial binary classification problem with two real-valued input features:

Page 54

A complete (artificial) example

• The decision tree, graphically:

Page 55


Assessing Performance

• Split data into a training set and a validation set

• Apply learning algorithm to training set.

• Measure error with the validation set.

[Figure: learning curve showing proportion correct on the test set vs. training set size; from R&N, p.703.]

Page 56

Overfitting in decision trees

• Decision tree construction proceeds until all leaves are “pure”, i.e., all examples are from the same class.

• As the tree grows, the generalization performance can start to degrade, because the algorithm is including irrelevant attributes/tests/outliers.

• How can we avoid this?

Page 57

Avoiding Overfitting

• Objective: Remove some nodes to get better generalization.

Page 58

Avoiding Overfitting

• Early stopping: Stop growing the tree when further splitting the data does not improve the information gain on the validation set.

• Post pruning: Grow a full tree, then prune the tree by eliminating lower nodes that have low information gain on the validation set.

Page 59

Avoiding Overfitting

• In general, post pruning is better. It allows you to deal with cases where a single attribute is not informative, but a combination of attributes is informative.

Page 60

Reduced-error pruning (aka post pruning)

• Split the data set into a training set and a validation set

• Grow a large tree (e.g. until each leaf is pure)

• For each node:

- Evaluate the validation set accuracy of pruning the subtree rooted at the node

- Greedily remove the node that most improves validation set accuracy, with its corresponding subtree

- Replace the removed node by a leaf with the majority class of the corresponding examples

• Stop when pruning starts hurting the accuracy on validation set.
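
A sketch of this loop in Python, reusing the nested-tuple trees and `classify` from the earlier sketches (illustrative only; a real implementation would cache which training examples reach which node):

    from collections import Counter

    def accuracy(tree, data):
        # `classify` is the traversal function from the classification sketch.
        return sum(classify(tree, x) == y for x, y in data) / len(data)

    def internal_nodes(tree, path=()):
        # Yield the branch-value path to every internal node.
        if tree[0] == "node":
            yield path
            for value, child in tree[2].items():
                yield from internal_nodes(child, path + (value,))

    def replace_at(tree, path, new_subtree):
        # Return a copy of `tree` with the subtree at `path` replaced.
        if not path:
            return new_subtree
        _, feature, children = tree
        children = dict(children)
        children[path[0]] = replace_at(children[path[0]], path[1:], new_subtree)
        return ("node", feature, children)

    def examples_reaching(examples, tree, path):
        # Training examples routed to the node at `path`.
        for value in path:
            examples = [(x, y) for x, y in examples if x[tree[1]] == value]
            tree = tree[2][value]
        return examples

    def prune(tree, train, val):
        while True:
            best, best_acc = None, accuracy(tree, val)
            for path in internal_nodes(tree):
                labels = [y for _, y in examples_reaching(train, tree, path)]
                if not labels:
                    continue  # no training examples reach this node
                leaf = ("leaf", Counter(labels).most_common(1)[0][0])
                candidate = replace_at(tree, path, leaf)
                acc = accuracy(candidate, val)
                if acc >= best_acc:  # greedy: prune while it doesn't hurt
                    best, best_acc = candidate, acc
            if best is None:
                return tree  # further pruning would hurt validation accuracy
            tree = best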

Page 61

Example: Effect of reduced-error pruning

Page 62

Advantages of decision trees

• Provide a general representation of classification rules

– Scaling / normalization not needed, as we use no notion of “distance” between examples.

• The learned function, y = h(x), is easy to interpret.

• Fast learning algorithms (e.g. C4.5, CART).

• Good accuracy in practice – many applications in industry!

Page 63

Limitations

• Sensitivity:

– Exact tree output may be sensitive to small changes in data

– With many features, tests may not be meaningful

• Good for learning (nonlinear) piecewise axis-orthogonal decision boundaries, but not for learning functions with smooth, curvilinear boundaries.

– Some extensions use non-linear tests in internal nodes.

Page 64

What you should know

• How to use a decision tree to classify a new example.

• How to build a decision tree using an information-theoretic approach.

• How to detect (and fix) overfitting in decision trees.

• How to handle both discrete and continuous attributes and outputs.