Data Mining 2021 Classification Trees (1)


Ad Feelders

Universiteit Utrecht


Classification

Predict the class of an object on the basis of some of its attributes. For example, predict:

Good/bad credit for loan applicants, using
  income
  age
  ...

Spam/no spam for e-mail messages, using
  % of words matching a given word (e.g. "free")
  use of CAPITAL LETTERS
  ...

Music Genre (Rock, Techno, Death Metal, ...) based on audio features and lyrics.


Building a classification model

The basic idea is to build a classification model using a set of training examples. Each training example contains attribute values and the corresponding class label.

There are many techniques to do that:

Statistical Techniques

Discriminant Analysis
Logistic Regression

Data Mining/Machine Learning

Classification Trees
Bayesian Network Classifiers
Neural Networks
Support Vector Machines
...


Strong and Weak Points of Classification Trees

Strong points:

Are easy to interpret (if not too large).

Select relevant attributes automatically.

Can handle both numeric and categorical attributes.

Weak point:

Single trees are usually not among the top performers.

However:

Averaging multiple trees (bagging, random forests) can bring them back to the top!

But ease of interpretation suffers as a consequence.


Example: Loan Data

Record  age  married?  own house  income  gender  class
     1   22  no        no         28,000  male    bad
     2   46  no        yes        32,000  female  bad
     3   24  yes       yes        24,000  male    bad
     4   25  no        no         27,000  male    bad
     5   29  yes       yes        32,000  female  bad
     6   45  yes       yes        30,000  female  good
     7   63  yes       yes        58,000  male    good
     8   36  yes       no         52,000  male    good
     9   23  no        yes        40,000  female  good
    10   50  yes       yes        28,000  female  good


Credit Scoring Tree

Each node shows its class counts [bad, good] and the records it contains:

[5 bad, 5 good]  records 1…10
├── income > 36,000: [0 bad, 3 good]  records 7,8,9
└── income ≤ 36,000: [5 bad, 2 good]  records 1…6,10
    ├── age > 37: [1 bad, 2 good]  records 2,6,10
    │   ├── married: [0 bad, 2 good]  records 6,10
    │   └── not married: [1 bad, 0 good]  record 2
    └── age ≤ 37: [4 bad, 0 good]  records 1,3,4,5


Cases with income > 36,000

Record  age  married?  own house  income  gender  class
     1   22  no        no         28,000  male    bad
     2   46  no        yes        32,000  female  bad
     3   24  yes       yes        24,000  male    bad
     4   25  no        no         27,000  male    bad
     5   29  yes       yes        32,000  female  bad
     6   45  yes       yes        30,000  female  good
     7   63  yes       yes        58,000  male    good
     8   36  yes       no         52,000  male    good
     9   23  no        yes        40,000  female  good
    10   50  yes       yes        28,000  female  good

Records 7, 8 and 9 are the cases with income > 36,000; all three are good.


Partitioning the attribute space

[Figure: the ten loan applicants plotted in the (age, income) plane, income in thousands, with five cases labelled bad and five labelled good. The lines income = 36 and age = 37 partition the plane into rectangular regions: Good for income > 36, Good for income ≤ 36 and age > 37, and Bad for income ≤ 36 and age ≤ 37.]


Why not split on gender in top node?

Each node shows its class counts [bad, good] and the records it contains:

[5 bad, 5 good]  records 1…10
├── gender = male: [3 bad, 2 good]  records 1,3,4,7,8
└── gender = female: [2 bad, 3 good]  records 2,5,6,9,10

Intuitively: learning the value of gender doesn't provide much information about the class label.


Impurity of a node

We strive towards nodes that are pure in the sense that they only contain observations of a single class.

We need a measure that indicates "how far" a node is removed from this ideal.

We call such a measure an impurity measure.


Impurity function

The impurity i(t) of a node t is a function of the relative frequencies of the classes in that node:

i(t) = φ(p1, p2, …, pJ)

where the pj (j = 1, …, J) are the relative frequencies of the J different classes in node t.

Sensible requirements of any quantification of impurity:

1. It should be at a maximum when the observations are distributed evenly over all classes.

2. It should be at a minimum when all observations belong to a single class.

3. It should be a symmetric function of p1, …, pJ.


Quality of a split (test)

We define the quality of a binary split s in node t as the reduction of impurity that it achieves:

∆i(s, t) = i(t) − {π(ℓ) i(ℓ) + π(r) i(r)}

where ℓ is the left child of t, r is the right child of t, π(ℓ) is the proportion of cases sent to the left, and π(r) the proportion of cases sent to the right.

[Figure: node t, with impurity i(t), splits into a left child ℓ receiving proportion π(ℓ) of the cases (impurity i(ℓ)) and a right child r receiving proportion π(r) (impurity i(r)).]
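As a sketch of how this computation could look in code (the function and argument names below are my own, not from the course materials): given the class counts in the two children and an impurity function φ, the parent counts and the proportions π(ℓ) and π(r) follow directly.

```python
import numpy as np

def impurity_reduction(counts_left, counts_right, phi):
    """Quality of a split: i(t) - { pi(l)*i(l) + pi(r)*i(r) }.

    counts_left, counts_right: class counts in the left and right child.
    phi: impurity function applied to a vector of class proportions.
    """
    left = np.asarray(counts_left, dtype=float)
    right = np.asarray(counts_right, dtype=float)
    n_left, n_right = left.sum(), right.sum()
    n = n_left + n_right
    i_parent = phi((left + right) / n)                 # i(t)
    i_children = (n_left / n) * phi(left / n_left) \
               + (n_right / n) * phi(right / n_right)  # weighted child impurity
    return i_parent - i_children
```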


Well known impurity functions

Impurity functions we consider:

Resubstitution error

Gini index (CART, rpart)

Entropy (C4.5, rpart)


Resubstitution error

Measures the fraction of cases that is classified incorrectly if we assign every case in node t to the majority class in that node. That is,

i(t) = 1 − max_j p(j|t)

where p(j|t) is the relative frequency of class j in node t.
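In code this is a one-liner; it plugs directly into the impurity_reduction sketch above (again, the names are mine):

```python
import numpy as np

def resubstitution_error(p):
    """i(t) = 1 - max_j p(j|t), for a vector p of class proportions."""
    return 1.0 - np.max(p)

# For example, a node with 5 bad and 2 good cases:
print(resubstitution_error(np.array([5 / 7, 2 / 7])))   # 0.2857... = 2/7
```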


Resubstitution error: credit scoring tree

Class counts are [bad, good]; i is the resubstitution error of the node:

[5 bad, 5 good]  i = 1/2
├── income > 36,000: [0 bad, 3 good]  i = 0
└── income ≤ 36,000: [5 bad, 2 good]  i = 2/7
    ├── age > 37: [1 bad, 2 good]  i = 1/3
    │   ├── married: [0 bad, 2 good]  i = 0
    │   └── not married: [1 bad, 0 good]  i = 0
    └── age ≤ 37: [4 bad, 0 good]  i = 0


Graph of resubstitution error for two-class case

[Figure: resubstitution error 1 − max(p(0), 1 − p(0)) plotted against p(0): it increases linearly from 0 at p(0) = 0 to its maximum of 0.5 at p(0) = 0.5, then decreases linearly back to 0 at p(0) = 1.]


Resubstitution error

Questions:

Does resubstitution error meet the sensible requirements?

What is the impurity reduction of the second split in the credit scoring tree if we use resubstitution error as the impurity measure?


Impurity Reduction

Class counts are [bad, good]; i is the resubstitution error of the node:

[5 bad, 5 good]  i = 1/2
├── income > 36,000: [0 bad, 3 good]  i = 0
└── income ≤ 36,000: [5 bad, 2 good]  i = 2/7
    ├── age > 37: [1 bad, 2 good]  i = 1/3
    │   ├── married: [0 bad, 2 good]  i = 0
    │   └── not married: [1 bad, 0 good]  i = 0
    └── age ≤ 37: [4 bad, 0 good]  i = 0

Impurity reduction of the second split (using resubstitution error):

∆i(s, t) = i(t) − {π(ℓ) i(ℓ) + π(r) i(r)}
         = 2/7 − (3/7 × 1/3 + 4/7 × 0)
         = 2/7 − 1/7
         = 1/7
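This arithmetic is easy to verify exactly with Python's standard fractions module:

```python
from fractions import Fraction

i_parent = Fraction(2, 7)   # resubstitution error of the (5 bad, 2 good) node
i_left   = Fraction(1, 3)   # (1 bad, 2 good): age > 37
i_right  = Fraction(0)      # (4 bad, 0 good): age <= 37

delta = i_parent - (Fraction(3, 7) * i_left + Fraction(4, 7) * i_right)
print(delta)                # 1/7
```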


Which split is better?

Both splits start from a parent node with class counts (400, 400):

s1:  (400, 400) → (300, 100) and (100, 300)

s2:  (400, 400) → (200, 400) and (200, 0)

These splits have the same resubstitution error, but s2 is commonly preferred because it creates a leaf node.


Class of suitable impurity functions

Problem: resubstitution error only decreases at a constant rate as the node becomes purer.

We need an impurity measure that gives greater rewards to purer nodes: impurity should decrease at an increasing rate as the node becomes purer.

Hence, impurity should be a strictly concave function of p(0).

We define the class F of impurity functions (for two-class problems) that have this property:

1. φ(0) = φ(1) = 0 (minimum at p(0) = 0 and p(0) = 1)

2. φ(p(0)) = φ(1 − p(0)) (symmetric)

3. φ′′(p(0)) < 0 for 0 < p(0) < 1 (strictly concave)


Impurity function: Gini index

For the two-class case the Gini index is

i(t) = p(0|t) p(1|t) = p(0|t)(1 − p(0|t))

Question 1: Check that the Gini index belongs to F.

Question 2: Check that if we use the Gini index, split s2 is indeed preferred (see the sketch below).

Note: The variance of a Bernoulli random variable with probability of success p is p(1 − p). Hence we are attempting to minimize the variance of the class distribution.
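A quick numerical check of Question 2 (a sketch; the helper below recomputes split quality for children given as (class 0, class 1) counts):

```python
def gini(p0):
    """Two-class Gini index as a function of p(0)."""
    return p0 * (1.0 - p0)

def split_quality(left, right):
    """Gini impurity reduction for children with (class 0, class 1) counts."""
    (l0, l1), (r0, r1) = left, right
    n_l, n_r = l0 + l1, r0 + r1
    n = n_l + n_r
    parent = gini((l0 + r0) / n)
    return parent - (n_l / n) * gini(l0 / n_l) - (n_r / n) * gini(r0 / n_r)

print(split_quality((300, 100), (100, 300)))   # s1: 0.0625
print(split_quality((200, 400), (200, 0)))     # s2: 0.0833..., so s2 is preferred
```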


Gini index: credit scoring tree

Class counts are [bad, good]; i is the Gini index of the node:

[5 bad, 5 good]  i = 1/4
├── income > 36,000: [0 bad, 3 good]  i = 0
└── income ≤ 36,000: [5 bad, 2 good]  i = 10/49
    ├── age > 37: [1 bad, 2 good]  i = 2/9
    │   ├── married: [0 bad, 2 good]  i = 0
    │   └── not married: [1 bad, 0 good]  i = 0
    └── age ≤ 37: [4 bad, 0 good]  i = 0


Can impurity increase?

[Figure: a concave function g. For any x and y, the line segment connecting g(x) and g(y) lies below the graph of g: at z = αx + (1 − α)y, the chord height αg(x) + (1 − α)g(y) is at most g(z).]


Can impurity increase?

Is it possible that a split makes things worse, i.e. ∆i(s, t) < 0?

Not if φ ∈ F. Because φ is a concave function, we have

φ(p(0|ℓ)π(ℓ) + p(0|r)π(r)) ≥ π(ℓ)φ(p(0|ℓ)) + π(r)φ(p(0|r))

Since

p(0|t) = p(0|ℓ)π(ℓ) + p(0|r)π(r)

it follows that

φ(p(0|t)) ≥ π(ℓ)φ(p(0|ℓ)) + π(r)φ(p(0|r)),

that is, ∆i(s, t) ≥ 0: a split never increases the average impurity.


Can impurity increase? Not if φ is concave.

[Figure: the same chord construction applied to a split: p(0|t) = π(ℓ)p(0|ℓ) + π(r)p(0|r) lies between p(0|ℓ) and p(0|r), and the chord height π(ℓ)φ(p(0|ℓ)) + π(r)φ(p(0|r)) at that point is at most φ(p(0|t)).]

Split s1 and s2 with resubstitution error

[Figure: splits s1 and s2 evaluated graphically with resubstitution error.]

Split s1 and s2 with Gini

[Figure: splits s1 and s2 evaluated graphically with the Gini index. With resubstitution error both splits achieve the same impurity reduction; with the Gini index s2 achieves more.]

Impurity function: Entropy

For the two-class case the entropy is

i(t) = −p(0|t) log p(0|t) − p(1|t) log p(1|t)

Question: Check that entropy impurity belongs to F.

Remark: this is the average amount of information generated by drawing (with replacement) an example at random from this node, and observing its class.


Three (rescaled) impurity measures

[Figure: entropy (solid), Gini (dot-dash) and resubstitution (dash) impurity, rescaled to a common maximum of 1, plotted as functions of p(0).]
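The figure is easy to reproduce with matplotlib (a sketch; each curve is rescaled so that it peaks at 1, as in the slide):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 999)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # maximum 1 at p = 0.5
gini = p * (1 - p)                                      # maximum 0.25
resub = 1 - np.maximum(p, 1 - p)                        # maximum 0.5

plt.plot(p, entropy, '-', label='entropy')
plt.plot(p, gini / 0.25, '-.', label='Gini (rescaled)')
plt.plot(p, resub / 0.5, '--', label='resubstitution (rescaled)')
plt.xlabel('p(0)')
plt.legend()
plt.show()
```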


The set of splits considered

1. Each split depends on the value of only a single attribute.

2. If attribute x is numeric, we consider all splits of type x ≤ c where c is halfway between two consecutive values of x in their sorted order.

3. If attribute x is categorical, taking values in {b1, b2, …, bL}, we consider all splits of type x ∈ S, where S is any non-empty proper subset of {b1, b2, …, bL} (both enumerations are sketched in code below).
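A sketch of both enumerations in Python (function names are mine):

```python
import numpy as np
from itertools import combinations

def numeric_split_points(x):
    """Candidate thresholds c: halfway between consecutive distinct values."""
    v = np.sort(np.unique(x))
    return (v[:-1] + v[1:]) / 2.0

def categorical_splits(values):
    """The 2^(L-1) - 1 distinct subset splits x in S."""
    values = sorted(set(values))
    first, rest = values[0], values[1:]
    # Fixing one value in S avoids listing a subset and its complement as
    # two different splits; stopping before k = len(rest) keeps S a proper
    # subset.
    for k in range(len(rest)):
        for combo in combinations(rest, k):
            yield {first, *combo}

# Income (x 1000) of the loan data: candidate thresholds include 36.
print(numeric_split_points([28, 32, 24, 27, 32, 30, 58, 52, 40, 28]))
print(list(categorical_splits(['a', 'b', 'c', 'd'])))   # 7 splits
```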


Splits on numeric attributes

There is only a finite number of distinct splits, because there are at most n distinct values of a numeric attribute in the training sample (where n is the number of examples in the training sample).

Example: possible splits on income in the root for the loan data

Income  Class  Quality (split after this value)
24      B      0.25 − [0.1(1)(0) + 0.9(4/9)(5/9)] = 0.03
27      B      0.25 − [0.2(1)(0) + 0.8(3/8)(5/8)] = 0.06
28      B,G    0.25 − [0.4(3/4)(1/4) + 0.6(2/6)(4/6)] = 0.04
30      G      0.25 − [0.5(3/5)(2/5) + 0.5(2/5)(3/5)] = 0.01
32      B,B    0.25 − [0.7(5/7)(2/7) + 0.3(0)(1)] = 0.11
40      G      0.25 − [0.8(5/8)(3/8) + 0.2(0)(1)] = 0.06
52      G      0.25 − [0.9(5/9)(4/9) + 0.1(0)(1)] = 0.03
58      G

(Quality is the Gini impurity reduction; the root, with 5 bad and 5 good cases, has impurity (1/2)(1/2) = 0.25.)
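The quality column can be reproduced mechanically; a sketch over the income values sorted together with their class labels:

```python
pairs = [(24, 'B'), (27, 'B'), (28, 'B'), (28, 'G'), (30, 'G'),
         (32, 'B'), (32, 'B'), (40, 'G'), (52, 'G'), (58, 'G')]
n = len(pairs)
n_bad = sum(1 for _, c in pairs if c == 'B')

def gini(bad, total):
    p = bad / total
    return p * (1 - p)

for i in range(1, n):                  # left part = first i cases
    if pairs[i - 1][0] == pairs[i][0]:
        continue                       # cannot split between equal values
    bad_left = sum(1 for _, c in pairs[:i] if c == 'B')
    quality = (gini(n_bad, n)
               - (i / n) * gini(bad_left, i)
               - ((n - i) / n) * gini(n_bad - bad_left, n - i))
    print(f'split after income {pairs[i - 1][0]}: {quality:.2f}')
```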


Splits on a categorical attribute

For a categorical attribute with L distinct values there are 2^(L−1) − 1 distinct splits to consider. Why?

There are 2^L − 2 non-empty proper subsets of {b1, b2, …, bL}.

But a subset and the complement of that subset result in the same split, so we should divide this number by 2.


Splitting on categorical attributes: shortcut

For two-class problems and φ ∈ F, we don't have to check all 2^(L−1) − 1 possible splits. Sort the p(0|x = bℓ), that is,

p(0|x = bℓ1) ≤ p(0|x = bℓ2) ≤ … ≤ p(0|x = bℓL)

Then one of the L − 1 subsets

{bℓ1, …, bℓh},  h = 1, …, L − 1,

is the optimal split. Thus the search is reduced from computing 2^(L−1) − 1 splits to computing only L − 1 splits.


Splitting on categorical attributes: example

Let x be a categorical attribute with possible values a, b, c, d. Suppose

p(0|x = a) = 0.6,  p(0|x = b) = 0.4,  p(0|x = c) = 0.2,  p(0|x = d) = 0.8

Sort the values of x according to the probability of class 0:

c  b  a  d

We only have to consider the splits {c}, {c, b}, and {c, b, a}.

Intuition: put values with a low probability of class 0 in one group, and values with a high probability of class 0 in the other.
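In code, the shortcut amounts to a single sort (a sketch):

```python
p0 = {'a': 0.6, 'b': 0.4, 'c': 0.2, 'd': 0.8}   # p(0 | x = value)
order = sorted(p0, key=p0.get)                   # ['c', 'b', 'a', 'd']
candidates = [set(order[:h]) for h in range(1, len(order))]
print(candidates)   # [{'c'}, {'c', 'b'}, {'c', 'b', 'a'}]
```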


Splitting on numerical attributes: shortcut

Income  Class  Quality (split after this value)
24      B      0.25 − [0.1(1)(0) + 0.9(4/9)(5/9)] = 0.03
27      B      0.25 − [0.2(1)(0) + 0.8(3/8)(5/8)] = 0.06
28      B,G    0.25 − [0.4(3/4)(1/4) + 0.6(2/6)(4/6)] = 0.04
30      G      0.25 − [0.5(3/5)(2/5) + 0.5(2/5)(3/5)] = 0.01
32      B,B    0.25 − [0.7(5/7)(2/7) + 0.3(0)(1)] = 0.11
40      G      0.25 − [0.8(5/8)(3/8) + 0.2(0)(1)] = 0.06
52      G      0.25 − [0.9(5/9)(4/9) + 0.1(0)(1)] = 0.03
58      G

The optimal split can only occur between consecutive values with different class distributions.


Splitting on numerical attributes

Income  Class  Quality (split after this value)
24      B
27      B      0.25 − [0.2(1)(0) + 0.8(3/8)(5/8)] = 0.06
28      B,G    0.25 − [0.4(3/4)(1/4) + 0.6(2/6)(4/6)] = 0.04
30      G      0.25 − [0.5(3/5)(2/5) + 0.5(2/5)(3/5)] = 0.01
32      B,B    0.25 − [0.7(5/7)(2/7) + 0.3(0)(1)] = 0.11
40      G
52      G
58      G

The optimal split can only occur between consecutive values with different class distributions; the remaining candidate splits can be skipped.


Segment borders: numeric example

A segment is a block of consecutive values of the split attribute for which the class distribution is identical. Optimal splits can only occur at segment borders.

Consider the following data on numeric attribute x and class label y. The class label can take on two different values, coded as A and B.

x  8  8  12  12  14  16  16  18  20  20
y  A  B  A   B   A   A   A   A   A   B

The class probabilities (relative frequencies) are:

x     8    12   14   16   18   20
P(A)  0.5  0.5  1    1    1    0.5
P(B)  0.5  0.5  0    0    0    0.5

So we obtain the segments (8, 12), (14, 16, 18) and (20).
Only consider the splits x ≤ 13 and x ≤ 19.
Ignore x ≤ 10, x ≤ 15 and x ≤ 17.
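A sketch of how segment borders could be found in Python (variable names are mine):

```python
from collections import Counter

x = [8, 8, 12, 12, 14, 16, 16, 18, 20, 20]
y = ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'A', 'B']

# Class distribution for each distinct value of x.
dist = {}
for xi, yi in zip(x, y):
    dist.setdefault(xi, Counter())[yi] += 1

def proportions(counts):
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

values = sorted(dist)
borders = [(lo + hi) / 2
           for lo, hi in zip(values, values[1:])
           if proportions(dist[lo]) != proportions(dist[hi])]
print(borders)   # [13.0, 19.0]
```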


Caveat

1. In the first practical assignment we use the parameters nmin and minleaf to stop tree growing early.

2. A split is not allowed to produce a child node with fewer than minleaf observations.

3. The segment borders algorithm doesn't combine very well with the minleaf constraint.

4. Better to use the "brute force" approach in the assignment.


Basic Tree Construction Algorithm (control flow)

Construct tree:

    nodelist ← {{training data}}
    repeat
        current node ← select node from nodelist
        nodelist ← nodelist − current node
        if impurity(current node) > 0 then
            S ← set of candidate splits in current node
            s* ← arg max_{s ∈ S} impurity reduction(s, current node)
            child nodes ← apply(s*, current node)
            nodelist ← nodelist ∪ child nodes
        fi
    until nodelist = ∅
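For concreteness, a minimal recursive Python version of this algorithm (a sketch only: Gini impurity, numeric x ≤ c splits, and none of the assignment's nmin/minleaf stopping rules):

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of 0/1 class labels."""
    p = y.mean()
    return p * (1 - p)

def best_split(X, y):
    """Exhaustive search over all 'x <= c' splits on all attributes."""
    best, parent = None, gini(y)
    for j in range(X.shape[1]):
        v = np.sort(np.unique(X[:, j]))
        for c in (v[:-1] + v[1:]) / 2:   # midpoints between distinct values
            mask = X[:, j] <= c
            delta = (parent - mask.mean() * gini(y[mask])
                            - (~mask).mean() * gini(y[~mask]))
            if best is None or delta > best[0]:
                best = (delta, j, c)
    return best

def grow(X, y):
    if gini(y) == 0 or len(np.unique(X, axis=0)) == 1:   # pure or unsplittable
        return {'class': int(y.mean() >= 0.5)}
    _, j, c = best_split(X, y)
    mask = X[:, j] <= c
    return {'attr': j, 'threshold': c,
            'left': grow(X[mask], y[mask]),
            'right': grow(X[~mask], y[~mask])}

# Loan data: columns are age, married? (1 = yes), income; good = 1, bad = 0.
X = np.array([[22, 0, 28000], [46, 0, 32000], [24, 1, 24000], [25, 0, 27000],
              [29, 1, 32000], [45, 1, 30000], [63, 1, 58000], [36, 1, 52000],
              [23, 0, 40000], [50, 1, 28000]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
print(grow(X, y))   # splits on income <= 36000, then age <= 37, then married
```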


Practical Assignment 1

Teams of 3 students.

Program a classification tree algorithm from scratch.

In Python or R.

Use of libraries for general data processing is allowed (e.g. NumPy in Python, dplyr in R).

http://www.cs.uu.nl/docs/vakken/mdm/practicum.html

Start with “Getting started with ...”.
