Data Mining 2021 Classification Trees (1)


Ad Feelders

Universiteit Utrecht


Classification

Predict the class of an object on the basis of some of its attributes. For example, predict:

Good/bad credit for loan applicants, using
  income
  age
  ...

Spam/no spam for e-mail messages, using
  % of words matching a given word (e.g. "free")
  use of CAPITAL LETTERS
  ...

Music Genre (Rock, Techno, Death Metal, ...) based on audio features and lyrics.


Building a classification model

The basic idea is to build a classification model using a set of training examples. Each training example contains attribute values and the corresponding class label.

There are many techniques to do that:

Statistical Techniques

Discriminant Analysis
Logistic Regression

Data Mining/Machine Learning

Classification Trees
Bayesian Network Classifiers
Neural Networks
Support Vector Machines
...


Strong and Weak Points of Classification Trees

Strong points:

Are easy to interpret (if not too large).

Select relevant attributes automatically.

Can handle both numeric and categorical attributes.

Weak point:

Single trees are usually not among the top performers.

However:

Averaging multiple trees (bagging, random forests) can bring them back to the top!

But ease of interpretation suffers as a consequence.


Example: Loan Data

Record  age  married?  own house  income  gender  class
     1   22  no        no         28,000  male    bad
     2   46  no        yes        32,000  female  bad
     3   24  yes       yes        24,000  male    bad
     4   25  no        no         27,000  male    bad
     5   29  yes       yes        32,000  female  bad
     6   45  yes       yes        30,000  female  good
     7   63  yes       yes        58,000  male    good
     8   36  yes       no         52,000  male    good
     9   23  no        yes        40,000  female  good
    10   50  yes       yes        28,000  female  good


Credit Scoring Tree

Each node shows its class counts [bad, good] and the records it contains:

[5 bad, 5 good]  records 1…10
├── income > 36,000: [0 bad, 3 good]  records 7,8,9
└── income ≤ 36,000: [5 bad, 2 good]  records 1…6,10
    ├── age > 37: [1 bad, 2 good]  records 2,6,10
    │   ├── married: [0 bad, 2 good]  records 6,10
    │   └── not married: [1 bad, 0 good]  record 2
    └── age ≤ 37: [4 bad, 0 good]  records 1,3,4,5


Cases with income > 36,000

Record  age  married?  own house  income  gender  class
     1   22  no        no         28,000  male    bad
     2   46  no        yes        32,000  female  bad
     3   24  yes       yes        24,000  male    bad
     4   25  no        no         27,000  male    bad
     5   29  yes       yes        32,000  female  bad
     6   45  yes       yes        30,000  female  good
     7   63  yes       yes        58,000  male    good
     8   36  yes       no         52,000  male    good
     9   23  no        yes        40,000  female  good
    10   50  yes       yes        28,000  female  good

Records 7, 8 and 9 are the cases with income > 36,000; all three are good.


Partitioning the attribute space

[Figure: the ten loan applicants plotted in the (age, income) plane, income in thousands, with five cases labelled bad and five labelled good. The lines income = 36 and age = 37 partition the plane into rectangular regions: Good for income > 36, Good for income ≤ 36 and age > 37, and Bad for income ≤ 36 and age ≤ 37.]


Why not split on gender in top node?

Each node shows its class counts [bad, good] and the records it contains:

[5 bad, 5 good]  records 1…10
├── gender = male: [3 bad, 2 good]  records 1,3,4,7,8
└── gender = female: [2 bad, 3 good]  records 2,5,6,9,10

Intuitively: learning the value of gender doesn't provide much information about the class label.


Impurity of a node

We strive towards nodes that are pure in the sense that they only contain observations of a single class.

We need a measure that indicates "how far" a node is removed from this ideal.

We call such a measure an impurity measure.


Impurity function

The impurity i(t) of a node t is a function of the relative frequencies of the classes in that node:

i(t) = φ(p1, p2, …, pJ)

where the pj (j = 1, …, J) are the relative frequencies of the J different classes in node t.

Sensible requirements of any quantification of impurity:

1. It should be at a maximum when the observations are distributed evenly over all classes.

2. It should be at a minimum when all observations belong to a single class.

3. It should be a symmetric function of p1, …, pJ.


Quality of a split (test)

We define the quality of a binary split s in node t as the reduction of impurity that it achieves:

∆i(s, t) = i(t) − {π(ℓ) i(ℓ) + π(r) i(r)}

where ℓ is the left child of t, r is the right child of t, π(ℓ) is the proportion of cases sent to the left, and π(r) the proportion of cases sent to the right.

[Figure: node t, with impurity i(t), splits into a left child ℓ receiving proportion π(ℓ) of the cases (impurity i(ℓ)) and a right child r receiving proportion π(r) (impurity i(r)).]
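As a sketch of how this computation could look in code (the function and argument names below are my own, not from the course materials): given the class counts in the two children and an impurity function φ, the parent counts and the proportions π(ℓ) and π(r) follow directly.

```python
import numpy as np

def impurity_reduction(counts_left, counts_right, phi):
    """Quality of a split: i(t) - { pi(l)*i(l) + pi(r)*i(r) }.

    counts_left, counts_right: class counts in the left and right child.
    phi: impurity function applied to a vector of class proportions.
    """
    left = np.asarray(counts_left, dtype=float)
    right = np.asarray(counts_right, dtype=float)
    n_left, n_right = left.sum(), right.sum()
    n = n_left + n_right
    i_parent = phi((left + right) / n)                 # i(t)
    i_children = (n_left / n) * phi(left / n_left) \
               + (n_right / n) * phi(right / n_right)  # weighted child impurity
    return i_parent - i_children
```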


Well known impurity functions

Impurity functions we consider:

Resubstitution error

Gini index (CART, rpart)

Entropy (C4.5, rpart)


Resubstitution error

Measures the fraction of cases that is classified incorrectly if we assign every case in node t to the majority class in that node. That is,

i(t) = 1 − max_j p(j|t)

where p(j|t) is the relative frequency of class j in node t.
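In code this is a one-liner; it plugs directly into the impurity_reduction sketch above (again, the names are mine):

```python
import numpy as np

def resubstitution_error(p):
    """i(t) = 1 - max_j p(j|t), for a vector p of class proportions."""
    return 1.0 - np.max(p)

# For example, a node with 5 bad and 2 good cases:
print(resubstitution_error(np.array([5 / 7, 2 / 7])))   # 0.2857... = 2/7
```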


Resubstitution error: credit scoring tree

Class counts are [bad, good]; i is the resubstitution error of the node:

[5 bad, 5 good]  i = 1/2
├── income > 36,000: [0 bad, 3 good]  i = 0
└── income ≤ 36,000: [5 bad, 2 good]  i = 2/7
    ├── age > 37: [1 bad, 2 good]  i = 1/3
    │   ├── married: [0 bad, 2 good]  i = 0
    │   └── not married: [1 bad, 0 good]  i = 0
    └── age ≤ 37: [4 bad, 0 good]  i = 0


Graph of resubstitution error for two-class case

[Figure: resubstitution error 1 − max(p(0), 1 − p(0)) plotted against p(0): it increases linearly from 0 at p(0) = 0 to its maximum of 0.5 at p(0) = 0.5, then decreases linearly back to 0 at p(0) = 1.]


Resubstitution error

Questions:

Does resubstitution error meet the sensible requirements?

What is the impurity reduction of the second split in the credit scoring tree if we use resubstitution error as the impurity measure?


Impurity Reduction

Class counts are [bad, good]; i is the resubstitution error of the node:

[5 bad, 5 good]  i = 1/2
├── income > 36,000: [0 bad, 3 good]  i = 0
└── income ≤ 36,000: [5 bad, 2 good]  i = 2/7
    ├── age > 37: [1 bad, 2 good]  i = 1/3
    │   ├── married: [0 bad, 2 good]  i = 0
    │   └── not married: [1 bad, 0 good]  i = 0
    └── age ≤ 37: [4 bad, 0 good]  i = 0

Impurity reduction of the second split (using resubstitution error):

∆i(s, t) = i(t) − {π(ℓ) i(ℓ) + π(r) i(r)}
         = 2/7 − (3/7 × 1/3 + 4/7 × 0)
         = 2/7 − 1/7
         = 1/7
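This arithmetic is easy to verify exactly with Python's standard fractions module:

```python
from fractions import Fraction

i_parent = Fraction(2, 7)   # resubstitution error of the (5 bad, 2 good) node
i_left   = Fraction(1, 3)   # (1 bad, 2 good): age > 37
i_right  = Fraction(0)      # (4 bad, 0 good): age <= 37

delta = i_parent - (Fraction(3, 7) * i_left + Fraction(4, 7) * i_right)
print(delta)                # 1/7
```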


Which split is better?

Both splits start from a parent node with class counts (400, 400):

s1:  (400, 400) → (300, 100) and (100, 300)

s2:  (400, 400) → (200, 400) and (200, 0)

These splits have the same resubstitution error, but s2 is commonly preferred because it creates a leaf node.


Class of suitable impurity functions

Problem: resubstitution error only decreases at a constant rate as the node becomes purer.

We need an impurity measure that gives greater rewards to purer nodes: impurity should decrease at an increasing rate as the node becomes purer.

Hence, impurity should be a strictly concave function of p(0).

We define the class F of impurity functions (for two-class problems) that have this property:

1. φ(0) = φ(1) = 0 (minimum at p(0) = 0 and p(0) = 1)

2. φ(p(0)) = φ(1 − p(0)) (symmetric)

3. φ′′(p(0)) < 0 for 0 < p(0) < 1 (strictly concave)


Impurity function: Gini index

For the two-class case the Gini index is

i(t) = p(0|t) p(1|t) = p(0|t)(1 − p(0|t))

Question 1: Check that the Gini index belongs to F.

Question 2: Check that if we use the Gini index, split s2 is indeed preferred (see the sketch below).

Note: The variance of a Bernoulli random variable with probability of success p is p(1 − p). Hence we are attempting to minimize the variance of the class distribution.
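A quick numerical check of Question 2 (a sketch; the helper below recomputes split quality for children given as (class 0, class 1) counts):

```python
def gini(p0):
    """Two-class Gini index as a function of p(0)."""
    return p0 * (1.0 - p0)

def split_quality(left, right):
    """Gini impurity reduction for children with (class 0, class 1) counts."""
    (l0, l1), (r0, r1) = left, right
    n_l, n_r = l0 + l1, r0 + r1
    n = n_l + n_r
    parent = gini((l0 + r0) / n)
    return parent - (n_l / n) * gini(l0 / n_l) - (n_r / n) * gini(r0 / n_r)

print(split_quality((300, 100), (100, 300)))   # s1: 0.0625
print(split_quality((200, 400), (200, 0)))     # s2: 0.0833..., so s2 is preferred
```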


Gini index: credit scoring tree

Class counts are [bad, good]; i is the Gini index of the node:

[5 bad, 5 good]  i = 1/4
├── income > 36,000: [0 bad, 3 good]  i = 0
└── income ≤ 36,000: [5 bad, 2 good]  i = 10/49
    ├── age > 37: [1 bad, 2 good]  i = 2/9
    │   ├── married: [0 bad, 2 good]  i = 0
    │   └── not married: [1 bad, 0 good]  i = 0
    └── age ≤ 37: [4 bad, 0 good]  i = 0


Can impurity increase?

[Figure: a concave function g. For any x and y, the line segment connecting g(x) and g(y) lies below the graph of g: at z = αx + (1 − α)y, the chord height αg(x) + (1 − α)g(y) is at most g(z).]


Can impurity increase?

Is it possible that a split makes things worse, i.e. ∆i(s, t) < 0?

Not if φ ∈ F. Because φ is a concave function, we have

φ(p(0|ℓ)π(ℓ) + p(0|r)π(r)) ≥ π(ℓ)φ(p(0|ℓ)) + π(r)φ(p(0|r))

Since

p(0|t) = p(0|ℓ)π(ℓ) + p(0|r)π(r)

it follows that

φ(p(0|t)) ≥ π(ℓ)φ(p(0|ℓ)) + π(r)φ(p(0|r)),

that is, ∆i(s, t) ≥ 0: a split never increases the average impurity.


Can impurity increase? Not if φ is concave.

[Figure: the same chord construction applied to a split: p(0|t) = π(ℓ)p(0|ℓ) + π(r)p(0|r) lies between p(0|ℓ) and p(0|r), and the chord height π(ℓ)φ(p(0|ℓ)) + π(r)φ(p(0|r)) at that point is at most φ(p(0|t)).]

Split s1 and s2 with resubstitution error

[Figure: splits s1 and s2 evaluated graphically with resubstitution error.]

Split s1 and s2 with Gini

[Figure: splits s1 and s2 evaluated graphically with the Gini index. With resubstitution error both splits achieve the same impurity reduction; with the Gini index s2 achieves more.]

Impurity function: Entropy

For the two-class case the entropy is

i(t) = −p(0|t) log p(0|t) − p(1|t) log p(1|t)

Question: Check that entropy impurity belongs to F.

Remark: this is the average amount of information generated by drawing (with replacement) an example at random from this node, and observing its class.


Three (rescaled) impurity measures

[Figure: entropy (solid), Gini (dot-dash) and resubstitution (dash) impurity, rescaled to a common maximum of 1, plotted as functions of p(0).]
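The figure is easy to reproduce with matplotlib (a sketch; each curve is rescaled so that it peaks at 1, as in the slide):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 999)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # maximum 1 at p = 0.5
gini = p * (1 - p)                                      # maximum 0.25
resub = 1 - np.maximum(p, 1 - p)                        # maximum 0.5

plt.plot(p, entropy, '-', label='entropy')
plt.plot(p, gini / 0.25, '-.', label='Gini (rescaled)')
plt.plot(p, resub / 0.5, '--', label='resubstitution (rescaled)')
plt.xlabel('p(0)')
plt.legend()
plt.show()
```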


The set of splits considered

1. Each split depends on the value of only a single attribute.

2. If attribute x is numeric, we consider all splits of type x ≤ c where c is halfway between two consecutive values of x in their sorted order.

3. If attribute x is categorical, taking values in {b1, b2, …, bL}, we consider all splits of type x ∈ S, where S is any non-empty proper subset of {b1, b2, …, bL} (both enumerations are sketched in code below).
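A sketch of both enumerations in Python (function names are mine):

```python
import numpy as np
from itertools import combinations

def numeric_split_points(x):
    """Candidate thresholds c: halfway between consecutive distinct values."""
    v = np.sort(np.unique(x))
    return (v[:-1] + v[1:]) / 2.0

def categorical_splits(values):
    """The 2^(L-1) - 1 distinct subset splits x in S."""
    values = sorted(set(values))
    first, rest = values[0], values[1:]
    # Fixing one value in S avoids listing a subset and its complement as
    # two different splits; stopping before k = len(rest) keeps S a proper
    # subset.
    for k in range(len(rest)):
        for combo in combinations(rest, k):
            yield {first, *combo}

# Income (x 1000) of the loan data: candidate thresholds include 36.
print(numeric_split_points([28, 32, 24, 27, 32, 30, 58, 52, 40, 28]))
print(list(categorical_splits(['a', 'b', 'c', 'd'])))   # 7 splits
```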


Splits on numeric attributes

There is only a finite number of distinct splits, because there are at most n distinct values of a numeric attribute in the training sample (where n is the number of examples in the training sample).

Example: possible splits on income in the root for the loan data

Income  Class  Quality (split after this value)
24      B      0.25 − [0.1(1)(0) + 0.9(4/9)(5/9)] = 0.03
27      B      0.25 − [0.2(1)(0) + 0.8(3/8)(5/8)] = 0.06
28      B,G    0.25 − [0.4(3/4)(1/4) + 0.6(2/6)(4/6)] = 0.04
30      G      0.25 − [0.5(3/5)(2/5) + 0.5(2/5)(3/5)] = 0.01
32      B,B    0.25 − [0.7(5/7)(2/7) + 0.3(0)(1)] = 0.11
40      G      0.25 − [0.8(5/8)(3/8) + 0.2(0)(1)] = 0.06
52      G      0.25 − [0.9(5/9)(4/9) + 0.1(0)(1)] = 0.03
58      G

(Quality is the Gini impurity reduction; the root, with 5 bad and 5 good cases, has impurity (1/2)(1/2) = 0.25.)
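The quality column can be reproduced mechanically; a sketch over the income values sorted together with their class labels:

```python
pairs = [(24, 'B'), (27, 'B'), (28, 'B'), (28, 'G'), (30, 'G'),
         (32, 'B'), (32, 'B'), (40, 'G'), (52, 'G'), (58, 'G')]
n = len(pairs)
n_bad = sum(1 for _, c in pairs if c == 'B')

def gini(bad, total):
    p = bad / total
    return p * (1 - p)

for i in range(1, n):                  # left part = first i cases
    if pairs[i - 1][0] == pairs[i][0]:
        continue                       # cannot split between equal values
    bad_left = sum(1 for _, c in pairs[:i] if c == 'B')
    quality = (gini(n_bad, n)
               - (i / n) * gini(bad_left, i)
               - ((n - i) / n) * gini(n_bad - bad_left, n - i))
    print(f'split after income {pairs[i - 1][0]}: {quality:.2f}')
```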


Splits on a categorical attribute

For a categorical attribute with L distinct values there are 2^(L−1) − 1 distinct splits to consider. Why?

There are 2^L − 2 non-empty proper subsets of {b1, b2, …, bL}.

But a subset and the complement of that subset result in the same split, so we should divide this number by 2.


Splitting on categorical attributes: shortcut

For two-class problems and φ ∈ F, we don't have to check all 2^(L−1) − 1 possible splits. Sort the p(0|x = bℓ), that is,

p(0|x = bℓ1) ≤ p(0|x = bℓ2) ≤ … ≤ p(0|x = bℓL)

Then one of the L − 1 subsets

{bℓ1, …, bℓh},  h = 1, …, L − 1,

is the optimal split. Thus the search is reduced from computing 2^(L−1) − 1 splits to computing only L − 1 splits.


Splitting on categorical attributes: example

Let x be a categorical attribute with possible values a, b, c, d. Suppose

p(0|x = a) = 0.6,  p(0|x = b) = 0.4,  p(0|x = c) = 0.2,  p(0|x = d) = 0.8

Sort the values of x according to the probability of class 0:

c  b  a  d

We only have to consider the splits {c}, {c, b}, and {c, b, a}.

Intuition: put values with a low probability of class 0 in one group, and values with a high probability of class 0 in the other.
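In code, the shortcut amounts to a single sort (a sketch):

```python
p0 = {'a': 0.6, 'b': 0.4, 'c': 0.2, 'd': 0.8}   # p(0 | x = value)
order = sorted(p0, key=p0.get)                   # ['c', 'b', 'a', 'd']
candidates = [set(order[:h]) for h in range(1, len(order))]
print(candidates)   # [{'c'}, {'c', 'b'}, {'c', 'b', 'a'}]
```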


Splitting on numerical attributes: shortcut

Income  Class  Quality (split after this value)
24      B      0.25 − [0.1(1)(0) + 0.9(4/9)(5/9)] = 0.03
27      B      0.25 − [0.2(1)(0) + 0.8(3/8)(5/8)] = 0.06
28      B,G    0.25 − [0.4(3/4)(1/4) + 0.6(2/6)(4/6)] = 0.04
30      G      0.25 − [0.5(3/5)(2/5) + 0.5(2/5)(3/5)] = 0.01
32      B,B    0.25 − [0.7(5/7)(2/7) + 0.3(0)(1)] = 0.11
40      G      0.25 − [0.8(5/8)(3/8) + 0.2(0)(1)] = 0.06
52      G      0.25 − [0.9(5/9)(4/9) + 0.1(0)(1)] = 0.03
58      G

The optimal split can only occur between consecutive values with different class distributions.


Splitting on numerical attributes

Income  Class  Quality (split after this value)
24      B
27      B      0.25 − [0.2(1)(0) + 0.8(3/8)(5/8)] = 0.06
28      B,G    0.25 − [0.4(3/4)(1/4) + 0.6(2/6)(4/6)] = 0.04
30      G      0.25 − [0.5(3/5)(2/5) + 0.5(2/5)(3/5)] = 0.01
32      B,B    0.25 − [0.7(5/7)(2/7) + 0.3(0)(1)] = 0.11
40      G
52      G
58      G

The optimal split can only occur between consecutive values with different class distributions; the remaining candidate splits can be skipped.


Segment borders: numeric example

A segment is a block of consecutive values of the split attribute for which the class distribution is identical. Optimal splits can only occur at segment borders.

Consider the following data on numeric attribute x and class label y. The class label can take on two different values, coded as A and B.

x  8  8  12  12  14  16  16  18  20  20
y  A  B  A   B   A   A   A   A   A   B

The class probabilities (relative frequencies) are:

x     8    12   14   16   18   20
P(A)  0.5  0.5  1    1    1    0.5
P(B)  0.5  0.5  0    0    0    0.5

So we obtain the segments (8, 12), (14, 16, 18) and (20).
Only consider the splits x ≤ 13 and x ≤ 19.
Ignore x ≤ 10, x ≤ 15 and x ≤ 17.
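A sketch of how segment borders could be found in Python (variable names are mine):

```python
from collections import Counter

x = [8, 8, 12, 12, 14, 16, 16, 18, 20, 20]
y = ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'A', 'B']

# Class distribution for each distinct value of x.
dist = {}
for xi, yi in zip(x, y):
    dist.setdefault(xi, Counter())[yi] += 1

def proportions(counts):
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

values = sorted(dist)
borders = [(lo + hi) / 2
           for lo, hi in zip(values, values[1:])
           if proportions(dist[lo]) != proportions(dist[hi])]
print(borders)   # [13.0, 19.0]
```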


Caveat

1. In the first practical assignment we use the parameters nmin and minleaf to stop tree growing early.

2. A split is not allowed to produce a child node with fewer than minleaf observations.

3. The segment borders algorithm doesn't combine very well with the minleaf constraint.

4. Better to use the "brute force" approach in the assignment.


Basic Tree Construction Algorithm (control flow)

Construct tree:

    nodelist ← {{training data}}
    repeat
        current node ← select node from nodelist
        nodelist ← nodelist − current node
        if impurity(current node) > 0 then
            S ← set of candidate splits in current node
            s* ← arg max_{s ∈ S} impurity reduction(s, current node)
            child nodes ← apply(s*, current node)
            nodelist ← nodelist ∪ child nodes
        fi
    until nodelist = ∅
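For concreteness, a minimal recursive Python version of this algorithm (a sketch only: Gini impurity, numeric x ≤ c splits, and none of the assignment's nmin/minleaf stopping rules):

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of 0/1 class labels."""
    p = y.mean()
    return p * (1 - p)

def best_split(X, y):
    """Exhaustive search over all 'x <= c' splits on all attributes."""
    best, parent = None, gini(y)
    for j in range(X.shape[1]):
        v = np.sort(np.unique(X[:, j]))
        for c in (v[:-1] + v[1:]) / 2:   # midpoints between distinct values
            mask = X[:, j] <= c
            delta = (parent - mask.mean() * gini(y[mask])
                            - (~mask).mean() * gini(y[~mask]))
            if best is None or delta > best[0]:
                best = (delta, j, c)
    return best

def grow(X, y):
    if gini(y) == 0 or len(np.unique(X, axis=0)) == 1:   # pure or unsplittable
        return {'class': int(y.mean() >= 0.5)}
    _, j, c = best_split(X, y)
    mask = X[:, j] <= c
    return {'attr': j, 'threshold': c,
            'left': grow(X[mask], y[mask]),
            'right': grow(X[~mask], y[~mask])}

# Loan data: columns are age, married? (1 = yes), income; good = 1, bad = 0.
X = np.array([[22, 0, 28000], [46, 0, 32000], [24, 1, 24000], [25, 0, 27000],
              [29, 1, 32000], [45, 1, 30000], [63, 1, 58000], [36, 1, 52000],
              [23, 0, 40000], [50, 1, 28000]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
print(grow(X, y))   # splits on income <= 36000, then age <= 37, then married
```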


Practical Assignment 1

Teams of 3 students.

Program a classification tree algorithm from scratch.

In Python or R.

Use of libraries for general data processing is allowed (e.g. NumPy in Python, dplyr in R).

http://www.cs.uu.nl/docs/vakken/mdm/practicum.html

Start with “Getting started with ...”.
