Timu a11 Classification Decision Tree Induction
1
Knowledge discovery
Classification: Decision tree induction
Kati Iltanen, Computer Sciences, School of Information Sciences, University of Tampere
2
Classification
§ Aim: to predict the value of a qualitative attribute
§ class labels (the values of the target attribute, the class) are predicted
§ Every case belongs to one of the mutually exclusive classes. This class is known.
§ supervised learning
§ The classification method classifies the training data based on the attribute values and class labels.
§ constructs a model
3
Classification
§ The model is used for classifying new data.
§ The model is evaluated (e.g. accuracy and a subjective estimate).
§ If the model is acceptable, it is used to classify cases whose class labels are not known.
§ new data (unknown data, previously unseen data)
§ Classification methods
§ Decision trees, rules, the k-nearest-neighbour method, the naïve Bayesian classifier, neural networks, ...
§ Application examples:
§ To give a diagnosis suggestion on the basis of the symptoms and test results of a patient
§ To predict the paying capacity of a loan applicant
4
Constructing and testing a classifier

Training data (known class labels):
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learning algorithm

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
ELSE tenured = ‘no’

Test data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Predicted classes: no, yes, yes, yes
The model misclassifies the second test case (gives ‘yes’, the known class is ‘no’).
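The learned rule above can be sketched as a small Python function and checked against the test data (the function and variable names are illustrative):

```python
# A sketch of the slide's learned rule (names are illustrative).
def tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

predictions = [tenured(rank, years) for _, rank, years, _ in test_data]
print(predictions)  # ['no', 'yes', 'yes', 'yes']
# Merlisa (years = 7 > 6) is predicted 'yes' although her known class is 'no'.
```
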
5
Using the classifier

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
ELSE tenured = ‘no’

New data: (Jeff, Professor, 4) → Tenured? Yes
6
Decision tree induction
§ TDIDT (Top-Down Induction of Decision Trees)
§ Inductive learning: general knowledge from separate cases
§ Cases are described using fixed-length attribute vectors.
§ Each case belongs to one class.
§ Classes are mutually exclusive.
§ The class of a case is known: supervised learning
§ Knowledge is represented in the form of a decision tree.
§ A decision tree is a classification model.
§ The tree is constructed in a top-down manner (from the root to the leaves).
7
Decision tree induction

Training data: Saturday mornings
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Decision tree:
outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
:   windy = false: P

Classes: P = play tennis, N = don’t play tennis (Quinlan 86)
8
Decision tree
§ Inner nodes contain tests based on attributes (test nodes)
§ Branches correspond to the outcomes of the tests (attribute values)
§ Leaf nodes (leaves) contain the class information (one class or a class distribution)

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
:   windy = false: P
9
Decision tree
§ Classification of a new case starts from the root of the tree.
§ The attribute assigned to the root node is examined and the branch corresponding to the attribute value is followed.
§ This process continues until a leaf node is encountered.
§ The leaf predicts the class of the new case.

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
:   windy = false: P
10
Decision tree
§ The classification path from the root to a leaf gives an explanation for the decision.
§ The number of tested attributes depends on the classification path.
§ It is not necessary to test all the attributes in all the paths.
§ A classification path: a conjunction of constraints set on attributes
§ A decision tree: a disjunction of the classification paths
11
Building a decision tree
§ Building a decision tree is a two-step process
§ Tree construction
§ A complete (fully grown) tree is built based on the training data.
§ (pre-pruning: the growth of the tree is restricted)
§ Tree pruning
§ post-pruning: branches are pruned from a complete tree (or from a pre-pruned tree)
12
TDIDT: Basic algorithm
§ A decision tree is constructed in a top-down, recursive, divide-and-conquer manner.
§ In the beginning, all the training examples are at the root.
§ If the stopping criterion is fulfilled, a leaf node is formed.
§ If the stopping criterion is not fulfilled, the best attribute is selected according to some criterion (a greedy algorithm) and
§ a test node is formed
§ cases are divided into subsets according to the values of the chosen attribute
§ a decision tree is formed recursively for each subset
13
TDIDT: Basic algorithm
Generate a decision tree
(1) Create a node N
(2) if (stopping criterion is fulfilled)
(3)   Make a leaf node (node N)
(4) else
(5)   Choose the best attribute and make a
(6)   test node (node N) that tests the chosen attribute
(7)   Divide the cases into subsets according to the
(8)   values of the chosen attribute
(9)   Generate a decision tree for each subset
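Steps (1)-(9) can be sketched in Python. This is a minimal illustration, with the attribute selection deliberately left to a caller-supplied function since the selection criterion is discussed later (all names are illustrative):

```python
# A minimal TDIDT sketch following steps (1)-(9). Cases are dicts with a
# "class" key plus one key per attribute; `choose_attribute` is supplied
# by the caller (a greedy selection criterion, e.g. information gain).
def tdidt(cases, attributes, choose_attribute):
    classes = {c["class"] for c in cases}
    if len(classes) <= 1 or not attributes:        # stopping criterion
        # leaf node: the single class, or the majority class
        label = max(classes, default=None,
                    key=lambda cl: sum(c["class"] == cl for c in cases))
        return {"leaf": label}
    best = choose_attribute(cases, attributes)     # greedy choice
    node = {"test": best, "branches": {}}
    for value in {c[best] for c in cases}:         # divide into subsets
        subset = [c for c in cases if c[best] == value]
        rest = [a for a in attributes if a != best]
        node["branches"][value] = tdidt(subset, rest, choose_attribute)
    return node
```

With a trivial selection function (take the first attribute), `tdidt` already builds a tree; plugging in the information gain criterion defined on the later slides turns it into an ID3-style learner.
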
14
TDIDT: Key questions
§ How to select the best attribute?
§ How to specify the attribute test condition?
§ How to form inner nodes and branches?
§ When to stop the recursive splitting?
§ How to form decision nodes (leaves)?
§ How to prune a tree?
15
Attribute selection criterion
§ How to select the best attribute?
§ Adequacy of attributes
§ Attributes are adequate for the classification task if all the cases having the same attribute values belong to the same class.
§ If the attributes are adequate, it is always possible to construct a decision tree which correctly classifies all the training data.
§ Usually there exist several correctly classifying decision trees.
§ In the worst case, there is a leaf in the tree for each of the training cases.
16
Simple decision tree
A simple decision tree for the “Tennis playing” classification task
17
Complex decision tree
A complex decision tree for the same classification task
18
Attribute selection criterion
§ The aim is to generate simple (small) decision trees.
§ Derives from the principle called Occam’s razor:
§ If there are two models having the same accuracy on the training data, the smaller (simpler) one can be seen as more general and thus better.
§ Smaller trees: more general, easier to understand and possibly more accurate in classifying unseen cases
§ Try to generate simple trees by generating simple nodes.
§ The complexity of a node is
§ at its largest when the node has an equal number of cases from every class
§ at its smallest when the node has cases from one class only
§ Heuristic attribute selection measures (measures of goodness of split) are used. These aim to generate homogeneous (pure) child nodes (subsets).
19
TDIDT algorithm family
§ CLS (Concept Learning System)
§ E.B. Hunt (’50s and ’60s)
§ To simulate human problem-solving methods
§ Analysing the content of English texts, medical diagnostics
§ ID3 (Iterative Dichotomizer 3)
§ J.R. Quinlan (end of the ’70s)
§ Chess endgames
§ Applications from medical diagnostics to scouting
§ Other early decision tree algorithms
§ CART (Classification and Regression Trees) (84)
§ Assistant (84)
§ C4.5, C5, See5
§ descendants of ID3
§ Address issues arising in real-world classification tasks
§ C4.5 is one of the most widely used machine learning algorithms, frequently used as a reference algorithm in machine learning research
20
ID3
§ Assumes that
§ attributes are categorical and have a small number of possible values
§ the class (the target attribute) has two possible values
§ applicable to classification tasks with two classes
§ attributes are adequate
§ data contain no missing values
§ ID3 selects the best attribute according to a criterion called information gain
§ The criterion selects the attribute that maximises information gain (or minimises entropy)
21
ID3: Attribute selection criterion
§ Let
§ S be a training set that contains s cases (s is the number of cases)
§ the class attribute C have values C1, …, Cm (m is the number of classes)
§ In ID3, m = 2
§ si be the number of cases belonging to the class Ci in the training set S, and p(Ci) = si/s the relative frequency of the class Ci in S
22
ID3: Attribute selection criterion
§ The expected information needed to classify an arbitrary case in S (or the entropy of C in S) is

H(C) = − Σ_{i=1..m} p(Ci) log2 p(Ci)

§ 2-based logarithm, because the information is coded in bits
§ We define in this context that if p(Ci) = 0, then p(Ci) log2 p(Ci) = 0 (zero)
§ Recall: log_a x = k ⇔ a^k = x, where a ∈ R+ \ {1} and x ∈ R+
§ and the change of base: log_a l = log_k l / log_k a
23
ID3: Attribute selection criterion

C1: 0, C2: 6
p(C1) = 0/6 = 0, p(C2) = 6/6 = 1
H(C) = −0 log2 0 − 1 log2 1 = −0 − 0 = 0

C1: 1, C2: 5
p(C1) = 1/6, p(C2) = 5/6
H(C) = −(1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4
p(C1) = 2/6, p(C2) = 4/6
H(C) = −(2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92

C1: 3, C2: 3
p(C1) = 3/6, p(C2) = 3/6
H(C) = −(3/6) log2 (3/6) − (3/6) log2 (3/6) = 1

§ Maximum (= log2 m) when cases are equally distributed among the classes
§ m = number of classes
§ Minimum (= 0) when all cases belong to the same class
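The entropy values above can be reproduced with a few lines of Python (a sketch; the helper name is illustrative):

```python
from math import log2

# H(C) = -sum p(Ci) log2 p(Ci); the convention 0 * log2 0 = 0 is handled
# by skipping empty classes.
def entropy(counts):
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

# The four class distributions from the slide:
print(round(entropy([0, 6]), 2))  # 0.0  (minimum: all cases in one class)
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
print(round(entropy([3, 3]), 2))  # 1.0  (maximum log2 2 for two classes)
```
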
24
ID3: Attribute selection criterion
§ Let an attribute A have the values Aj, j = 1, …, v
§ Let the set S be divided into subsets {S1, S2, …, Sv} according to the values of the attribute A
§ The expected information needed to classify an arbitrary case in the branch corresponding to the value Aj is

H(C|Aj) = − Σ_{i=1..m} p(Ci|Aj) log2 p(Ci|Aj)

§ Consider only those cases having the value Aj for the attribute A and calculate p(Ci) in the set of these cases
25
ID3: Attribute selection criterion
§ The expected information needed to classify an arbitrary case when using the attribute A as the root is

H(C|A) = Σ_{j=1..v} p(Aj) H(C|Aj)

§ p(Aj) is the relative frequency of the cases having the value Aj for the attribute A in the set S
§ Information gained by branching on the attribute A is

I(C|A) = H(C) − H(C|A)

§ ID3 chooses the attribute resulting in the greatest information gain as the attribute for the root of the decision tree.
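A minimal sketch of the two formulas, checked against the Outlook split of the Tennis example (names are illustrative; computing with unrounded H values gives 0.247 rather than the 0.246 obtained from the rounded intermediate values):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

def information_gain(split):
    """split: for each value Aj of A, the class counts in the subset Sj.
    Returns I(C|A) = H(C) - sum_j p(Aj) * H(C|Aj)."""
    total = sum(sum(counts) for counts in split)
    class_totals = [sum(col) for col in zip(*split)]
    h_c = entropy(class_totals)
    h_c_a = sum(sum(counts) / total * entropy(counts) for counts in split)
    return h_c - h_c_a

# Outlook in the Tennis data: (P, N) counts for sunny, overcast, rain.
gain = information_gain([(2, 3), (4, 0), (3, 2)])
print(round(gain, 3))  # 0.247 (0.246 when H(C) and H(C|A) are rounded first)
```
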
26
ID3: Tests
§ Tests in the inner nodes take the form
§ A = Aj
§ The attribute A has the value Aj
§ The outcomes of a test are mutually exclusive.
§ There is a separate branch in the tree for each possible outcome.
27
ID3: Stopping criterion
§ ID3 assumes that the attributes are adequate.
§ It splits the data in a recursive fashion until all the cases of a node belong to the same class.
§ The class of a leaf node is defined on the basis of the class of the cases in the node.
§ If the leaf is empty (there are no cases with some particular value of an attribute), the class is unknown (the leaf is labelled ‘null’)
28
Example: ID3 (1)
Playing tennis (Quinlan 86). Cases: Saturday mornings. Classes: P = positive (play), N = negative (don’t play).

No  Outlook   Temperature  Humidity  Windy  Class
1   sunny     hot          high      false  N
2   sunny     hot          high      true   N
3   overcast  hot          high      false  P
4   rain      mild         high      false  P
5   rain      cool         normal    false  P
6   rain      cool         normal    true   N
7   overcast  cool         normal    true   P
8   sunny     mild         high      false  N
9   sunny     cool         normal    false  P
10  rain      mild         normal    false  P
11  sunny     mild         normal    true   P
12  overcast  mild         high      true   P
13  overcast  hot          normal    false  P
14  rain      mild         high      true   N
29
Example: ID3 (2)
§ Class P (positive class): play tennis
§ 9 cases
§ Class N (negative class): don’t play tennis
§ 5 cases
§ The expected information needed to classify an arbitrary case in S is

H(C) = −(9/14) log2 (9/14) − (5/14) log2 (5/14) = 0.940
30
Example: ID3 (3)
§ The expected information required for each of the subtrees after using the attribute Outlook to split the set S into 3 subsets:

outlook   P  N  H(C|Aj)
sunny     2  3  0.971
overcast  4  0  0
rain      3  2  0.971

sunny:    H(C|A1) = −(2/5) log2 (2/5) − (3/5) log2 (3/5) = 0.971
overcast: H(C|A2) = −(4/4) log2 (4/4) − (0/4) log2 (0/4) = 0
rain:     H(C|A3) = −(3/5) log2 (3/5) − (2/5) log2 (2/5) = 0.971
31
Example: ID3 (4)
§ The expected information needed to classify an arbitrary case for the tree with the attribute Outlook as the root is

H(C|A) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694

§ The information gained by branching on the attribute Outlook (A) is

I(C|A) = H(C) − H(C|A) = 0.940 − 0.694 = 0.246
32
Example: ID3 (5)
§ The information gain for the other candidate attributes is calculated similarly
§ I(C|temperature) = 0.029
§ I(C|humidity) = 0.151
§ I(C|windy) = 0.048
§ The attribute resulting in the greatest information gain is chosen as the attribute for the root of the decision tree.
§ I(C|outlook) = 0.246
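The gains above can be recomputed from the full training data; a sketch (small differences in the third decimal, e.g. 0.152 vs 0.151, come from the slides rounding the intermediate H values):

```python
from math import log2
from collections import Counter

# The 14 Saturday-morning cases: (outlook, temperature, humidity, windy, class).
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    g = entropy(labels)
    for value in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == value]
        g -= len(subset) / len(data) * entropy(subset)
    return g

for name, col in [("outlook", 0), ("temperature", 1), ("humidity", 2), ("windy", 3)]:
    print(name, round(gain(col), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048
```
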
33
Example: ID3 (6)
§ The attribute Outlook has been chosen and the cases have been divided into subsets according to their values of the Outlook attribute.

sunny:
(1, sunny, hot, …, N)
(2, sunny, hot, …, N)
(8, sunny, mild, …, N)
(9, sunny, cool, …, P)
(11, sunny, mild, …, P)

overcast:
(3, overcast, hot, …, P)
(7, overcast, cool, …, P)
(12, overcast, mild, …, P)
(13, overcast, hot, …, P)

rain:
(4, rain, mild, …, P)
(5, rain, cool, …, P)
(6, rain, cool, …, N)
(10, rain, mild, …, P)
(14, rain, mild, …, N)
34
Example: ID3 (7)
§ The branch corresponding to the outcome sunny is built next.

Cases:
(1, sunny, hot, high, false, N)
(2, sunny, hot, high, true, N)
(8, sunny, mild, high, false, N)
(9, sunny, cool, normal, false, P)
(11, sunny, mild, normal, true, P)

§ Calculate the expected information…

H(C) = −(2/5) log2 (2/5) − (3/5) log2 (3/5) = 0.971

§ …and the information gain for all candidate attributes...
35
Example: ID3 (8)

Temperature  P  N  H(C|Aj)
hot          0  2  0
mild         1  1  1
cool         1  0  0

Humidity  P  N  H(C|Aj)
high      0  3  0
normal    2  0  0

Windy  P  N  H(C|Aj)
false  1  2  0.918
true   1  1  1

H(C|Temperature) = (2/5) × 0 + (2/5) × 1 + (1/5) × 0 = 0.400
H(C|Humidity) = (3/5) × 0 + (2/5) × 0 = 0
H(C|Windy) = (3/5) × 0.918 + (2/5) × 1 = 0.951

I(C|Temperature) = 0.971 − 0.400 = 0.571
I(C|Humidity) = 0.971 − 0 = 0.971
I(C|Windy) = 0.971 − 0.951 = 0.020
36
Example: ID3 (9)
§ Humidity is chosen
§ Cases are sent down to the high and normal branches
§ The cases in the high branch are all of the same class: a leaf node is formed
§ The same situation in the normal branch
§ The branches for overcast and rain are built in a similar way…

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: (still to be built)
outlook = rain: (still to be built)
37
Example: ID3 (10)
§ The complete decision tree and the classification of a new case
(Outlook: rain, Temperature: hot, Humidity: high, Windy: true) → Play tennis? N

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
:   windy = false: P
38
Real-world classification tasks
§ Real-world data can be mixed.
§ Attributes may have different scales (both qualitative and quantitative).
§ Data may contain
§ missing values
§ noise (erroneous values)
§ exceptional values or value combinations
§ ID3 does not address issues arising in real-world classification tasks.
§ Modifications to the original algorithm are needed.
39
C4.5
§ Descendant of the ID3 algorithm (Quinlan 93)
§ Upgrades:
§ Gain ratio attribute selection criterion
§ Tests for value groups and quantitative attributes
§ No requirement of fully adequate attributes
§ Probabilistic approach for handling missing values
§ Pruning
§ Pre-pruning and post-pruning
§ Converting trees to rules
40
C4.5 – Attribute selection criterion
§ The information gain criterion has a tendency to favour attributes with many outcomes.
§ However, such attributes may be less relevant in prediction than attributes having a smaller number of outcomes.
§ An extreme example is an attribute that is used as an identifier. Identifiers have unique values resulting in pure nodes, but they don’t have predictive power.
§ To overcome this problem, the gain ratio criterion has been developed.
41
C4.5 – Gain ratio selection criterion
§ The gain ratio is calculated as

GR(C|A) = I(C|A) / H(A),

§ where I(C|A) is the information gain got from testing the attribute A, and
§ H(A) is the expected information needed to sort out the value of the attribute A, i.e. the uncertainty relating to the value of the attribute A:

H(A) = − Σ_{j=1..v} p(Aj) log2 p(Aj)

§ where p(Aj) is the probability of the value Aj (the relative frequency of the value Aj)
42
C4.5 – Gain ratio selection criterion
§ The gain ratio criterion selects the attribute
§ having the highest gain ratio
§ among those attributes whose information gain is at least the average information gain over all the attributes examined.
§ The information gain of the attribute has to be large.
43
C4.5 – Gain ratio selection criterion
§ Let’s calculate the gain ratio for the Outlook attribute of the Tennis example. The information gain I(C|A) for the attribute Outlook is 0.246. Calculate the expected information for the Outlook attribute:

outlook   freq
sunny     5
overcast  4
rain      5

H(A) = −(5/14) × log2 (5/14) − (4/14) × log2 (4/14) − (5/14) × log2 (5/14) = 1.577

§ The gain ratio for the attribute Outlook is

GR(C|A) = I(C|A) / H(A) = 0.246 / 1.577 = 0.156
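A quick numerical check of the calculation (a sketch; the entropy helper from the earlier slides doubles as the split information H(A) when given value frequencies):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

# Split information H(A) for Outlook: 5 sunny, 4 overcast and 5 rain cases.
split_info = entropy([5, 4, 5])
print(round(split_info, 3))          # 1.577

# Gain ratio, using the information gain I(C|outlook) = 0.246 from the slides.
print(round(0.246 / split_info, 3))  # 0.156
```
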
44
C4.5 – Test types
§ One branch for each possible attribute value
  Outlook: sunny | overcast | rain
§ Value groups
  Outlook: {sunny, overcast} | {rain}
§ Thresholds for quantitative attributes
  Humidity: <= 75 | > 75
45
C4.5 – Value groups
§ Tests based on qualitative attributes can take the form
outlook in {sunny, overcast}
outlook = rain
§ Why value groups?
§ To avoid too small subsets of cases
§ Useful patterns may become undetectable because of the scarcity of data
§ To assess equitably qualitative attributes that vary in their numbers of possible values
§ The gain ratio criterion is biased to prefer attributes having a small number of possible values
46
C4.5 – Value groups
§ Appropriate value groups can be determined on the basis of domain knowledge.
§ For each appropriate grouping, an additional attribute is formed in the preprocessing phase.
§ This approach is economical from a computational viewpoint.
§ Problem: The appropriateness of a grouping may depend on the context (the part of the tree). A “constant” grouping may be too crude.
47
C4.5 – Value groups
§ In C4.5, values are merged into groups in an iterative manner.
§ A greedy method
§ At first, each value forms its own group.
§ Then, all possible pairs of groups are formed.
§ A grouping yielding the highest gain ratio is chosen.
§ The process continues until just two value groups remain, or until no such merger would result in a better division of the training data.
§ Aims to find the grouping which results in the highest gain ratio.
§ Example on the next slide:
§ Michalski’s Soybean data
§ 35 attributes, 19 classes, 683 training cases
§ Attribute stem canker with four values: none, below soil, above soil, above 2nd node
48
C4.5 – Value groups
1) Partition into four one-value groups
2) Two one-value groups are merged
Based on the results of step 2, “above soil” and “above 2nd node” are merged
3) No merger at step 3 improves the situation – the process stops.
Final groups: {none}, {below soil}, {above soil, above 2nd node}
49
C4.5 – Value groups
§ From the overall viewpoint, the aim is to get simpler and more accurate trees.
§ The advantage of value groupings depends on the application domain.
§ The search for value groups can require a substantial increase in computation.
50
C4.5 – Quantitative attributes
§ Tests based on quantitative attributes employ thresholds.
§ The value of the attribute A is compared to some threshold Z.
§ A ≤ Z, A > Z
§ The threshold is defined dynamically.
§ Cases are first sorted on the values of the attribute A being considered.
§ A1, A2, …, Aw
§ The midpoint of adjacent values Ak and Ak+1,

(Ak + Ak+1) / 2,

is a possible threshold Z that divides the cases of the training set S into two subsets.
51
C4.5 – Quantitative attributes
§ There are w − 1 candidate thresholds.
§ The best threshold is the one that results in the largest gain ratio.
§ The largest value of the attribute A in the training set that does not exceed the best midpoint is chosen as the threshold.
§ All the threshold values appearing in the tree actually occur in the training data.
§ After finding the threshold, the quantitative attribute can be compared to qualitative and to other quantitative attributes in the usual way.
52
C4.5 – Quantitative attributes
§ Finding the threshold value Z dynamically during tree construction:

A   Class
32  P
46  N
52  P
58  P

    A ≤ Z   A > Z
Z   P  N    P  N
39  1  0    2  1
49  1  1    2  0
55  2  1    1  0

§ Cases are first sorted on the values of the attribute A.
§ Midpoints of successive values are possible thresholds.
§ The gain ratio is calculated for each candidate threshold.
§ The best candidate is the one resulting in the highest gain ratio.
§ Choose as the threshold the biggest value of A in the training set that does not exceed the best candidate (midpoint).
§ The candidate threshold 49 yields the highest gain ratio, and thus 46 is chosen as the threshold.
§ A ≤ 46, A > 46
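The procedure can be sketched in Python with the slide's four cases; candidates are scored with the gain ratio as described above (all names are illustrative):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(labels.count(c) / n) * log2(labels.count(c) / n)
               for c in set(labels))

def best_threshold(cases):
    """cases: (value, class) pairs; a sketch of the slide's procedure.
    Candidates are midpoints of successive sorted values; each candidate
    is scored with the gain ratio, and the returned threshold is the
    largest value occurring in the data not exceeding the best midpoint."""
    cases = sorted(cases)
    values = [v for v, _ in cases]
    labels = [c for _, c in cases]
    best_score, best_z = -1.0, None
    for a, b in zip(values, values[1:]):
        if a == b:
            continue
        z = (a + b) / 2                          # candidate midpoint
        left = [c for v, c in cases if v <= z]
        right = [c for v, c in cases if v > z]
        gain = (entropy(labels)
                - len(left) / len(cases) * entropy(left)
                - len(right) / len(cases) * entropy(right))
        split_info = entropy(["le"] * len(left) + ["gt"] * len(right))
        score = gain / split_info
        if score > best_score:
            best_score, best_z = score, z
    threshold = max(v for v in values if v <= best_z)
    return best_z, threshold

print(best_threshold([(32, "P"), (46, "N"), (52, "P"), (58, "P")]))  # (49.0, 46)
```
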
53
C4.5 – Quantitative attributes
Tennis playing (Quinlan 86); Humidity has been measured on a quantitative scale

No  Outlook   Temperature  Humidity  Windy  Class
1   sunny     hot          85        false  N
2   sunny     hot          90        true   N
3   overcast  hot          78        false  P
4   rain      mild         96        false  P
5   rain      cool         80        false  P
6   rain      cool         70        true   N
7   overcast  cool         65        true   P
8   sunny     mild         95        false  N
9   sunny     cool         70        false  P
10  rain      mild         80        false  P
11  sunny     mild         70        true   P
12  overcast  mild         90        true   P
13  overcast  hot          75        false  P
14  rain      mild         96        true   N
54
C4.5 – Quantitative attributes
§ An example of a decision tree built from the Tennis data in which the attribute humidity has been measured on a quantitative scale.

outlook = overcast: P
outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = rain:
:...windy = true: N
    windy = false: P

outlook = overcast: P
outlook = sunny:
:...humidity <= 75: P
:   humidity > 75: N
outlook = rain:
:...windy = true: N
    windy = false: P
55
C4.5 – Ordinal attributes
§ Ordinal attributes can be handled either in the same way as nominal attributes or in the same way as quantitative attributes.
§ Processing of quantitative attributes is based on the ordering of values. Values of ordinal attributes have a natural order, and thus the approach employed for quantitative attributes can be utilised with ordinal attributes, too.
56
C4.5 – Stopping criterion
§ Stopping criteria
§ All the cases in a node belong to the same class
§ There are no cases in a node
§ None of the attributes improves the situation in a node
§ The number of cases in a node is too small for continuing the splitting process:
§ Every test must have at least two outcomes having the minimum number of cases.
– The default value for the number of cases is 2.
57
C4.5 – Leaves
§ A leaf can contain
§ cases all belonging to a single class Cj:
§ The class Cj is associated with the leaf.
§ no cases:
§ The most frequent class (the majority class) at the parent of the leaf is associated with the leaf.
§ cases belonging to a mixture of classes:
§ The most frequent class (the majority class) at the leaf is associated with the leaf.
58
C4.5 – Missing values
§ Real-world data often have missing attribute values.
§ Missing values may be e.g. filled in (imputed) with
§ the mode, median or mean of the complete cases of a class
§ estimates given by some “more intelligent” method
before running the decision tree program. However, imputation is not unproblematic.
§ Algorithms can be amended to cope with missing values
§ in the tree construction
§ selecting tests
§ sending cases to subtrees
§ when the tree is used in prediction
§ submitting cases to subtrees
59
C4.5 – Missing values
§ Missing values are taken into account when calculating the information gain:

I(C|A) = p(Aknown) × (H(C) − H(C|A)) + p(Aunknown) × 0
       = p(Aknown) × (H(C) − H(C|A))

§ where p(Aknown) is the probability that the value of the attribute A is known (i.e. the relative frequency of those cases for which the value of the attribute A is known)
§ and when calculating the expected information H(A) needed to test the value of the attribute A:
§ Let an attribute A have the values A1, A2, …, Av.
§ Missing values are now treated as a value of their own, the value v+1.

H(A) = − Σ_{j=1..v+1} p(Aj) log2 p(Aj)
60
C4.5 – Missing values
§ Let us assume that the Tennis example has one missing value…

No  Outlook   Temperature  Humidity  Windy  Class
1   sunny     hot          85        false  N
2   sunny     hot          90        true   N
3   overcast  hot          78        false  P
4   rain      mild         96        false  P
5   rain      cool         80        false  P
6   rain      cool         70        true   N
7   overcast  cool         65        true   P
8   sunny     mild         95        false  N
9   sunny     cool         70        false  P
10  rain      mild         80        false  P
11  sunny     mild         70        true   P
12  ?         mild         90        true   P   ← missing value
13  overcast  hot          75        false  P
14  rain      mild         96        true   N
61
C4.5 – Missing values
§ The information gain for the Outlook attribute is calculated on the basis of the 13 cases having a known value.

outlook   P  N  H(C|Aj)
sunny     2  3  0.971
overcast  3  0  0
rain      3  2  0.971

H(C) = −(8/13) × log2 (8/13) − (5/13) × log2 (5/13) = 0.961
H(C|A) = (5/13) × 0.971 + (3/13) × 0 + (5/13) × 0.971 = 0.747
I(C|A) = (13/14) × (0.961 − 0.747) = 0.199
62
C4.5 – Missing values
§ The expected information needed to test the value of the Outlook attribute is calculated with the missing value treated as a value of its own (sunny 5, overcast 3, rain 5, ? unknown 1):

H(A) = −(5/14) × log2 (5/14) − (3/14) × log2 (3/14) − (5/14) × log2 (5/14) − (1/14) × log2 (1/14) = 1.809

§ The gain ratio for the Outlook attribute is

GR(C|A) = I(C|A) / H(A) = 0.199 / 1.809 = 0.110
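Both calculations on this and the previous slide can be checked with a short sketch (names are illustrative):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

# Outlook with one missing value: (P, N) counts over the 13 known cases.
split = [(2, 3), (3, 0), (3, 2)]                  # sunny, overcast, rain
known = sum(sum(s) for s in split)                # 13 of the 14 cases
h_c = entropy([sum(col) for col in zip(*split)])  # H(C) over known cases
h_c_a = sum(sum(s) / known * entropy(s) for s in split)
gain = known / 14 * (h_c - h_c_a)                 # weighted by p(A known)
print(f"{gain:.3f}")                              # 0.199

# H(A) treats 'missing' as an extra value: sunny 5, overcast 3, rain 5, ? 1.
split_info = entropy([5, 3, 5, 1])
print(f"{split_info:.3f}")                        # 1.809
print(f"{gain / split_info:.3f}")                 # 0.110
```
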
63
C4.5 – Missing values
§ When cases are sent to subtrees, a weight is given to each case.
§ If the tested attribute value is known, the case is sent to the branch corresponding to the outcome Oi with the weight w = 1.
§ Otherwise, a fraction of the case is sent to each branch Oi with the weight w = p(Oi).
§ p(Oi) is the probability (the relative frequency) of the outcome Oi in the current node.
§ The case is divided between the possible outcomes {O1, O2, …, Ov} of the test.

The 13 cases with a known value for the Outlook attribute are sent to the corresponding sunny, overcast or rain branches with the weight w = 1. Case 12 is divided between the branches:
sunny: case 12 with w = 5/13; overcast: case 12 with w = 3/13; rain: case 12 with w = 5/13
64
C4.5 – Missing values
§ Cases in the sunny branch:

Case no  Outlook  Temperature  Humidity  Windy  Class  Weight
1        sunny    hot          85        FALSE  N      1
2        sunny    hot          90        TRUE   N      1
8        sunny    mild         95        FALSE  N      1
9        sunny    cool         70        FALSE  P      1
11       sunny    mild         70        TRUE   P      1
12       ?        mild         90        TRUE   P      5/13 = 0.4

§ The number of cases in a node is now interpreted as the sum of the weights of the (fractional) cases in the node.
§ There may be whole cases and fractional cases in a node.
§ A case came to a node with the weight w. It is sent to the node(s) of the next level with the weight
§ w’ = w × 1 (the value of the attribute of the current node is known)
§ w’ = w × p(Oi) (the value of the attribute of the current node is unknown)
65
C4.5 – Missing values
§ Cases in the sunny branch:

Case no  Outlook  Temperature  Humidity  Windy  Class  Weight
1        sunny    hot          85        FALSE  N      1
2        sunny    hot          90        TRUE   N      1
8        sunny    mild         95        FALSE  N      1
9        sunny    cool         70        FALSE  P      1
11       sunny    mild         70        TRUE   P      1
12       ?        mild         90        TRUE   P      5/13 = 0.4

§ Let us assume that this subset is partitioned further by the test on humidity.
§ The branch “humidity <= 75” has cases from the single class P.
§ The branch “humidity > 75” has cases from both classes (class P 0.4/3.4 and class N 3/3.4).
§ Since no test improves the situation further, a leaf is made (the most frequent class in the node gives the class label).
66
C4.5 – Missing values
§ A decision tree constructed from the data having a missing value:

outlook = overcast: P (3.2)
outlook = sunny:
:...humidity <= 75: P (2)
:   humidity > 75: N (3.4/0.4)
outlook = rain:
:...windy = true: N (2.4/0.4)
    windy = false: P (3)

§ The tree is like the tree constructed from the original data, but now some leaves have a marking (N/E)
§ N is the sum of the fractional cases belonging to the leaf
§ E is the sum of those cases misclassified by the leaf (i.e. the sum of the fractional cases belonging to classes other than the one suggested by the leaf)
§ The majority class gives the class label of the node.
§ The majority class = the biggest class in the node
67
C4.5 – Missing values
§ Classification of a new case
§ If the new case has a missing value for the attribute tested in the current node, the case is divided between the outcomes of the test.
§ Now the case has multiple classification paths from the root to the leaves, and therefore the “classification” is a class distribution.
§ The majority class is the predicted class.
68
C4.5 – Missing values
§ A case having a missing value is classified
§ Outlook: sunny, temperature: mild, humidity: ?, windy: false

outlook = overcast: P (3.2)
outlook = sunny:
:...humidity <= 75: P (2)
:   humidity > 75: N (3.4/0.4)
outlook = rain:
:...windy = true: N (2.4/0.4)
    windy = false: P (3)

§ If the humidity were less than or equal to 75, the class for the case would be P.
§ If the humidity were greater than 75, the class for the case would be N with the probability of 3/3.4 (88%) and P with the probability of 0.4/3.4 (12%).
§ The results from the two humidity branches are summed for the final class distribution
§ class P: (2.0/5.4) × 100% + (3.4/5.4) × 12% = 44%
§ 2 of the 5.4 training cases belonged to the “humidity <= 75” branch, and in this branch the probability of the class P is 100%
§ 3.4 of the 5.4 training cases belonged to the “humidity > 75” branch, and in this branch the probability of the class P is 12%
§ class N: (3.4/5.4) × 88% = 56%
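The summation can be checked numerically (a sketch with illustrative names):

```python
# The sunny subtree's two humidity leaves, with the summed training-case
# weights from the slide: total weight and the weight of class P in each.
branches = [
    {"total": 2.0, "P": 2.0},   # humidity <= 75: pure P leaf
    {"total": 3.4, "P": 0.4},   # humidity > 75:  mostly N leaf
]
total = sum(b["total"] for b in branches)   # 5.4
p_P = sum(b["total"] / total * (b["P"] / b["total"]) for b in branches)
print(f"P: {p_P:.0%}  N: {1 - p_P:.0%}")    # P: 44%  N: 56%
```
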
69
Underfitting and overfitting
§ Underfitting: when the model is too simple, both the training and test errors are large.
§ Overfitting: the test error rate starts to increase while the training error rate continues to decrease.
70
Overfitting
§ The built decision tree may overfit the training data.
§ The tree is complex. Its lowest branches reflect noise and outliers occurring in the training data.
§ Lower classification accuracy on unseen cases
§ Reasons for overfitting
§ Noise and outliers
§ Inadequate attributes
§ Too small training data
§ A local maximum in the greedy search
71
Pruning
§ Overfitting can be overcome by pruning.
§ Pruning generally results in
§ faster classification
§ better classification accuracy on unseen cases
§ Pruning decreases the accuracy on the training data.
§ Pre-pruning
§ Stop the tree construction early.
§ Post-pruning
§ Let the tree grow “full” and remove branches from the “fully grown” tree.
§ In a combined approach, both pre- and post-pruning are used.
72
Pruning
(a) The branch marked with a star (*) may be partly based on erroneous or exceptional cases.
(b) The tree growth has been stopped. (pre-pruning)
(c) The tree has grown “full” (the tree “a”), after which it has been pruned. (post-pruning)
73
Pruning
§ The tree growth can be limited in many ways.
§ Define a minimum for the number of cases in a node.
§ If the number of cases in a node is below the minimum, the recursive division of the example set is stopped and a leaf is formed.
§ The leaf is labelled with the majority class or the class distribution.
§ Define a threshold for the attribute selection criterion.
§ The problem: the definition of a suitable threshold
§ too high a threshold: oversimplification: useful attributes are discarded
§ too low a threshold: no simplification at all (or little simplification)
74
Post-pruning
§ Usually it is more profitable to let the tree grow complete and prune it afterwards than to halt the tree growth.
§ If the tree growth is halted, all the branches growing from a node are lost.
§ Post-pruning allows saving some of the branches.
§ Post-pruning requires more calculation than pre-pruning, but it usually results in more reliable trees.
§ In post-pruning, parts of the tree whose removal does not decrease the classification accuracy on unseen cases are discarded.
75
Post-pruning
§ Post-pruning is based on the classification errors made by the tree.
§ The error rate of a leaf is E/N
§ N is the number of training cases belonging to the leaf
§ E is the number of cases that do not belong to the class suggested by the leaf
§ The error rate of the whole tree: E and N are summed over all the leaves
§ A predicted error rate: the error rate on new cases
76
Post-pruning
§ The basic idea of post-pruning:
§ Start from the bottom of the tree and examine each subtree that is not a leaf.
§ If replacement of the subtree with a leaf (or with its most frequently used branch) would reduce the predicted error rate, then prune the tree accordingly.
§ When the error rate of any of the subtrees is reduced, the error rate of the whole tree is also reduced.
§ There can be cases from several classes in a leaf, and thus the leaf is labelled with the majority class.
§ The error rate can be predicted by using the training set or a new set of cases.
§ Not a topic of this course
77
C4.5 Pruning
§ Prepruning
§ Every test must have at least two outcomes that contain at least the minimum number of cases.
§ Because of missing values, the minimum number of cases is actually the minimum for the summed weights of the cases.
§ The default minimum number of cases is 2.
§ Postpruning
§ A “very” pessimistic method based on estimated error rates
§ How the very pessimistic estimates are calculated is not a topic of this course; however, the idea of the pruning is presented on the next slides.
78
C4.5 postpruning
§ Example: original, complete tree (Quinlan 93)
§ Congressional voting data, UCI Machine Learning Repository
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)
physician fee freeze = y:
:...synfuels corporation cutback = n: republican (97.0/3.0)
:   synfuels corporation cutback = u: republican (4.0)
:   synfuels corporation cutback = y:
:   :...duty free exports = y: democrat (2.0)
:       duty free exports = u: republican (1.0)
:       duty free exports = n:
:       :...education spending = n: democrat (5.0/2.0)
:           education spending = y: republican (13.0/2.0)
:           education spending = u: democrat (1.0)
physician fee freeze = u:
:...water project cost sharing = n: democrat (0.0)
:   water project cost sharing = y: democrat (4.0)
:   water project cost sharing = u:
:   :...mx missile = n: republican (0.0)
:       mx missile = y: democrat (3.0/1.0)
:       mx missile = u: republican (2.0)
79
C4.5 postpruning
§ Pruned tree
§ The original tree had 17 leaves; the pruned one has 5 leaves.
physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = y: republican (123.0/13.9)
physician fee freeze = u:
:...mx missile = n: democrat (3.0/1.1)
    mx missile = y: democrat (4.0/2.2)
    mx missile = u: republican (2.0/1.0)
Subtrees have been replaced with leaves
The subtree has been replaced with its most frequently used branch
123 training cases in the leaf
If 123 new cases were classified, 13.9 cases would be misclassified (a very pessimistic estimate)
80
C4.5 postpruning
physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)
The subtree has been replaced with a leaf
168 training cases in the leaf. One of them is misclassified by the leaf.
If 168 new cases were classified, 2.6 cases would be misclassified (a very pessimistic estimate)
81
C4.5 postpruning
: adoption of the budget resolution = n: democrat (16.0/2.512)
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)
First, the subtree has been replaced with a leaf
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)
One training case is misclassified. The very pessimistic estimate: the number of predicted errors is 2.512
The very pessimistic estimate: the sum of predicted errors is 3.273
82
C4.5 postpruning
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)
Then, the subtree has been replaced with a leaf
physician fee freeze = n: democrat (168.0/2.6)
The very pessimistic estimate: the sum of predicted errors is 4.642
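Both replacements on these slides reduce to the same comparison: a subtree is collapsed into a leaf when the leaf's pessimistic error estimate does not exceed the sum over the subtree's leaves. A minimal sketch using the figures from the slides (the function name is illustrative):

```python
def replace_subtree(subtree_error_sum, leaf_errors):
    """C4.5 collapses a subtree into a leaf when the leaf's pessimistic
    predicted errors do not exceed the sum over the subtree's leaves."""
    return leaf_errors <= subtree_error_sum

# 'adoption of the budget resolution = n' subtree: leaf 2.512 vs subtree 3.273
print(replace_subtree(3.273, 2.512))  # True: pruned to democrat (16.0/2.512)
# 'physician fee freeze = n' subtree: leaf 2.6 vs subtree sum 4.642
print(replace_subtree(4.642, 2.6))    # True: pruned to democrat (168.0/2.6)
```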
83
C4.5 postpruning
§ Interpretation of the numbers (N/E) in a pruned tree
§ N is the number of training cases in the leaf
§ E is the number of predicted errors if a set of N unseen cases were classified by the tree
§ The sum of the predicted errors over the leaves, divided by the size of the training set (the number of training cases), provides an immediate estimate of the error rate of the pruned tree on new cases.
§ 20.8/300 = 0.069 (6.9%) (The pruned tree will misclassify 6.9% of new cases.)
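The 20.8 is simply the sum of the predicted-error figures in the five leaves of the pruned tree shown earlier (2.6, 13.9, 1.1, 2.2 and 1.0), and the 300 is the sum of their training-case counts:

```python
# (N, E) pairs read off the pruned Congressional voting tree
pruned_leaves = [(168.0, 2.6), (123.0, 13.9), (3.0, 1.1), (4.0, 2.2), (2.0, 1.0)]

total_cases = sum(n for n, _ in pruned_leaves)      # 300.0
total_predicted = sum(e for _, e in pruned_leaves)  # 20.8
print(round(total_predicted / total_cases, 3))      # 0.069, i.e. 6.9%
```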
84
C4.5 postpruning
§ Results for the Congressional voting data
Training set (300 cases)
  Complete tree: 25 nodes, 8 errors (2.7%)
  Pruned tree:    7 nodes, 13 errors (4.3%)

Test set (135 cases)
  Complete tree: 25 nodes, 7 errors (5.2%)
  Pruned tree:    7 nodes, 4 errors (3.0%)
§ 10-fold cross-validation gives an error rate of 5.3% on new cases (the average predicted, very pessimistic error rate on new cases is 5.6%)
85
DTI pros
§ Construction of a tree does not (necessarily) require any parameter setting
§ Can handle high-dimensional data
§ Can handle heterogeneous data
§ Nonparametric approach
§ Representation form is intuitive, relatively easy to interpret
86
DTI pros
§ Learning and classification steps are simple and fast
§ Learning: the complexity depends on the number of nodes, cases and attributes
§ In each node: O(n′p); for quantitative attributes O(p n′ log n′)
– n′ = number of cases in the node
– p = number of attributes
§ Classification: O(w), where w is the maximum depth of the tree
§ An ”eager” method: training is computationally more expensive than classification
§ Quite robust to the presence of noise
§ In general, good classification accuracy, comparable with other classification methods
87
DTI other issues
§ Decision tree algorithms divide the training data into smaller and smaller subsets in a recursive fashion.
§ Problems§ Data fragmentation§ Repetition§ Replication
§ Data fragmentation
§ The number of instances at the leaf nodes can be too small to make any statistically significant decision
88
DTI other issues
§ Repetition
§ An attribute is repeatedly tested along some branch of the decision tree
§ Replication
§ A decision tree contains duplicate subtrees
[Figure: example trees built from tests on P, Q, R and S; the subtree testing S appears in more than one branch (replication)]
89
DTI other issues: decision boundary
• The border line between two neighbouring regions of different classes is known as the decision boundary.
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
90
DTI other issues: multivariate split
§ Multivariate splits based on a combination of attributes
§ More expressive representation
§ The use of multivariate splits can prevent the problems of fragmentation, repetition and replication.
§ Finding the optimal test condition is computationally expensive.
[Figure: a two-class scatter plot (Class = + and Class = −) separated by the oblique decision boundary x + y < 1]
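The contrast between the two kinds of test can be sketched directly: a univariate test cuts perpendicular to one axis, while a multivariate test such as x + y < 1 cuts obliquely. The threshold 0.5 in the univariate test is an illustrative assumption; only x + y < 1 comes from the slide.

```python
def univariate_test(point):
    """Axis-parallel split: the condition involves a single attribute."""
    return point["x"] < 0.5  # illustrative threshold

def multivariate_test(point):
    """Oblique split: the condition combines attributes, as in x + y < 1."""
    return point["x"] + point["y"] < 1

p = {"x": 0.3, "y": 0.4}
print(univariate_test(p), multivariate_test(p))  # True True
```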
91
DTI other issues
§ Decision tree induction is a widely studied topic; different kinds of enhancements to the basic algorithm have been developed.
§ challenges arising from real-world data: quantitative attributes, missing values, noise, outliers
§ multivariate decision trees
§ incremental decision tree induction
§ updatable decision trees
§ scalable decision tree induction
§ Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
92
DTI other issues
§ C4.5 is a kind of reference algorithm used in machine learning research.
§ In this course we will use See5, a descendant of C4.5.
§ A demonstration version of See5 is freely available from
§ http://www.rulequest.com/download.html
§ The source code of C4.5 is freely available for research and teaching from
§ http://www.rulequest.com/Personal/c4.5r8.tar.gz
§ written in C
93
References
§ These slides are partly based on the slides of the books:
§ Han J, Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. http://wwwsal.cs.uiuc.edu/~hanj/bk2/
§ Tan PN, Steinbach M, Kumar V. Introduction to Data Mining. Addison-Wesley, 2006. http://wwwusers.cs.umn.edu/~kumar/dmbook/
§ Hand D, Mannila H, Smyth P. Principles of Data Mining. MIT Press, 2001.
§ Mitchell TM. Machine Learning. McGraw-Hill, 1997.
§ Quinlan JR. Induction of decision trees. Machine Learning 1: 81–106, 1986.
§ Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
§ Quinlan JR. See5. http://www.rulequest.com