Timu a11 Classification Decision Tree Induction
1
Knowledge discovery
Classification: Decision tree induction
Kati Iltanen, Computer Sciences, School of Information Sciences, University of Tampere
2
Classification
§ Aim: to predict the value of a qualitative attribute
§ class labels (the values of the target attribute, the class) are predicted
§ Every case belongs to one of the mutually exclusive classes. This class is known.
§ supervised learning
§ The classification method classifies the training data based on the attribute values and class labels.
§ constructs a model
3
Classification
§ The model is used for classifying new data.
§ The model is evaluated (e.g. accuracy and a subjective estimate).
§ If the model is acceptable, it is used to classify cases whose class labels are not known.
§ new data (unknown data, previously unseen data)
§ Classification methods
§ Decision trees, rules, the k-nearest-neighbour method, the naïve Bayesian classifier, neural networks, ...
§ Application examples:
§ To give a diagnosis suggestion on the basis of the symptoms and test results of a patient
§ To predict the paying capacity of a loan applicant
4
Constructing and testing a classifier

Training data (known class labels):
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learning algorithm

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
ELSE tenured = ‘no’

Test data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Predicted classes: no, yes, yes, yes
The model misclassifies the second test case (gives ‘yes’, the known class is ‘no’).
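The learned rule above can be sketched as a small Python function and checked against the test data (the function and variable names are illustrative):

```python
# A sketch of the slide's learned rule (names are illustrative).
def tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

predictions = [tenured(rank, years) for _, rank, years, _ in test_data]
print(predictions)  # ['no', 'yes', 'yes', 'yes']
# Merlisa (years = 7 > 6) is predicted 'yes' although her known class is 'no'.
```
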
5
Using the classifier

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
ELSE tenured = ‘no’

New data: (Jeff, Professor, 4) → Tenured? Yes
6
Decision tree induction
§ TDIDT (Top-Down Induction of Decision Trees)
§ Inductive learning: general knowledge from separate cases
§ Cases are described using fixed-length attribute vectors.
§ Each case belongs to one class.
§ Classes are mutually exclusive.
§ The class of a case is known: supervised learning
§ Knowledge is represented in the form of a decision tree.
§ A decision tree is a classification model.
§ The tree is constructed in a top-down manner (from the root to the leaves).
7
Decision tree induction

Training data: Saturday mornings
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Decision tree:
outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
:   windy = false: P

Classes: P = play tennis, N = don’t play tennis (Quinlan 86)
8
Decision tree
§ Inner nodes contain tests based on attributes (test nodes)
§ Branches correspond to the outcomes of the tests (attribute values)
§ Leaf nodes (leaves) contain the class information (one class or a class distribution)

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
:   windy = false: P
9
Decision tree
§ Classification of a new case starts from the root of the tree.
§ The attribute assigned to the root node is examined and the branch corresponding to the attribute value is followed.
§ This process continues until a leaf node is encountered.
§ The leaf predicts the class of the new case.

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
:   windy = false: P
10
Decision tree
§ The classification path from the root to a leaf gives an explanation for the decision.
§ The number of tested attributes depends on the classification path.
§ It is not necessary to test all the attributes in all the paths.
§ A classification path: a conjunction of constraints set on attributes
§ A decision tree: a disjunction of the classification paths
11
Building a decision tree
§ Building a decision tree is a two-step process
§ Tree construction
§ A complete (fully grown) tree is built based on the training data.
§ (pre-pruning: the growth of the tree is restricted)
§ Tree pruning
§ post-pruning: branches are pruned from a complete tree (or from a pre-pruned tree)
12
TDIDT: Basic algorithm
§ A decision tree is constructed in a top-down, recursive, divide-and-conquer manner.
§ In the beginning, all the training examples are at the root.
§ If the stopping criterion is fulfilled, a leaf node is formed.
§ If the stopping criterion is not fulfilled, the best attribute is selected according to some criterion (a greedy algorithm) and
§ a test node is formed
§ cases are divided into subsets according to the values of the chosen attribute
§ a decision tree is formed recursively for each subset
13
TDIDT: Basic algorithm
Generate a decision tree
(1) Create a node N
(2) if (stopping criterion is fulfilled)
(3)   Make a leaf node (node N)
(4) else
(5)   Choose the best attribute and make a
(6)   test node (node N) that tests the chosen attribute
(7)   Divide the cases into subsets according to the
(8)   values of the chosen attribute
(9)   Generate a decision tree for each subset
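Steps (1)-(9) can be sketched in Python. This is a minimal illustration, with the attribute selection deliberately left to a caller-supplied function since the selection criterion is discussed later (all names are illustrative):

```python
# A minimal TDIDT sketch following steps (1)-(9). Cases are dicts with a
# "class" key plus one key per attribute; `choose_attribute` is supplied
# by the caller (a greedy selection criterion, e.g. information gain).
def tdidt(cases, attributes, choose_attribute):
    classes = {c["class"] for c in cases}
    if len(classes) <= 1 or not attributes:        # stopping criterion
        # leaf node: the single class, or the majority class
        label = max(classes, default=None,
                    key=lambda cl: sum(c["class"] == cl for c in cases))
        return {"leaf": label}
    best = choose_attribute(cases, attributes)     # greedy choice
    node = {"test": best, "branches": {}}
    for value in {c[best] for c in cases}:         # divide into subsets
        subset = [c for c in cases if c[best] == value]
        rest = [a for a in attributes if a != best]
        node["branches"][value] = tdidt(subset, rest, choose_attribute)
    return node
```

With a trivial selection function (take the first attribute), `tdidt` already builds a tree; plugging in the information gain criterion defined on the later slides turns it into an ID3-style learner.
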
14
TDIDT: Key questions
§ How to select the best attribute?
§ How to specify the attribute test condition?
§ How to form inner nodes and branches?
§ When to stop the recursive splitting?
§ How to form decision nodes (leaves)?
§ How to prune a tree?
15
Attribute selection criterion
§ How to select the best attribute?
§ Adequacy of attributes
§ Attributes are adequate for the classification task if all the cases having the same attribute values belong to the same class.
§ If the attributes are adequate, it is always possible to construct a decision tree which correctly classifies all the training data.
§ Usually there exist several correctly classifying decision trees.
§ In the worst case, there is a leaf in the tree for each of the training cases.
16
Simple decision tree
A simple decision tree for the “Tennis playing” classification task
17
Complex decision tree
A complex decision tree for the same classification task
18
Attribute selection criterion
§ The aim is to generate simple (small) decision trees.
§ Derives from the principle called Occam’s razor:
§ If there are two models having the same accuracy on the training data, the smaller (simpler) one can be seen as more general and thus better.
§ Smaller trees: more general, easier to understand and possibly more accurate in classifying unseen cases
§ Try to generate simple trees by generating simple nodes.
§ The complexity of a node is
§ at its largest when the node has an equal number of cases from every class
§ at its smallest when the node has cases from one class only
§ Heuristic attribute selection measures (measures of goodness of split) are used. These aim to generate homogeneous (pure) child nodes (subsets).
19
TDIDT algorithm family
§ CLS (Concept Learning System)
§ E.B. Hunt (’50s and ’60s)
§ To simulate human problem-solving methods
§ Analysing the content of English texts, medical diagnostics
§ ID3 (Iterative Dichotomizer 3)
§ J.R. Quinlan (end of the ’70s)
§ Chess endgames
§ Applications from medical diagnostics to scouting
§ Other early decision tree algorithms
§ CART (Classification and Regression Trees) (84)
§ Assistant (84)
§ C4.5, C5, See5
§ descendants of ID3
§ Address issues arising in real-world classification tasks
§ C4.5 is one of the most widely used machine learning algorithms, frequently used as a reference algorithm in machine learning research
20
ID3
§ Assumes that
§ attributes are categorical and have a small number of possible values
§ the class (the target attribute) has two possible values
§ applicable to classification tasks with two classes
§ attributes are adequate
§ data contain no missing values
§ ID3 selects the best attribute according to a criterion called information gain
§ The criterion selects the attribute that maximises information gain (or minimises entropy)
21
ID3: Attribute selection criterion
§ Let
§ S be a training set that contains s cases (s is the number of cases)
§ the class attribute C have values C1, …, Cm (m is the number of classes)
§ In ID3, m = 2
§ si be the number of cases belonging to the class Ci in the training set S, and p(Ci) = si/s the relative frequency of the class Ci in S
22
ID3: Attribute selection criterion
§ The expected information needed to classify an arbitrary case in S (or the entropy of C in S) is

H(C) = − Σ_{i=1..m} p(Ci) log2 p(Ci)

§ 2-based logarithm, because the information is coded in bits
§ We define in this context that if p(Ci) = 0, then p(Ci) log2 p(Ci) = 0 (zero)
§ Recall: log_a x = k ⇔ a^k = x, where a ∈ R+ \ {1} and x ∈ R+
§ and the change of base: log_a l = log_k l / log_k a
23
ID3: Attribute selection criterion

C1: 0, C2: 6
p(C1) = 0/6 = 0, p(C2) = 6/6 = 1
H(C) = −0 log2 0 − 1 log2 1 = −0 − 0 = 0

C1: 1, C2: 5
p(C1) = 1/6, p(C2) = 5/6
H(C) = −(1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4
p(C1) = 2/6, p(C2) = 4/6
H(C) = −(2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92

C1: 3, C2: 3
p(C1) = 3/6, p(C2) = 3/6
H(C) = −(3/6) log2 (3/6) − (3/6) log2 (3/6) = 1

§ Maximum (= log2 m) when cases are equally distributed among the classes
§ m = number of classes
§ Minimum (= 0) when all cases belong to the same class
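The entropy values above can be reproduced with a few lines of Python (a sketch; the helper name is illustrative):

```python
from math import log2

# H(C) = -sum p(Ci) log2 p(Ci); the convention 0 * log2 0 = 0 is handled
# by skipping empty classes.
def entropy(counts):
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

# The four class distributions from the slide:
print(round(entropy([0, 6]), 2))  # 0.0  (minimum: all cases in one class)
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
print(round(entropy([3, 3]), 2))  # 1.0  (maximum log2 2 for two classes)
```
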
24
ID3: Attribute selection criterion
§ Let an attribute A have the values Aj, j = 1, …, v
§ Let the set S be divided into subsets {S1, S2, …, Sv} according to the values of the attribute A
§ The expected information needed to classify an arbitrary case in the branch corresponding to the value Aj is

H(C|Aj) = − Σ_{i=1..m} p(Ci|Aj) log2 p(Ci|Aj)

§ Consider only those cases having the value Aj for the attribute A and calculate p(Ci) in the set of these cases
25
ID3: Attribute selection criterion
§ The expected information needed to classify an arbitrary case when using the attribute A as the root is

H(C|A) = Σ_{j=1..v} p(Aj) H(C|Aj)

§ p(Aj) is the relative frequency of the cases having the value Aj for the attribute A in the set S
§ Information gained by branching on the attribute A is

I(C|A) = H(C) − H(C|A)

§ ID3 chooses the attribute resulting in the greatest information gain as the attribute for the root of the decision tree.
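A minimal sketch of the two formulas, checked against the Outlook split of the Tennis example (names are illustrative; computing with unrounded H values gives 0.247 rather than the 0.246 obtained from the rounded intermediate values):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

def information_gain(split):
    """split: for each value Aj of A, the class counts in the subset Sj.
    Returns I(C|A) = H(C) - sum_j p(Aj) * H(C|Aj)."""
    total = sum(sum(counts) for counts in split)
    class_totals = [sum(col) for col in zip(*split)]
    h_c = entropy(class_totals)
    h_c_a = sum(sum(counts) / total * entropy(counts) for counts in split)
    return h_c - h_c_a

# Outlook in the Tennis data: (P, N) counts for sunny, overcast, rain.
gain = information_gain([(2, 3), (4, 0), (3, 2)])
print(round(gain, 3))  # 0.247 (0.246 when H(C) and H(C|A) are rounded first)
```
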
26
ID3: Tests
§ Tests in the inner nodes take the form
§ A = Aj
§ The attribute A has the value Aj
§ The outcomes of a test are mutually exclusive.
§ There is a separate branch in the tree for each possible outcome.
27
ID3: Stopping criterion
§ ID3 assumes that the attributes are adequate.
§ It splits the data in a recursive fashion until all the cases of a node belong to the same class.
§ The class of a leaf node is defined on the basis of the class of the cases in the node.
§ If the leaf is empty (there are no cases with some particular value of an attribute), the class is unknown (the leaf is labelled ‘null’)
28
Example: ID3 (1)
Playing tennis (Quinlan 86). Cases: Saturday mornings. Classes: P = positive (play), N = negative (don’t play).

No  Outlook   Temperature  Humidity  Windy  Class
1   sunny     hot          high      false  N
2   sunny     hot          high      true   N
3   overcast  hot          high      false  P
4   rain      mild         high      false  P
5   rain      cool         normal    false  P
6   rain      cool         normal    true   N
7   overcast  cool         normal    true   P
8   sunny     mild         high      false  N
9   sunny     cool         normal    false  P
10  rain      mild         normal    false  P
11  sunny     mild         normal    true   P
12  overcast  mild         high      true   P
13  overcast  hot          normal    false  P
14  rain      mild         high      true   N
29
Example: ID3 (2)
§ Class P (positive class): play tennis
§ 9 cases
§ Class N (negative class): don’t play tennis
§ 5 cases
§ The expected information needed to classify an arbitrary case in S is

H(C) = −(9/14) log2 (9/14) − (5/14) log2 (5/14) = 0.940
30
Example: ID3 (3)
§ The expected information required for each of the subtrees after using the attribute Outlook to split the set S into 3 subsets:

outlook   P  N  H(C|Aj)
sunny     2  3  0.971
overcast  4  0  0
rain      3  2  0.971

sunny:    H(C|A1) = −(2/5) log2 (2/5) − (3/5) log2 (3/5) = 0.971
overcast: H(C|A2) = −(4/4) log2 (4/4) − (0/4) log2 (0/4) = 0
rain:     H(C|A3) = −(3/5) log2 (3/5) − (2/5) log2 (2/5) = 0.971
31
Example: ID3 (4)
§ The expected information needed to classify an arbitrary case for the tree with the attribute Outlook as the root is

H(C|A) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694

§ The information gained by branching on the attribute Outlook (A) is

I(C|A) = H(C) − H(C|A) = 0.940 − 0.694 = 0.246
32
Example: ID3 (5)
§ The information gain for the other candidate attributes is calculated similarly
§ I(C|temperature) = 0.029
§ I(C|humidity) = 0.151
§ I(C|windy) = 0.048
§ The attribute resulting in the greatest information gain is chosen as the attribute for the root of the decision tree.
§ I(C|outlook) = 0.246
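The gains above can be recomputed from the full training data; a sketch (small differences in the third decimal, e.g. 0.152 vs 0.151, come from the slides rounding the intermediate H values):

```python
from math import log2
from collections import Counter

# The 14 Saturday-morning cases: (outlook, temperature, humidity, windy, class).
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    g = entropy(labels)
    for value in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == value]
        g -= len(subset) / len(data) * entropy(subset)
    return g

for name, col in [("outlook", 0), ("temperature", 1), ("humidity", 2), ("windy", 3)]:
    print(name, round(gain(col), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048
```
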
33
Example: ID3 (6)
§ The attribute Outlook has been chosen and the cases have been divided into subsets according to their values of the Outlook attribute.

sunny:
(1, sunny, hot, …, N)
(2, sunny, hot, …, N)
(8, sunny, mild, …, N)
(9, sunny, cool, …, P)
(11, sunny, mild, …, P)

overcast:
(3, overcast, hot, …, P)
(7, overcast, cool, …, P)
(12, overcast, mild, …, P)
(13, overcast, hot, …, P)

rain:
(4, rain, mild, …, P)
(5, rain, cool, …, P)
(6, rain, cool, …, N)
(10, rain, mild, …, P)
(14, rain, mild, …, N)
34
Example: ID3 (7)
§ The branch corresponding to the outcome sunny is built next.

Cases:
(1, sunny, hot, high, false, N)
(2, sunny, hot, high, true, N)
(8, sunny, mild, high, false, N)
(9, sunny, cool, normal, false, P)
(11, sunny, mild, normal, true, P)

§ Calculate the expected information…

H(C) = −(2/5) log2 (2/5) − (3/5) log2 (3/5) = 0.971

§ …and the information gain for all candidate attributes...
35
Example: ID3 (8)

Temperature  P  N  H(C|Aj)
hot          0  2  0
mild         1  1  1
cool         1  0  0

Humidity  P  N  H(C|Aj)
high      0  3  0
normal    2  0  0

Windy  P  N  H(C|Aj)
false  1  2  0.918
true   1  1  1

H(C|Temperature) = (2/5) × 0 + (2/5) × 1 + (1/5) × 0 = 0.400
H(C|Humidity) = (3/5) × 0 + (2/5) × 0 = 0
H(C|Windy) = (3/5) × 0.918 + (2/5) × 1 = 0.951

I(C|Temperature) = 0.971 − 0.400 = 0.571
I(C|Humidity) = 0.971 − 0 = 0.971
I(C|Windy) = 0.971 − 0.951 = 0.020
36
Example: ID3 (9)
§ Humidity is chosen
§ Cases are sent down to the high and normal branches
§ The cases in the high branch are all of the same class: a leaf node is formed
§ The same situation in the normal branch
§ The branches for overcast and rain are built in a similar way…

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: (still to be built)
outlook = rain: (still to be built)
37
Example: ID3 (10)
§ The complete decision tree and the classification of a new case
(Outlook: rain, Temperature: hot, Humidity: high, Windy: true) → Play tennis? N

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
:   windy = false: P
38
Real-world classification tasks
§ Real-world data can be mixed.
§ Attributes may have different scales (both qualitative and quantitative).
§ Data may contain
§ missing values
§ noise (erroneous values)
§ exceptional values or value combinations
§ ID3 does not address issues arising in real-world classification tasks.
§ Modifications to the original algorithm are needed.
39
C4.5
§ Descendant of the ID3 algorithm (Quinlan 93)
§ Upgrades:
§ Gain ratio attribute selection criterion
§ Tests for value groups and quantitative attributes
§ No requirement of fully adequate attributes
§ Probabilistic approach for handling missing values
§ Pruning
§ Pre-pruning and post-pruning
§ Converting trees to rules
40
C4.5 – Attribute selection criterion
§ The information gain criterion has a tendency to favour attributes with many outcomes.
§ However, such attributes may be less relevant in prediction than attributes having a smaller number of outcomes.
§ An extreme example is an attribute that is used as an identifier. Identifiers have unique values resulting in pure nodes, but they don’t have predictive power.
§ To overcome this problem, the gain ratio criterion has been developed.
41
C4.5 – Gain ratio selection criterion
§ The gain ratio is calculated as

GR(C|A) = I(C|A) / H(A),

§ where I(C|A) is the information gain got from testing the attribute A, and
§ H(A) is the expected information needed to sort out the value of the attribute A, i.e. the uncertainty relating to the value of the attribute A:

H(A) = − Σ_{j=1..v} p(Aj) log2 p(Aj)

§ where p(Aj) is the probability of the value Aj (the relative frequency of the value Aj)
42
C4.5 – Gain ratio selection criterion
§ The gain ratio criterion selects the attribute
§ having the highest gain ratio
§ among those attributes whose information gain is at least the average information gain over all the attributes examined.
§ The information gain of the attribute has to be large.
43
C4.5 – Gain ratio selection criterion
§ Let’s calculate the gain ratio for the Outlook attribute of the Tennis example. The information gain I(C|A) for the attribute Outlook is 0.246. Calculate the expected information for the Outlook attribute:

outlook   freq
sunny     5
overcast  4
rain      5

H(A) = −(5/14) × log2 (5/14) − (4/14) × log2 (4/14) − (5/14) × log2 (5/14) = 1.577

§ The gain ratio for the attribute Outlook is

GR(C|A) = I(C|A) / H(A) = 0.246 / 1.577 = 0.156
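A quick numerical check of the calculation (a sketch; the entropy helper from the earlier slides doubles as the split information H(A) when given value frequencies):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

# Split information H(A) for Outlook: 5 sunny, 4 overcast and 5 rain cases.
split_info = entropy([5, 4, 5])
print(round(split_info, 3))          # 1.577

# Gain ratio, using the information gain I(C|outlook) = 0.246 from the slides.
print(round(0.246 / split_info, 3))  # 0.156
```
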
44
C4.5 – Test types
§ One branch for each possible attribute value
  Outlook: sunny | overcast | rain
§ Value groups
  Outlook: {sunny, overcast} | {rain}
§ Thresholds for quantitative attributes
  Humidity: <= 75 | > 75
45
C4.5 – Value groups
§ Tests based on qualitative attributes can take the form
outlook in {sunny, overcast}
outlook = rain
§ Why value groups?
§ To avoid too small subsets of cases
§ Useful patterns may become undetectable because of the scarcity of data
§ To assess equitably qualitative attributes that vary in their numbers of possible values
§ The gain ratio criterion is biased to prefer attributes having a small number of possible values
46
C4.5 – Value groups
§ Appropriate value groups can be determined on the basis of domain knowledge.
§ For each appropriate grouping, an additional attribute is formed in the preprocessing phase.
§ This approach is economical from a computational viewpoint.
§ Problem: The appropriateness of a grouping may depend on the context (the part of the tree). A “constant” grouping may be too crude.
47
C4.5 – Value groups
§ In C4.5, values are merged into groups in an iterative manner.
§ A greedy method
§ At first, each value forms its own group.
§ Then, all possible pairs of groups are formed.
§ A grouping yielding the highest gain ratio is chosen.
§ The process continues until just two value groups remain, or until no such merger would result in a better division of the training data.
§ Aims to find the grouping which results in the highest gain ratio.
§ Example on the next slide:
§ Michalski’s Soybean data
§ 35 attributes, 19 classes, 683 training cases
§ Attribute stem canker with four values: none, below soil, above soil, above 2nd node
48
C4.5 – Value groups
1) Partition into four one-value groups
2) Two one-value groups are merged
Based on the results of step 2, “above soil” and “above 2nd node” are merged
3) No merger at step 3 improves the situation – the process stops.
Final groups: {none}, {below soil}, {above soil, above 2nd node}
49
C4.5 – Value groups
§ From the overall viewpoint, the aim is to get simpler and more accurate trees.
§ The advantage of value groupings depends on the application domain.
§ The search for value groups can require a substantial increase in computation.
50
C4.5 – Quantitative attributes
§ Tests based on quantitative attributes employ thresholds.
§ The value of the attribute A is compared to some threshold Z.
§ A ≤ Z, A > Z
§ The threshold is defined dynamically.
§ Cases are first sorted on the values of the attribute A being considered.
§ A1, A2, …, Aw
§ The midpoint of adjacent values Ak and Ak+1,

(Ak + Ak+1) / 2,

is a possible threshold Z that divides the cases of the training set S into two subsets.
51
C4.5 – Quantitative attributes
§ There are w − 1 candidate thresholds.
§ The best threshold is the one that results in the largest gain ratio.
§ The largest value of the attribute A in the training set that does not exceed the best midpoint is chosen as the threshold.
§ All the threshold values appearing in the tree actually occur in the training data.
§ After finding the threshold, the quantitative attribute can be compared to qualitative and to other quantitative attributes in the usual way.
52
C4.5 – Quantitative attributes
§ Finding the threshold value Z dynamically during tree construction:

A   Class
32  P
46  N
52  P
58  P

    A ≤ Z   A > Z
Z   P  N    P  N
39  1  0    2  1
49  1  1    2  0
55  2  1    1  0

§ Cases are first sorted on the values of the attribute A.
§ Midpoints of successive values are possible thresholds.
§ The gain ratio is calculated for each candidate threshold.
§ The best candidate is the one resulting in the highest gain ratio.
§ Choose as the threshold the biggest value of A in the training set that does not exceed the best candidate (midpoint).
§ The candidate threshold 49 yields the highest gain ratio, and thus 46 is chosen as the threshold.
§ A ≤ 46, A > 46
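The procedure can be sketched in Python with the slide's four cases; candidates are scored with the gain ratio as described above (all names are illustrative):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(labels.count(c) / n) * log2(labels.count(c) / n)
               for c in set(labels))

def best_threshold(cases):
    """cases: (value, class) pairs; a sketch of the slide's procedure.
    Candidates are midpoints of successive sorted values; each candidate
    is scored with the gain ratio, and the returned threshold is the
    largest value occurring in the data not exceeding the best midpoint."""
    cases = sorted(cases)
    values = [v for v, _ in cases]
    labels = [c for _, c in cases]
    best_score, best_z = -1.0, None
    for a, b in zip(values, values[1:]):
        if a == b:
            continue
        z = (a + b) / 2                          # candidate midpoint
        left = [c for v, c in cases if v <= z]
        right = [c for v, c in cases if v > z]
        gain = (entropy(labels)
                - len(left) / len(cases) * entropy(left)
                - len(right) / len(cases) * entropy(right))
        split_info = entropy(["le"] * len(left) + ["gt"] * len(right))
        score = gain / split_info
        if score > best_score:
            best_score, best_z = score, z
    threshold = max(v for v in values if v <= best_z)
    return best_z, threshold

print(best_threshold([(32, "P"), (46, "N"), (52, "P"), (58, "P")]))  # (49.0, 46)
```
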
53
C4.5 – Quantitative attributes
Tennis playing (Quinlan 86); Humidity has been measured on a quantitative scale

No  Outlook   Temperature  Humidity  Windy  Class
1   sunny     hot          85        false  N
2   sunny     hot          90        true   N
3   overcast  hot          78        false  P
4   rain      mild         96        false  P
5   rain      cool         80        false  P
6   rain      cool         70        true   N
7   overcast  cool         65        true   P
8   sunny     mild         95        false  N
9   sunny     cool         70        false  P
10  rain      mild         80        false  P
11  sunny     mild         70        true   P
12  overcast  mild         90        true   P
13  overcast  hot          75        false  P
14  rain      mild         96        true   N
54
C4.5 – Quantitative attributes
§ An example of a decision tree built from the Tennis data in which the attribute humidity has been measured on a quantitative scale.

outlook = overcast: P
outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = rain:
:...windy = true: N
    windy = false: P

outlook = overcast: P
outlook = sunny:
:...humidity <= 75: P
:   humidity > 75: N
outlook = rain:
:...windy = true: N
    windy = false: P
55
C4.5 – Ordinal attributes
§ Ordinal attributes can be handled either in the same way as nominal attributes or in the same way as quantitative attributes.
§ Processing of quantitative attributes is based on the ordering of values. Values of ordinal attributes have a natural order, and thus the approach employed for quantitative attributes can be utilised with ordinal attributes, too.
56
C4.5 – Stopping criterion
§ Stopping criteria
§ All the cases in a node belong to the same class
§ There are no cases in a node
§ None of the attributes improves the situation in a node
§ The number of cases in a node is too small for continuing the splitting process:
§ Every test must have at least two outcomes having the minimum number of cases.
– The default value for the number of cases is 2.
57
C4.5 – Leaves
§ A leaf can contain
§ cases all belonging to a single class Cj:
§ The class Cj is associated with the leaf.
§ no cases:
§ The most frequent class (the majority class) at the parent of the leaf is associated with the leaf.
§ cases belonging to a mixture of classes:
§ The most frequent class (the majority class) at the leaf is associated with the leaf.
58
C4.5 – Missing values
§ Real-world data often have missing attribute values.
§ Missing values may be e.g. filled in (imputed) with
§ the mode, median or mean of the complete cases of a class
§ estimates given by some “more intelligent” method
before running the decision tree program. However, imputation is not unproblematic.
§ Algorithms can be amended to cope with missing values
§ in the tree construction
§ selecting tests
§ sending cases to subtrees
§ when the tree is used in prediction
§ submitting cases to subtrees
59
C4.5 – Missing values
§ Missing values are taken into account when calculating the information gain:

I(C|A) = p(Aknown) × (H(C) − H(C|A)) + p(Aunknown) × 0
       = p(Aknown) × (H(C) − H(C|A))

§ where p(Aknown) is the probability that the value of the attribute A is known (i.e. the relative frequency of those cases for which the value of the attribute A is known)
§ and when calculating the expected information H(A) needed to test the value of the attribute A:
§ Let an attribute A have the values A1, A2, …, Av.
§ Missing values are now treated as a value of their own, the value v+1.

H(A) = − Σ_{j=1..v+1} p(Aj) log2 p(Aj)
60
C4.5 – Missing values
§ Let us assume that the Tennis example has one missing value…

No  Outlook   Temperature  Humidity  Windy  Class
1   sunny     hot          85        false  N
2   sunny     hot          90        true   N
3   overcast  hot          78        false  P
4   rain      mild         96        false  P
5   rain      cool         80        false  P
6   rain      cool         70        true   N
7   overcast  cool         65        true   P
8   sunny     mild         95        false  N
9   sunny     cool         70        false  P
10  rain      mild         80        false  P
11  sunny     mild         70        true   P
12  ?         mild         90        true   P   ← missing value
13  overcast  hot          75        false  P
14  rain      mild         96        true   N
61
C4.5 – Missing values
§ The information gain for the Outlook attribute is calculated on the basis of the 13 cases having a known value.

outlook   P  N  H(C|Aj)
sunny     2  3  0.971
overcast  3  0  0
rain      3  2  0.971

H(C) = −(8/13) × log2 (8/13) − (5/13) × log2 (5/13) = 0.961
H(C|A) = (5/13) × 0.971 + (3/13) × 0 + (5/13) × 0.971 = 0.747
I(C|A) = (13/14) × (0.961 − 0.747) = 0.199
62
C4.5 – Missing values
§ The expected information needed to test the value of the Outlook attribute is calculated with the missing value treated as a value of its own (sunny 5, overcast 3, rain 5, ? unknown 1):

H(A) = −(5/14) × log2 (5/14) − (3/14) × log2 (3/14) − (5/14) × log2 (5/14) − (1/14) × log2 (1/14) = 1.809

§ The gain ratio for the Outlook attribute is

GR(C|A) = I(C|A) / H(A) = 0.199 / 1.809 = 0.110
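Both calculations on this and the previous slide can be checked with a short sketch (names are illustrative):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

# Outlook with one missing value: (P, N) counts over the 13 known cases.
split = [(2, 3), (3, 0), (3, 2)]                  # sunny, overcast, rain
known = sum(sum(s) for s in split)                # 13 of the 14 cases
h_c = entropy([sum(col) for col in zip(*split)])  # H(C) over known cases
h_c_a = sum(sum(s) / known * entropy(s) for s in split)
gain = known / 14 * (h_c - h_c_a)                 # weighted by p(A known)
print(f"{gain:.3f}")                              # 0.199

# H(A) treats 'missing' as an extra value: sunny 5, overcast 3, rain 5, ? 1.
split_info = entropy([5, 3, 5, 1])
print(f"{split_info:.3f}")                        # 1.809
print(f"{gain / split_info:.3f}")                 # 0.110
```
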
63
C4.5 – Missing values
§ When cases are sent to subtrees, a weight is given to each case.
§ If the tested attribute value is known, the case is sent to the branch corresponding to the outcome Oi with the weight w = 1.
§ Otherwise, a fraction of the case is sent to each branch Oi with the weight w = p(Oi).
§ p(Oi) is the probability (the relative frequency) of the outcome Oi in the current node.
§ The case is divided between the possible outcomes {O1, O2, …, Ov} of the test.

The 13 cases with a known value for the Outlook attribute are sent to the corresponding sunny, overcast or rain branches with the weight w = 1. Case 12 is divided between the branches:
sunny: case 12 with w = 5/13; overcast: case 12 with w = 3/13; rain: case 12 with w = 5/13
64
C4.5 – Missing values
§ Cases in the sunny branch:

Case no  Outlook  Temperature  Humidity  Windy  Class  Weight
1        sunny    hot          85        FALSE  N      1
2        sunny    hot          90        TRUE   N      1
8        sunny    mild         95        FALSE  N      1
9        sunny    cool         70        FALSE  P      1
11       sunny    mild         70        TRUE   P      1
12       ?        mild         90        TRUE   P      5/13 = 0.4

§ The number of cases in a node is now interpreted as the sum of the weights of the (fractional) cases in the node.
§ There may be whole cases and fractional cases in a node.
§ A case came to a node with the weight w. It is sent to the node(s) of the next level with the weight
§ w’ = w × 1 (the value of the attribute of the current node is known)
§ w’ = w × p(Oi) (the value of the attribute of the current node is unknown)
65
C4.5 – Missing values
§ Cases in the sunny branch:

Case no  Outlook  Temperature  Humidity  Windy  Class  Weight
1        sunny    hot          85        FALSE  N      1
2        sunny    hot          90        TRUE   N      1
8        sunny    mild         95        FALSE  N      1
9        sunny    cool         70        FALSE  P      1
11       sunny    mild         70        TRUE   P      1
12       ?        mild         90        TRUE   P      5/13 = 0.4

§ Let us assume that this subset is partitioned further by the test on humidity.
§ The branch “humidity <= 75” has cases from the single class P.
§ The branch “humidity > 75” has cases from both classes (class P 0.4/3.4 and class N 3/3.4).
§ Since no test improves the situation further, a leaf is made (the most frequent class in the node gives the class label).
66
C4.5 – Missing values
§ A decision tree constructed from the data having a missing value:

outlook = overcast: P (3.2)
outlook = sunny:
:...humidity <= 75: P (2)
:   humidity > 75: N (3.4/0.4)
outlook = rain:
:...windy = true: N (2.4/0.4)
    windy = false: P (3)

§ The tree is like the tree constructed from the original data, but now some leaves have a marking (N/E)
§ N is the sum of the fractional cases belonging to the leaf
§ E is the sum of those cases misclassified by the leaf (i.e. the sum of the fractional cases belonging to classes other than the one suggested by the leaf)
§ The majority class gives the class label of the node.
§ The majority class = the biggest class in the node
67
C4.5 – Missing values
§ Classification of a new case
§ If the new case has a missing value for the attribute tested in the current node, the case is divided between the outcomes of the test.
§ Now the case has multiple classification paths from the root to the leaves, and therefore the “classification” is a class distribution.
§ The majority class is the predicted class.
68
C4.5 – Missing values
§ A case having a missing value is classified
§ Outlook: sunny, temperature: mild, humidity: ?, windy: false

outlook = overcast: P (3.2)
outlook = sunny:
:...humidity <= 75: P (2)
:   humidity > 75: N (3.4/0.4)
outlook = rain:
:...windy = true: N (2.4/0.4)
    windy = false: P (3)

§ If the humidity were less than or equal to 75, the class for the case would be P.
§ If the humidity were greater than 75, the class for the case would be N with the probability of 3/3.4 (88%) and P with the probability of 0.4/3.4 (12%).
§ The results from the two humidity branches are summed for the final class distribution
§ class P: (2.0/5.4) × 100% + (3.4/5.4) × 12% = 44%
§ 2 of the 5.4 training cases belonged to the “humidity <= 75” branch, and in this branch the probability of the class P is 100%
§ 3.4 of the 5.4 training cases belonged to the “humidity > 75” branch, and in this branch the probability of the class P is 12%
§ class N: (3.4/5.4) × 88% = 56%
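The summation can be checked numerically (a sketch with illustrative names):

```python
# The sunny subtree's two humidity leaves, with the summed training-case
# weights from the slide: total weight and the weight of class P in each.
branches = [
    {"total": 2.0, "P": 2.0},   # humidity <= 75: pure P leaf
    {"total": 3.4, "P": 0.4},   # humidity > 75:  mostly N leaf
]
total = sum(b["total"] for b in branches)   # 5.4
p_P = sum(b["total"] / total * (b["P"] / b["total"]) for b in branches)
print(f"P: {p_P:.0%}  N: {1 - p_P:.0%}")    # P: 44%  N: 56%
```
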
69
Underfitting and overfitting
§ Underfitting: when the model is too simple, both the training and test errors are large.
§ Overfitting: the test error rate starts to increase while the training error rate continues to decrease.
70
Overfitting
§ The built decision tree may overfit the training data.
§ The tree is complex. Its lowest branches reflect noise and outliers occurring in the training data.
§ Lower classification accuracy on unseen cases
§ Reasons for overfitting
§ Noise and outliers
§ Inadequate attributes
§ Too small training data
§ A local maximum in the greedy search
71
Pruning
§ Overfitting can be overcome by pruning.
§ Pruning generally results in
§ faster classification
§ better classification accuracy on unseen cases
§ Pruning decreases the accuracy on the training data.
§ Pre-pruning
§ Stop the tree construction early.
§ Post-pruning
§ Let the tree grow “full” and remove branches from the “fully grown” tree.
§ In a combined approach, both pre- and post-pruning are used.
72
Pruning
(a) The branch marked with a star (*) may be partly based on erroneous or exceptional cases.
(b) The tree growth has been stopped. (pre-pruning)
(c) The tree has grown “full” (the tree “a”), after which it has been pruned. (post-pruning)
73
Pruning
§ The tree growth can be limited in many ways.
§ Define a minimum for the number of cases in a node.
§ If the number of cases in a node is below the minimum, the recursive division of the example set is stopped and a leaf is formed.
§ The leaf is labelled with the majority class or the class distribution.
§ Define a threshold for the attribute selection criterion.
§ The problem: the definition of a suitable threshold
§ too high a threshold: oversimplification: useful attributes are discarded
§ too low a threshold: no simplification at all (or little simplification)
74
Post-pruning
§ Usually it is more profitable to let the tree grow complete and prune it afterwards than to halt the tree growth.
§ If the tree growth is halted, all the branches growing from a node are lost.
§ Post-pruning allows saving some of the branches.
§ Post-pruning requires more calculation than pre-pruning, but it usually results in more reliable trees.
§ In post-pruning, parts of the tree whose removal does not decrease the classification accuracy on unseen cases are discarded.
75
Post-pruning
§ Post-pruning is based on the classification errors made by the tree.
§ The error rate of a leaf is E/N
§ N is the number of training cases belonging to the leaf
§ E is the number of cases that do not belong to the class suggested by the leaf
§ The error rate of the whole tree: E and N are summed over all the leaves
§ A predicted error rate: the error rate on new cases
76
Post-pruning
§ The basic idea of post-pruning:
§ Start from the bottom of the tree and examine each subtree that is not a leaf.
§ If replacement of the subtree with a leaf (or with its most frequently used branch) would reduce the predicted error rate, then prune the tree accordingly.
§ When the error rate of any of the subtrees is reduced, the error rate of the whole tree is also reduced.
§ There can be cases from several classes in a leaf, and thus the leaf is labelled with the majority class.
§ The error rate can be predicted by using the training set or a new set of cases.
§ Not a topic of this course
77
C4.5 Pruning
§ Prepruning
§ Every test must have at least two outcomes that contain at least the minimum number of cases.
§ Because of missing values, the minimum number of cases is actually the minimum for the summed weights of the cases.
§ The default minimum number of cases is 2.
§ Postpruning
§ A “very” pessimistic method based on estimated error rates
§ How the very pessimistic estimates are calculated is not a topic of this course; however, the idea of the pruning is presented on the next slides.
78
C4.5 postpruning
§ Example: original, complete tree (Quinlan 93)
§ Congressional voting data, UCI Machine Learning Repository
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)
physician fee freeze = y:
:...synfuels corporation cutback = n: republican (97.0/3.0)
:   synfuels corporation cutback = u: republican (4.0)
:   synfuels corporation cutback = y:
:   :...duty free exports = y: democrat (2.0)
:       duty free exports = u: republican (1.0)
:       duty free exports = n:
:       :...education spending = n: democrat (5.0/2.0)
:           education spending = y: republican (13.0/2.0)
:           education spending = u: democrat (1.0)
physician fee freeze = u:
:...water project cost sharing = n: democrat (0.0)
:   water project cost sharing = y: democrat (4.0)
:   water project cost sharing = u:
:   :...mx missile = n: republican (0.0)
:       mx missile = y: democrat (3.0/1.0)
:       mx missile = u: republican (2.0)
79
C4.5 postpruning
§ Pruned tree
§ The original tree had 17 leaves; the pruned one has 5 leaves.
physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = y: republican (123.0/13.9)
physician fee freeze = u:
:...mx missile = n: democrat (3.0/1.1)
    mx missile = y: democrat (4.0/2.2)
    mx missile = u: republican (2.0/1.0)
Subtrees have been replaced with leaves
The subtree has been replaced with its most frequently used branch
123 training cases in the leaf
If 123 new cases were classified, 13.9 cases would be misclassified (a very pessimistic estimate)
80
C4.5 postpruning
physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)
The subtree has been replaced with a leaf
168 training cases in the leaf. One of them is misclassified by the leaf.
If 168 new cases were classified, 2.6 cases would be misclassified (a very pessimistic estimate)
81
C4.5 postpruning
: adoption of the budget resolution = n: democrat (16.0/2.512)
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)
First, the subtree has been replaced with a leaf
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)
One training case is misclassified. The very pessimistic estimate: the number of predicted errors is 2.512
The very pessimistic estimate: the sum of predicted errors is 3.273
82
C4.5 postpruning
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)
Then, the subtree has been replaced with a leaf
physician fee freeze = n: democrat (168.0/2.6)
The very pessimistic estimate: the sum of predicted errors is 4.642
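Both replacements on these slides reduce to the same comparison: a subtree is collapsed into a leaf when the leaf's pessimistic error estimate does not exceed the sum over the subtree's leaves. A minimal sketch using the figures from the slides (the function name is illustrative):

```python
def replace_subtree(subtree_error_sum, leaf_errors):
    """C4.5 collapses a subtree into a leaf when the leaf's pessimistic
    predicted errors do not exceed the sum over the subtree's leaves."""
    return leaf_errors <= subtree_error_sum

# 'adoption of the budget resolution = n' subtree: leaf 2.512 vs subtree 3.273
print(replace_subtree(3.273, 2.512))  # True: pruned to democrat (16.0/2.512)
# 'physician fee freeze = n' subtree: leaf 2.6 vs subtree sum 4.642
print(replace_subtree(4.642, 2.6))    # True: pruned to democrat (168.0/2.6)
```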
83
C4.5 postpruning
§ Interpretation of the numbers (N/E) in a pruned tree
§ N is the number of training cases in the leaf
§ E is the number of predicted errors if a set of N unseen cases were classified by the tree
§ The sum of the predicted errors over the leaves, divided by the size of the training set (the number of training cases), provides an immediate estimate of the error rate of the pruned tree on new cases.
§ 20.8/300 = 0.069 (6.9%) (The pruned tree will misclassify 6.9% of new cases.)
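The 20.8 is simply the sum of the predicted-error figures in the five leaves of the pruned tree shown earlier (2.6, 13.9, 1.1, 2.2 and 1.0), and the 300 is the sum of their training-case counts:

```python
# (N, E) pairs read off the pruned Congressional voting tree
pruned_leaves = [(168.0, 2.6), (123.0, 13.9), (3.0, 1.1), (4.0, 2.2), (2.0, 1.0)]

total_cases = sum(n for n, _ in pruned_leaves)      # 300.0
total_predicted = sum(e for _, e in pruned_leaves)  # 20.8
print(round(total_predicted / total_cases, 3))      # 0.069, i.e. 6.9%
```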
84
C4.5 postpruning
§ Results for the Congressional voting data
Training set (300 cases)
  Complete tree: 25 nodes, 8 errors (2.7%)
  Pruned tree:    7 nodes, 13 errors (4.3%)

Test set (135 cases)
  Complete tree: 25 nodes, 7 errors (5.2%)
  Pruned tree:    7 nodes, 4 errors (3.0%)
§ 10-fold cross-validation gives an error rate of 5.3% on new cases (the average predicted, very pessimistic error rate on new cases is 5.6%)
85
DTI pros
§ Construction of a tree does not (necessarily) require any parameter setting
§ Can handle high-dimensional data
§ Can handle heterogeneous data
§ Nonparametric approach
§ Representation form is intuitive, relatively easy to interpret
86
DTI pros
§ Learning and classification steps are simple and fast
§ Learning: the complexity depends on the number of nodes, cases and attributes
§ In each node: O(n′p); for quantitative attributes O(p n′ log n′)
– n′ = number of cases in the node
– p = number of attributes
§ Classification: O(w), where w is the maximum depth of the tree
§ An ”eager” method: training is computationally more expensive than classification
§ Quite robust to the presence of noise
§ In general, good classification accuracy, comparable with other classification methods
87
DTI other issues
§ Decision tree algorithms divide the training data into smaller and smaller subsets in a recursive fashion.
§ Problems§ Data fragmentation§ Repetition§ Replication
§ Data fragmentation
§ The number of instances at the leaf nodes can be too small to make any statistically significant decision
88
DTI other issues
§ Repetition
§ An attribute is repeatedly tested along some branch of the decision tree
§ Replication
§ A decision tree contains duplicate subtrees
[Figure: example trees built from tests on P, Q, R and S; the subtree testing S appears in more than one branch (replication)]
89
DTI other issues: decision boundary
• The border line between two neighbouring regions of different classes is known as the decision boundary.
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
90
DTI other issues: multivariate split
§ Multivariate splits based on a combination of attributes
§ More expressive representation
§ The use of multivariate splits can prevent the problems of fragmentation, repetition and replication.
§ Finding the optimal test condition is computationally expensive.
[Figure: a two-class scatter plot (Class = + and Class = −) separated by the oblique decision boundary x + y < 1]
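The contrast between the two kinds of test can be sketched directly: a univariate test cuts perpendicular to one axis, while a multivariate test such as x + y < 1 cuts obliquely. The threshold 0.5 in the univariate test is an illustrative assumption; only x + y < 1 comes from the slide.

```python
def univariate_test(point):
    """Axis-parallel split: the condition involves a single attribute."""
    return point["x"] < 0.5  # illustrative threshold

def multivariate_test(point):
    """Oblique split: the condition combines attributes, as in x + y < 1."""
    return point["x"] + point["y"] < 1

p = {"x": 0.3, "y": 0.4}
print(univariate_test(p), multivariate_test(p))  # True True
```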
91
DTI other issues
§ Decision tree induction is a widely studied topic; different kinds of enhancements to the basic algorithm have been developed.
§ challenges arising from real-world data: quantitative attributes, missing values, noise, outliers
§ multivariate decision trees
§ incremental decision tree induction
§ updatable decision trees
§ scalable decision tree induction
§ Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
92
DTI other issues
§ C4.5 is a kind of reference algorithm used in machine learning research.
§ In this course we will use See5, a descendant of C4.5.
§ A demonstration version of See5 is freely available from
§ http://www.rulequest.com/download.html
§ The source code of C4.5 is freely available for research and teaching from
§ http://www.rulequest.com/Personal/c4.5r8.tar.gz
§ written in C
93
References
§ These slides are partly based on the slides of the books:
§ Han J, Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. http://wwwsal.cs.uiuc.edu/~hanj/bk2/
§ Tan PN, Steinbach M, Kumar V. Introduction to Data Mining. Addison-Wesley, 2006. http://wwwusers.cs.umn.edu/~kumar/dmbook/
§ Hand D, Mannila H, Smyth P. Principles of Data Mining. MIT Press, 2001.
§ Mitchell TM. Machine Learning. McGraw-Hill, 1997.
§ Quinlan JR. Induction of decision trees. Machine Learning 1: 81–106, 1986.
§ Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
§ Quinlan JR. See5. http://www.rulequest.com