Machine Learning
Decision Trees
E. Keogh, UC Riverside
Decision Tree Classifier
Ross Quinlan

[Figure: insects plotted by Abdomen Length (x-axis, 1-10) versus Antenna Length (y-axis, 1-10).]

Abdomen Length > 7.1?
  yes → Katydid
  no  → Antenna Length > 6.0?
          yes → Katydid
          no  → Grasshopper
[Figure: a dichotomous key distinguishing Grasshopper, Cricket, Katydids, and Camel Cricket via yes/no questions such as "Antennae shorter than body?", "Foretiba has ears?", and "3 Tarsi?".]

Decision trees predate computers.
Decision Tree Classification

• Decision tree – a flow-chart-like tree structure
  – Internal nodes denote a test on an attribute
  – Branches represent an outcome of the test
  – Leaf nodes represent class labels or class distributions
• Decision tree generation consists of two phases
  – Tree construction
    • At the start, all the training examples are at the root
    • Partition examples recursively based on selected attributes
  – Tree pruning
    • Identify and remove branches that reflect noise or outliers
• Use of a decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree
How do we construct the decision tree?

• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes can be discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
  – There are no samples left
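As a rough sketch (my own illustrative Python, not code from the slides), the greedy top-down procedure with these stopping conditions might look like this for categorical attributes:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    # Stopping conditions from the slide: pure node, or no attributes left
    if len(set(labels)) == 1:
        return labels[0]                              # all samples in one class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority vote

    def info_gain(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - weighted

    best = max(attributes, key=info_gain)             # greedy heuristic choice
    remaining = [a for a in attributes if a != best]
    partition = {}
    for row, label in zip(rows, labels):
        partition.setdefault(row[best], []).append((row, label))
    return {
        "attribute": best,
        "branches": {
            value: build_tree([r for r, _ in part], [y for _, y in part], remaining)
            for value, part in partition.items()
        },
    }
```

Each recursive call partitions the examples on the best-scoring attribute and recurses on the resulting subsets, exactly as described above.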
Information Gain as a Splitting Criterion

• Select the attribute with the highest information gain (information gain is the expected reduction in entropy).
• Assume there are two classes, P and N
  – Let the set of examples S contain p elements of class P and n elements of class N
  – The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

  E(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

(0 log2(0) is defined as 0.)
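The definition translates directly into a few lines of Python (an illustrative helper, with the 0 log2(0) = 0 convention handled explicitly):

```python
from math import log2

def entropy_pn(p, n):
    """E(S) for p elements of class P and n of class N; 0*log2(0) treated as 0."""
    total = p + n
    e = 0.0
    for count in (p, n):
        if count:                 # skip empty classes: 0 * log2(0) := 0
            frac = count / total
            e -= frac * log2(frac)
    return e
```

For example, a pure node gives `entropy_pn(0, 4) == 0.0`, and an evenly mixed node gives `entropy_pn(3, 3) == 1.0`.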
Information Gain in Decision Tree Induction

• Assume that using attribute A, the current set will be partitioned into some number of child sets.
• The encoding information that would be gained by branching on A:

  Gain(A) = E(current set) - Σ over child sets of (|child| / |current set|) · E(child)

Note: entropy is at its minimum (zero) when the collection of objects is completely uniform, i.e., all examples belong to one class.
| Person | Hair Length | Weight | Age | Class |
|--------|-------------|--------|-----|-------|
| Homer  | 0”          | 250    | 36  | M     |
| Marge  | 10”         | 150    | 34  | F     |
| Bart   | 2”          | 90     | 10  | M     |
| Lisa   | 6”          | 78     | 8   | F     |
| Maggie | 4”          | 20     | 1   | F     |
| Abe    | 1”          | 170    | 70  | M     |
| Selma  | 8”          | 160    | 41  | F     |
| Otto   | 10”         | 180    | 38  | M     |
| Krusty | 6”          | 200    | 45  | M     |
| Comic  | 8”          | 290    | 38  | ?     |
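For experimenting along with the slides, the table can be loaded as plain Python data (the tuple layout is my own choice, not from the slides; hair length is in inches, and Comic's class is unknown):

```python
from collections import Counter

# (name, hair length, weight, age, class) for the nine labeled people
people = [
    ("Homer",  0,  250, 36, "M"),
    ("Marge",  10, 150, 34, "F"),
    ("Bart",   2,  90,  10, "M"),
    ("Lisa",   6,  78,  8,  "F"),
    ("Maggie", 4,  20,  1,  "F"),
    ("Abe",    1,  170, 70, "M"),
    ("Selma",  8,  160, 41, "F"),
    ("Otto",   10, 180, 38, "M"),
    ("Krusty", 6,  200, 45, "M"),
]
unknown = ("Comic", 8, 290, 38, None)   # the row we want to classify

# Class balance of the labeled rows: 5 males, 4 females
counts = Counter(cls for *_, cls in people)
```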
How to Choose the Most Descriptive Rule?
Entropy

• Entropy (disorder, impurity) of a set of examples S, relative to a binary classification, is:

  Entropy(S) = -p1 log2(p1) - p0 log2(p0)

  where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.
• If all examples are in one category, entropy is zero (we define 0 log2(0) = 0).
• If examples are equally mixed (p1 = p0 = 0.5), entropy is at its maximum of 1.
• Entropy can be viewed as the number of bits required on average to encode the class of an example in S, where data compression (e.g., Huffman coding) is used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:

  Entropy(S) = -Σ_{i=1}^{c} p_i log2(p_i)
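The multi-class generalization is a one-liner in Python (an illustrative helper taking the per-class counts):

```python
from math import log2

def entropy_multi(counts):
    """-sum p_i log2(p_i) over c classes, given a list of per-class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)
```

A 4-way uniform split costs `entropy_multi([1, 1, 1, 1]) == 2.0` bits, matching the coding interpretation above.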
[Figure: Entropy plot for binary classification.]
Information Gain

• The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:

  Gain(S, F) = Entropy(S) - Σ_{v ∈ Values(F)} (|S_v| / |S|) Entropy(S_v)

  where S_v is the subset of S having value v for feature F.
• The entropy of each resulting subset is weighted by its relative size.
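In code (a sketch; here `subsets` stands for the partition of the parent's labels induced by the values of F):

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(parent_labels, subsets):
    """Gain(S, F): parent entropy minus the size-weighted entropy of the split."""
    total = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(s) / total * entropy(s) for s in subsets
    )
```

A perfect split of an evenly mixed parent yields the full bit of gain: `gain(["P", "P", "N", "N"], [["P", "P"], ["N", "N"]]) == 1.0`.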
Let us try splitting on Hair Length.

Hair Length <= 5?

Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
yes: Entropy(1F, 3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
no:  Entropy(3F, 2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710

Gain(Hair Length <= 5) = 0.9911 - (4/9 × 0.8113 + 5/9 × 0.9710) = 0.0911
Let us try splitting on Weight.

Weight <= 160?

Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
yes: Entropy(4F, 1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
no:  Entropy(0F, 4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0

Gain(Weight <= 160) = 0.9911 - (5/9 × 0.7219 + 4/9 × 0) = 0.5900
Let us try splitting on Age.

Age <= 40?

Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
yes: Entropy(3F, 3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
no:  Entropy(1F, 2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183

Gain(Age <= 40) = 0.9911 - (6/9 × 1 + 3/9 × 0.9183) = 0.0183
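The three candidate splits can be checked against the hand calculations with a few lines of Python (the (F, M) branch counts below are read off the table of nine labeled people):

```python
from math import log2

def entropy(pos, neg):
    e = 0.0
    for c in (pos, neg):
        if c:
            f = c / (pos + neg)
            e -= f * log2(f)
    return e

parent = entropy(4, 5)                       # 4 females, 5 males: 0.9911

# (F, M) counts in the yes / no branches of each candidate test
splits = {
    "Hair Length <= 5": [(1, 3), (3, 2)],
    "Weight <= 160":    [(4, 1), (0, 4)],
    "Age <= 40":        [(3, 3), (1, 2)],
}

gains = {}
for test, branches in splits.items():
    weighted = sum((f + m) / 9 * entropy(f, m) for f, m in branches)
    gains[test] = parent - weighted

best = max(gains, key=gains.get)             # the greedy choice of split
```

Running this reproduces the slide values (gains of about 0.0911, 0.5900, and 0.0183), so the greedy algorithm picks Weight <= 160.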
Weight <= 160?
  no  → Male
  yes → Hair Length <= 2?

Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the people under 160 are not perfectly classified… RECURSION!

This time we find that we can split on Hair Length, and we are done!
We don’t need to keep the data around, just the test conditions.

Weight <= 160?
  no  → Male
  yes → Hair Length <= 2?
          yes → Male
          no  → Female

How would these people be classified?
It is trivial to convert Decision Trees to rules…

Weight <= 160?
  no  → Male
  yes → Hair Length <= 2?
          yes → Male
          no  → Female

Rules to Classify Males/Females:

If Weight greater than 160, classify as Male
Else if Hair Length less than or equal to 2, classify as Male
Else classify as Female
Once we have learned the decision tree, we don’t even need a computer!

[Figure: decision tree for a typical shared-care setting, applying the system for the diagnosis of prostatic obstructions.]

This decision tree is attached to a medical machine, and is designed to help nurses make decisions about what type of doctor to call.
[Figure: a tiny two-leaf tree splitting on “Wears green?” (Yes/No) that perfectly separates one Male from one Female example.]

The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data…

When you have few data points, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.

For example, the rule “Wears green?” perfectly classifies the data, but so does “Mother’s name is Jacqueline?”, and so does “Has blue shoes?”…
Avoid Overfitting in Classification

• The generated tree may overfit the training data
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – The result is poor accuracy on unseen samples
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early; do not split a node if this would push the goodness measure below a threshold
    • It is difficult to choose an appropriate threshold
  – Postpruning: remove branches from a “fully grown” tree, producing a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the “best pruned tree”
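Postpruning with held-out data can be sketched as follows (a minimal reduced-error-pruning toy of my own, assuming trees stored as nested dicts with bare labels as leaves; not the slides’ algorithm verbatim):

```python
from collections import Counter

def predict(tree, x):
    # Walk down the tree; a non-dict node is a leaf label
    while isinstance(tree, dict):
        side = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def prune(tree, data):
    """Bottom-up: replace a subtree with its majority label whenever doing so
    does not hurt accuracy on the held-out data routed to that subtree."""
    if not isinstance(tree, dict) or not data:
        return tree
    left = [(x, y) for x, y in data if x[tree["feature"]] <= tree["threshold"]]
    right = [(x, y) for x, y in data if x[tree["feature"]] > tree["threshold"]]
    tree["left"] = prune(tree["left"], left)
    tree["right"] = prune(tree["right"], right)
    majority = Counter(y for _, y in data).most_common(1)[0][0]
    if accuracy(majority, data) >= accuracy(tree, data):
        return majority          # the pruned leaf is at least as good
    return tree
```

On held-out data where a whole subtree’s branches add nothing, the subtree collapses to a single majority leaf.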
Which of the “Pigeon Problems” can be solved by a Decision Tree?

[Figure: three “Pigeon Problem” scatter plots, two on 1-10 axes and one on 10-100 axes.]

1) Deep Bushy Tree
2) Useless
3) Deep Bushy Tree

The Decision Tree has a hard time with correlated attributes.
UT Austin, R. Mooney
Cross-Validating without Losing Training Data
• If the algorithm is modified to grow trees breadth-first rather than depth-first, we can stop growing after reaching any specified tree complexity.
• First, run several trials of reduced-error pruning using different random splits of the data into growing and validation sets.
• Record the complexity of the pruned tree learned in each trial. Let C be the average pruned-tree complexity.
• Grow a final tree breadth-first from all the training data but stop when the complexity reaches C.
• A similar cross-validation approach can be used to set arbitrary algorithm parameters in general.
Advantages/Disadvantages of Decision Trees

• Advantages:
  – Easy to understand (doctors love them!)
  – Easy to generate rules
• Disadvantages:
  – May suffer from overfitting
  – Classifies by rectangular partitioning (so it does not handle correlated features very well)
  – Trees can be quite large; pruning is necessary
  – Does not handle streaming data easily
Additional Decision Tree Issues
• Better splitting criteria
  – Information gain prefers features with many values
• Continuous features
• Predicting a real-valued function (regression trees)
• Missing feature values
• Features with costs
• Misclassification costs
• Incremental learning
  – ID4
  – ID5
• Mining large databases that do not fit in main memory