1
Classification with Decision Trees II
Instructor: Qiang Yang, Hong Kong University of Science and Technology
Thanks: Eibe Frank and Jiawei Han
2
Part II: Industrial-strength algorithms
Requirements for an algorithm to be useful in a wide range of real-world applications:
• Can deal with numeric attributes
• Doesn't fall over when missing values are present
• Is robust in the presence of noise
• Can (at least in principle) approximate arbitrary concept descriptions
Basic schemes (may) need to be extended to fulfill these requirements
3
Decision trees
Extending ID3 to deal with numeric attributes: pretty straightforward
Dealing sensibly with missing values: a bit trickier
Stability for noisy data: requires sophisticated pruning mechanism
End result of these modifications: Quinlan’s C4.5
Best-known and (probably) most widely-used learning algorithm
Commercial successor: C5.0
4
Numeric attributes
Standard method: binary splits (e.g. temp < 45)
Difference to nominal attributes: every attribute offers many possible split points
Solution is a straightforward extension:
• Evaluate info gain (or another measure) for every possible split point of the attribute
• Choose the "best" split point
• Info gain for the best split point is the info gain for the attribute
Computationally more demanding
5
An example
Split on the temperature attribute from the weather data:

Temperature:  64   65   68   69   70   71   72   72   75   75   80   81   83   85
Play:         Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No

E.g. 4 yeses and 2 nos for temperature < 71.5, and 5 yeses and 3 nos for temperature ≥ 71.5:

info([4,2],[5,3]) = (6/14)·info([4,2]) + (8/14)·info([5,3]) = 0.939 bits

Split points are placed halfway between values
All split points can be evaluated in one pass!
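To make the one-pass scan concrete, here is a minimal Python sketch (illustrative only, not C4.5's code): sort once, then sweep the class counts from the right side of the split to the left, scoring the midpoint between each pair of distinct adjacent values. The info([4,2],[5,3]) computation above is exactly what the sweep evaluates at the candidate 71.5; the function returns the best midpoint and its gain.

    import math

    def entropy(counts):
        # Entropy of a class distribution given as a list of counts.
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def best_split(values, labels):
        # Return (split_point, info_gain) of the best binary split, scanning
        # all midpoints between distinct adjacent values in one pass.
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        classes = sorted(set(labels))
        left = dict.fromkeys(classes, 0)
        right = {c: labels.count(c) for c in classes}
        base = entropy(list(right.values()))
        best_point, best_gain = None, -1.0
        for i, (value, label) in enumerate(pairs[:-1]):
            left[label] += 1
            right[label] -= 1
            if value == pairs[i + 1][0]:
                continue  # a split is only possible between distinct values
            info = ((i + 1) * entropy(list(left.values()))
                    + (n - i - 1) * entropy(list(right.values()))) / n
            if base - info > best_gain:
                best_point, best_gain = (value + pairs[i + 1][0]) / 2, base - info
        return best_point, best_gain

    temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    play = "Y N Y Y Y N N Y Y Y N Y Y N".split()
    print(best_split(temps, play))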
6
Avoiding repeated sorting
Instances need to be sorted according to the values of the numeric attribute considered
Time complexity of sorting: O(n log n)
Does this have to be repeated at each node? No! The sort order from the parent node can be used to derive the sort order for the children
Time complexity of the derivation: O(n)
Only drawback: need to create and store an array of sorted indices for each numeric attribute
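A sketch of the derivation step (helper names are mine, not from any particular implementation): one linear scan of the parent's sorted index array preserves the relative order within each child.

    def child_sort_orders(parent_order, goes_left):
        # parent_order: instance indices sorted by one numeric attribute.
        # goes_left[i]: True if instance i is routed to the left child.
        # A single O(n) pass; each child list inherits the parent's sort order.
        left_order, right_order = [], []
        for idx in parent_order:
            (left_order if goes_left[idx] else right_order).append(idx)
        return left_order, right_order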
7
Notes on binary splits
Information in nominal attributes is computed using one multi-way split on that attribute
This is not the case for binary splits on numeric attributes
The same numeric attribute may be tested several times along a path in the decision tree
Disadvantage: the tree is relatively hard to read
Possible remedies: pre-discretization of numeric attributes, or multi-way splits instead of binary ones
8
Example of Binary Split
(Figure: a decision tree path that tests the same numeric attribute repeatedly: Age <= 3, then Age <= 5, then Age <= 10.)
9
Missing values
C4.5 splits instances with missing values into pieces (with weights summing to 1)
A piece going down a particular branch receives a weight proportional to the popularity of the branch
Info gain etc. can be used with fractional instances using sums of weights instead of counts
During classification, the same procedure is used to split instances into pieces
Probability distributions are merged using weights
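A toy illustration of the fractional-instance idea (a sketch of the weighting scheme, not C4.5's code):

    def split_weights(instance_weight, branch_counts):
        # Split an instance with a missing test value into fractional pieces,
        # one per branch, weighted by each branch's share of training instances.
        total = sum(branch_counts)
        return [instance_weight * count / total for count in branch_counts]

    # A unit-weight instance at a node whose branches received 5, 3 and 2
    # training instances becomes pieces of weight 0.5, 0.3 and 0.2.
    print(split_weights(1.0, [5, 3, 2]))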
10
Stopping Criteria
When all cases have the same class. The leaf node is labeled by this class.
When there is no available attribute. The leaf node is labeled by the majority class.
When the number of cases is less than a specified threshold. The leaf node is labeled by the majority class.
11
Pruning
Pruning simplifies a decision tree to prevent overfitting to noise in the data
Two main pruning strategies:
1. Postpruning: takes a fully-grown decision tree and discards unreliable parts
2. Prepruning: stops growing a branch when information becomes unreliable
Postpruning is preferred in practice because prepruning can stop growing too early
12
Prepruning
Usually based on a statistical significance test
Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
Most popular test: the chi-squared test
ID3 used the chi-squared test in addition to information gain:
• Only statistically significant attributes were allowed to be selected by the information gain procedure
13
The Weather example: Observed Count
             Play = Yes   Play = No   Outlook subtotal
Sunny             2           0              2
Cloudy            0           1              1
Play subtotal:    2           1       Total count in table = 3
14
The Weather example: Expected Count
Expected count for a cell = (row subtotal × column subtotal) / total count:

             Play = Yes       Play = No       Subtotal
Sunny        2×2/3 ≈ 1.33     2×1/3 ≈ 0.67       2
Cloudy       1×2/3 ≈ 0.67     1×1/3 ≈ 0.33       1
Subtotal:         2                1        Total count in table = 3

If Outlook and Play were independent, the cell counts would look like this.
15
Question: how different are the observed and expected counts?
• If the chi-squared value is very large, then the attribute and the class are not independent; that is, they are dependent!
• Degrees of freedom: if the table has n×m cells, then freedom = (n−1)×(m−1)
• If all attributes at a node are independent of the class attribute, then stop splitting further.
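As a worked check on the two tables above (the slides stop short of computing the value):

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(2 - \frac{4}{3})^2}{4/3} + \frac{(0 - \frac{2}{3})^2}{2/3} + \frac{(0 - \frac{2}{3})^2}{2/3} + \frac{(1 - \frac{1}{3})^2}{1/3} = \frac{1}{3} + \frac{2}{3} + \frac{2}{3} + \frac{4}{3} = 3$$

With (2−1)×(2−1) = 1 degree of freedom, the 5% critical value is 3.84, so on this tiny table the association between Outlook and Play would not be judged significant, and prepruning would stop here.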
16
Postpruning
Builds the full tree first and prunes it afterwards
• Attribute interactions are visible in the fully-grown tree
Problem: identification of the subtrees and nodes that are due to chance effects
Two main pruning operations:
1. Subtree replacement
2. Subtree raising
Possible strategies: error estimation, significance testing, MDL principle
17
Subtree replacement
Bottom-up: tree is considered for replacement once all its subtrees have been considered
18
Subtree raising
Deletes a node and redistributes the instances
Slower than subtree replacement (worthwhile?)
19
Estimating error rates
Pruning operation is performed if this does not increase the estimated error
Of course, error on the training data is not a useful estimator (would result in almost no pruning)
One possibility: using hold-out set for pruning (reduced-error pruning)
C4.5’s method: using upper limit of 25% confidence interval derived from the training data
Standard Bernoulli-process-based method
20

Training Set

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
21
Post-pruning in C4.5

Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
Method 1: compute accuracy using examples not seen by the algorithm.
Method 2: estimate accuracy using the training examples:
• Consider classifying E examples incorrectly out of N examples as observing E events in N trials of a binomial distribution.
• For a given confidence level CF, the upper limit on the error rate over the whole population is $U_{CF}(E, N)$ with CF% confidence.
22
Usage in statistics: sampling error estimation

Example:
• population: 1,000,000 people, could be regarded as infinite
• population mean: percentage of left-handed people
• sample: 100 people
• sample mean: 6 left-handed
• How do we estimate the REAL population mean?

Pessimistic estimate:
(Figure: the sampling distribution, with the sample mean 6 lying between $L_{0.25}(100, 6) \approx 2$ and $U_{0.25}(100, 6) \approx 10$ on the percentage axis; the slide labels this range a 75% confidence interval.)
23
Usage in a decision tree (DT): error estimation for a node in the DT

Example:
• unknown testing data: could be regarded as an infinite universe
• population mean: percentage of errors made by this node
• sample: 100 examples from the training data set
• sample mean: 6 errors on the training data set
• How do we estimate the REAL average error rate?

Pessimistic estimate:
(Figure: as before, the sample error 6 lies between $L_{0.25}(100, 6) \approx 2$ and $U_{0.25}(100, 6) \approx 10$ on the percentage axis, labeled as a 75% confidence interval.)

Heuristic! But works well...
24
C4.5’s method
Error estimate for subtree is weighted sum of error estimates for all its leaves
Error estimate for a node:

$$e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$

• f is the error rate on the training data
• N is the number of instances covered by the leaf
• z is the normal deviate for confidence c; if c = 25% then z = 0.69 (from the normal distribution)
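A direct transcription of the formula into Python (a sketch; C4.5 itself differs in small details). The printed values match the example slide further below:

    import math

    def pessimistic_error(f, N, z=0.69):
        # Upper limit of the confidence interval for the true error rate,
        # given training error rate f on N instances (z = 0.69 for c = 25%).
        return ((f + z**2 / (2 * N)
                 + z * math.sqrt(f / N - f**2 / N + z**2 / (4 * N**2)))
                / (1 + z**2 / N))

    print(round(pessimistic_error(2 / 6, 6), 2))  # 0.47, as in the later example
    print(round(pessimistic_error(1 / 2, 2), 2))  # 0.72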
25
Example for Estimating Error
Consider a subtree rooted at Outlook with 3 leaf nodes: Sunny: Play = yes : (0 error, 6 instances) Overcast: Play= yes: (0 error, 9 instances) Cloudy: Play = no (0 error, 1 instance)
The estimated error for this subtree is 6*0.206+9*0.143+1*0.750=3.273
If the subtree is replaced with the leaf “yes”, the estimated error is
So the pruning is performed and the tree is merged (see next page)
750.0)0,1(,143.0)0,9(,206.0)0,6( 25.025.025.0 UUU
512.2157.0*16)1,16(*16 25.0 U
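The limit $U_{CF}(N, E)$ can be computed by inverting the binomial CDF; a minimal bisection sketch (my own helper, not Quinlan's code):

    from math import comb

    def binom_cdf(e, n, p):
        # P(X <= e) for X ~ Binomial(n, p).
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e + 1))

    def upper_limit(cf, n, e):
        # U_CF(N, E): the error rate p at which observing at most e errors
        # in n trials has probability cf, found by bisection.
        lo, hi = e / n, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(e, n, mid) > cf:
                lo = mid
            else:
                hi = mid
        return lo

    print(round(upper_limit(0.25, 6, 0), 3))   # 0.206
    print(round(upper_limit(0.25, 9, 0), 3))   # 0.143
    print(round(upper_limit(0.25, 1, 0), 3))   # 0.750
    print(round(upper_limit(0.25, 16, 1), 3))  # ~0.16; the slide's 0.157 reflects
                                               # C4.5's own approximation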
26
Example continued
(Figure: the subtree Outlook → {sunny: yes, overcast: yes, cloudy: no} is replaced by the single leaf "yes".)
27
Example
(Figure: a subtree whose three leaves have training error rates f = 0.33, 0.5 and 0.33, with pessimistic estimates e = 0.47, 0.72 and 0.47; the parent node has f = 5/14 and e = 0.46.)

Combined using the ratios 6:2:6, the leaves' estimates give 0.51, so the subtree is pruned to its parent.
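As an arithmetic check, the 6:2:6 weights are the leaves' instance counts: (6 × 0.47 + 2 × 0.72 + 6 × 0.47) / 14 = 7.08 / 14 ≈ 0.51, which is worse than the parent's own estimate of 0.46, so replacing the subtree with a single node lowers the estimated error.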
28
Complexity of tree induction
Assume m attributes, n training instances, and a tree of depth O(log n)
• Cost of building a tree: O(mn log n)
• Complexity of subtree replacement: O(n)
• Complexity of subtree raising: O(n (log n)²)
  Every instance may have to be redistributed at every node between its leaf and the root: O(n log n)
  Cost of a redistribution (on average): O(log n)
Total cost: O(mn log n) + O(n (log n)²)
29
The CART Algorithm
Splitting criterion, standard deviation reduction:

$$SDR = sd(T) - \sum_i \frac{|T_i|}{|T|} \times sd(T_i), \qquad SD(T) = \sum_{x \in T} P(x)\,(x - \bar{x})^2$$

Linear model at a leaf:

$$y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_k x_k = \sum_{j=0}^{k} w_j x_j$$

e.g. for the first training instance $x^{(1)}$: $y^{(1)} = w_0 x_0^{(1)} + w_1 x_1^{(1)} + \dots + w_k x_k^{(1)}$

Least-squares weights:

$$W = (X^T X)^{-1} X^T y$$
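A toy check of the normal-equation solution (data values invented for illustration; any numerical library will do):

    import numpy as np

    # X has a leading column of ones playing the role of x_0, so w_0 is the bias.
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
    y = np.array([3.1, 4.9, 7.2, 8.8])
    W = np.linalg.inv(X.T @ X) @ X.T @ y
    print(W)  # approximately [1.15, 1.94], i.e. y ≈ 1.15 + 1.94 x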
30
Numeric prediction

Counterparts exist for all the schemes that we previously discussed:
• decision trees, rule learners, SVMs, etc.
All classification schemes can be applied to regression problems using discretization:
• prediction: weighted average of the intervals' midpoints (weighted according to class probabilities)
Regression is more difficult than classification (e.g. percent correct vs. mean squared error as the evaluation measure)
31
Regression trees
Differences to decision trees:
• Splitting criterion: minimizing intra-subset variation
• Pruning criterion: based on a numeric error measure
• Leaf node predicts the average class value of the training instances reaching that node
Can approximate piecewise constant functions
Easy to interpret
More sophisticated version: model trees
32
Model trees
Regression trees with linear regression functions at each node
Linear regression is applied to the instances that reach a node after the full regression tree has been built
Only a subset of the attributes is used for LR:
• attributes occurring in the subtree (+ maybe attributes occurring in the path to the root)
Fast: the overhead for LR is not large, because usually only a small subset of the attributes is used in the tree
33
Smoothing
The naïve method for prediction outputs the value of the LR model at the corresponding leaf node
Performance can be improved by smoothing predictions using the internal LR models:
• the predicted value is a weighted average of the LR models along the path from root to leaf
Smoothing formula:

$$p' = \frac{np + kq}{n + k}$$

where p' is the prediction passed up to the next higher node, p is the prediction passed up from below, q is the value predicted by the model at this node, n is the number of training instances that reach the node below, and k is a smoothing constant
The same effect can be achieved by incorporating the internal models into the leaf nodes
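A quick numeric illustration (values invented): with n = 10, p = 0.6, q = 0.9 and k = 15, the smoothed prediction is p' = (10 × 0.6 + 15 × 0.9) / (10 + 15) = 19.5 / 25 = 0.78, pulling the leaf's raw prediction toward the internal model's value.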
34
Building the tree
Splitting criterion: standard deviation reduction

$$SDR = sd(T) - \sum_i \frac{|T_i|}{|T|} \times sd(T_i)$$

Termination criteria (important when building trees for numeric prediction):
• the standard deviation becomes smaller than a certain fraction of the sd for the full training set (e.g. 5%)
• too few instances remain (e.g. fewer than four)
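A minimal Python sketch of the SDR computation (helper names are mine):

    import statistics

    def sdr(parent_values, child_value_lists):
        # Standard deviation reduction: sd of the parent's class values minus
        # the size-weighted sum of the subsets' standard deviations.
        n = len(parent_values)
        return statistics.pstdev(parent_values) - sum(
            len(child) / n * statistics.pstdev(child)
            for child in child_value_lists)

    # Splitting [1,2,3,10,11,12] into its two clusters removes most of the spread:
    print(round(sdr([1, 2, 3, 10, 11, 12], [[1, 2, 3], [10, 11, 12]]), 3))  # ≈ 3.757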
35
Pruning
Pruning is based on the estimated absolute error of the LR models
Heuristic estimate:

$$\frac{n + v}{n - v} \times \text{average absolute error}$$

where n is the number of training instances that reach the node and v is the number of parameters in the node's linear model
LR models are pruned by greedily removing terms to minimize the estimated error
Model trees allow for heavy pruning: often a single LR model can replace a whole subtree
Pruning proceeds bottom-up: the error for the LR model at an internal node is compared to the error for the subtree
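For instance (numbers invented for illustration), a model with v = 4 parameters fit to the n = 20 instances at a node has its average absolute training error multiplied by (20 + 4)/(20 − 4) = 1.5, so models with more parameters must earn their keep with substantially lower raw error.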
36
Nominal attributes
Nominal attributes are converted into binary attributes (that can be treated as numeric ones):
• nominal values are sorted using the average class value
• if there are k values, k−1 binary attributes are generated
• the i-th binary attribute is 0 if an instance's value is one of the first i in the ordering, 1 otherwise
It can be proven that the best split on one of the new attributes is the best binary split on the original
But M5' only does the conversion once
37
Missing values

Modified splitting criterion:

$$SDR = \frac{m}{|T|} \times \left[ sd(T) - \sum_i \frac{|T_i|}{|T|} \times sd(T_i) \right]$$

where m is the number of instances without missing values for that attribute and T is the set of instances that reach the node
Procedure for deciding into which subset an instance goes: surrogate splitting
• choose as surrogate the attribute that is most highly correlated with the original attribute
• problem: complex and time-consuming
• simple solution: always use the class
At testing time: replace the missing value with the average
38
Pseudo-code for M5’
Four methods:
• Main method: MakeModelTree()
• Method for splitting: split()
• Method for pruning: prune()
• Method that computes the error: subtreeError()
We'll briefly look at each method in turn
The linear regression method is assumed to perform attribute subset selection based on error
39
MakeModelTree()
MakeModelTree(instances) {
    SD = sd(instances)
    for each k-valued nominal attribute
        convert into k-1 synthetic binary attributes
    root = newNode
    root.instances = instances
    split(root)
    prune(root)
    printTree(root)
}
40
split()

split(node) {
    if sizeof(node.instances) < 4 or sd(node.instances) < 0.05*SD
        node.type = LEAF
    else
        node.type = INTERIOR
        for each attribute
            for all possible split positions of the attribute
                calculate the attribute's SDR
        node.attribute = attribute with maximum SDR
        split(node.left)
        split(node.right)
}
41
prune()
prune(node) {
    if node = INTERIOR then
        prune(node.leftChild)
        prune(node.rightChild)
        node.model = linearRegression(node)
        if subtreeError(node) > error(node) then
            node.type = LEAF
}
42
subtreeError()
subtreeError(node) {
    l = node.left; r = node.right
    if node = INTERIOR then
        return (sizeof(l.instances)*subtreeError(l)
              + sizeof(r.instances)*subtreeError(r)) / sizeof(node.instances)
    else
        return error(node)
}
43
Model tree for servo data

(Figure: the induced model tree for the servo data; not reproduced here.)
44
Variations of CART
Applying logistic regression:
• predict the probability of "True" or "False" instead of making a numeric-valued prediction
• predict a probability value p rather than the outcome itself
• the log odds, log(p/(1−p)), is modeled as a linear function of the attributes:

$$\log\frac{p}{1-p} = \sum_i W_i X_i \quad\Longleftrightarrow\quad p = \frac{1}{1 + e^{-WX}}$$
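A minimal sketch of the prediction step under this model (weights invented for illustration):

    import math

    def predict_prob(w, x):
        # Map the linear score WX to a probability via the logistic (sigmoid) link.
        score = sum(wi * xi for wi, xi in zip(w, x))
        return 1.0 / (1.0 + math.exp(-score))

    print(predict_prob([0.5, -1.2], [1.0, 0.3]))  # ≈ 0.535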
45
Other Trees
• Classification trees (misclassification rate):
  current node: $Q = \min(p_0, p_1)$
  children nodes (L, R): $Q = p_L \min(p_{L0}, p_{L1}) + p_R \min(p_{R0}, p_{R1})$
• Decision trees (entropy):
  current node: $Q = -p_0 \log p_0 - p_1 \log p_1$
  children nodes (L, R): $Q = p_L Q_L + p_R Q_R$
• GINI index used in CART (STD = $\sqrt{p_0 p_1}$):
  current node: $Q = p_0 p_1$
  children nodes (L, R): $Q = p_L Q_L + p_R Q_R$
47
Efforts on Scalability
Most algorithms assume data can fit in memory.
Recent efforts focus on disk-resident implementations for decision trees:
• random sampling
• partitioning
Examples:
• SLIQ (EDBT'96 [MAR96])
• SPRINT (VLDB'96 [SAM96])
• PUBLIC (VLDB'98 [RS98])
• RainForest (VLDB'98 [GRG98])