Decision Tree CE-717 : Machine Learning Sharif University of Technology
M. Soleymani
Fall 2012
Some slides have been adapted from: Prof. Tom Mitchell
Decision tree
Approximating target functions over a (usually) discrete domain
The learned function is represented by a decision tree
Application: Database mining
2
PlayTennis example: Training samples
3
A PlayTennis example
Feature 1: Humidity
Feature 2: Wind
Feature 3: Outlook
Feature 4: Temperature
Output: Play Tennis
Example:
𝒙=[low, weak, rain, hot] 𝑦=𝑁𝑜
4
PlayTennis example: decision tree
5
Decision tree structure
Internal node: a test on some attributes
Branch: one of the possible results of the test
Leaf: predicting y
6
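As a concrete sketch of this structure (illustrative Python of our own, not code from the course), an internal node can hold the tested attribute and a branch table, and a leaf can hold the predicted 𝑦:

    # Minimal sketch of the tree structure described above (hypothetical names).

    class Leaf:
        """Leaf: stores the predicted value of y."""
        def __init__(self, label):
            self.label = label

    class Node:
        """Internal node: tests one attribute; branches map test results to subtrees."""
        def __init__(self, attribute, branches):
            self.attribute = attribute      # name of the attribute being tested
            self.branches = branches        # dict: attribute value -> Node or Leaf

    def predict(tree, x):
        """Follow the branch matching x's attribute value until a leaf is reached."""
        while isinstance(tree, Node):
            tree = tree.branches[x[tree.attribute]]
        return tree.label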
Decision tree learning:
Function approximation problem
Problem Setting:
Set of possible instances 𝑋
Unknown target function 𝑓: 𝑋 → 𝑌 (𝑌 is discrete valued)
Set of function hypotheses 𝐻 = {ℎ | ℎ: 𝑋 → 𝑌}
Each ℎ is a decision tree that sorts each 𝒙 to a leaf, which assigns a value of 𝑦
Input:
Training examples {(𝒙^(𝑖), 𝑦^(𝑖))} of unknown target function 𝑓
Output:
Hypothesis ℎ ∈ 𝐻 that best approximates target function 𝑓
7
Decision trees hypothesis space
Suppose attributes are Boolean
Disjunction of conjunctions
Which trees represent the following functions?
𝑦 = 𝑥1 𝑎𝑛𝑑 𝑥3
𝑦 = 𝑥1 𝑜𝑟 𝑥3
𝑦 = (𝑥1 𝑎𝑛𝑑 𝑥3) 𝑜𝑟(𝑥2 𝑎𝑛𝑑 ¬𝑥4)
8
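For instance, the trees for the first two functions can be written as nested tests (a sketch assuming the Boolean features arrive as a dict; names are ours):

    # Decision trees for the Boolean functions above, written as nested tests (sketch).

    def tree_and(x):                 # y = x1 and x3
        if x["x1"]:                  # root test on x1
            return bool(x["x3"])     # this branch still depends on x3
        return False                 # x1 = 0 already decides y = 0

    def tree_or(x):                  # y = x1 or x3
        if x["x1"]:
            return True              # x1 = 1 already decides y = 1
        return bool(x["x3"])         # otherwise the answer depends on x3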
Decision tree as a rule base
Decision tree = a set of rules
Disjunctions of conjunctions of constraints on the attribute
values of instances
Each path from root to a leaf = conjunction of attribute tests
All of the leaves with 𝑦 = 𝑖 are combined (as a disjunction) to form the rule for 𝑦 = 𝑖
9
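A sketch of reading the rules off a tree (reusing the hypothetical Node and Leaf classes above): each root-to-leaf path yields one conjunction of attribute tests, and the paths ending in the same label are OR-ed together.

    # Collect one rule (a conjunction of attribute tests) per root-to-leaf path (sketch).

    def extract_rules(tree, conditions=()):
        if isinstance(tree, Leaf):
            return [(list(conditions), tree.label)]      # one rule: conditions => label
        rules = []
        for value, subtree in tree.branches.items():
            rules += extract_rules(subtree, conditions + ((tree.attribute, value),))
        return rules

    # All rules that predict y = i together form the disjunction for class i.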
How to construct basic decision tree?
Learning an optimal decision tree is NP-Complete
Instead, we use a greedy search based on a heuristic
Which attribute at the root?
The attribute 𝑨 that is the best classifier
Measure: how well the attribute splits the set into homogeneous subsets (having the same value of the target)
We prefer decisions leading to a simple, compact tree with few nodes
How to form descendants?
A descendant is created for each possible value of 𝐴
Training examples are sorted to descendant nodes
10
Basic decision tree learning algorithm (ID3)
ID3(𝐸𝑥𝑎𝑚𝑝𝑙𝑒𝑠, 𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝑠)
1) Create a root node 𝑟
2) If all examples have the same label or 𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝑠 = {}, return 𝑟 with the proper label
3) 𝐴 ← the best classifier attribute on 𝐸𝑥𝑎𝑚𝑝𝑙𝑒𝑠
4) For each value of 𝐴 add a new branch below 𝑟 and sort
examples accordingly
5) Recursively grow the tree under each branch
Top down, Greedy, No backtrack
11
ID3
12
•ID3 (Examples, Target_Attribute, Attributes)
•Create a root node for the tree
•If all examples are positive, return the single-node tree Root, with label = +
•If all examples are negative, return the single-node tree Root, with label = -
•If number of predicting attributes is empty then
• return Root, with label = most common value of the target attribute in the examples
•else
•A = the attribute that best classifies Examples
•Set the decision attribute for Root to A
•for each possible value, 𝑣𝑖, of A
•Add a new tree branch below Root, corresponding to the test A = 𝑣𝑖
•Let Examples(𝑣𝑖) be the subset of Examples that have the value 𝑣𝑖 for A
•if Examples(𝑣𝑖) is empty then
• below this new branch add a leaf node with label = most common target value in the examples
•else below this new branch add subtree ID3 (Examples(𝒗𝒊), Target_Attribute, Attributes – {A})
•return Root
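The pseudocode above translates into a compact Python sketch. This is an illustrative implementation under simplifying assumptions (examples are dicts of categorical attribute values, labels are a parallel list, Node and Leaf are the classes sketched earlier, and the attribute choice uses the information gain defined on the following slides); it is not the course's reference code.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """H(Y) for a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        """H(Y) minus the weighted entropy of the subsets induced by the attribute."""
        n = len(labels)
        subsets = {}
        for x, y in zip(examples, labels):
            subsets.setdefault(x[attribute], []).append(y)
        remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
        return entropy(labels) - remainder

    def id3(examples, labels, attributes):
        """Recursive ID3 sketch; returns a Leaf or a Node (classes from the earlier sketch)."""
        if len(set(labels)) == 1:                      # all examples share one label
            return Leaf(labels[0])
        if not attributes:                             # no attributes left to test
            return Leaf(Counter(labels).most_common(1)[0][0])
        best = max(attributes, key=lambda a: information_gain(examples, labels, a))
        branches = {}
        for value in {x[best] for x in examples}:      # one branch per observed value
            idx = [i for i, x in enumerate(examples) if x[best] == value]
            sub_x = [examples[i] for i in idx]
            sub_y = [labels[i] for i in idx]
            branches[value] = id3(sub_x, sub_y, [a for a in attributes if a != best])
        return Node(best, branches)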
Which attribute is the best?
13
Which attribute is the best?
14
A variety of heuristics for picking a good test
Information gain: originated with ID3 (Quinlan, 1979).
Gini impurity
…
Entropy
𝐻(𝑋) = − Σ_{𝑥𝑖∈𝑋} 𝑃(𝑥𝑖) log 𝑃(𝑥𝑖)
Amount of uncertainty associated with a specific
probability distribution
15
Entropy
Information theory:
𝐻 𝑋 : expected number of bits needed to encode a randomly
drawn value of 𝑋 (under most efficient code)
Most efficient code assigns −log 𝑃(𝑋 = 𝑖) bits to encode 𝑋 = 𝑖
⇒ expected number of bits to code one random 𝑋 is 𝐻(𝑋)
16
Entropy for a Boolean variable
[Figure: 𝐻(𝑋) plotted against 𝑃(𝑋 = 1); the entropy is 0 when 𝑃(𝑋 = 1) is 0 or 1 and peaks at 1 bit when 𝑃(𝑋 = 1) = 0.5]
𝐻(𝑋) = −1 log₂ 1 − 0 log₂ 0 = 0
𝐻(𝑋) = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1
17
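A quick numeric check of the two cases above (a small sketch; the helper name is ours):

    from math import log2

    def binary_entropy(p):
        """H(X) for a Boolean variable with P(X = 1) = p, in bits."""
        if p in (0.0, 1.0):
            return 0.0                       # convention: 0 * log 0 = 0
        return -p * log2(p) - (1 - p) * log2(1 - p)

    print(binary_entropy(1.0))   # 0.0 -> a deterministic variable carries no uncertainty
    print(binary_entropy(0.5))   # 1.0 -> a fair coin needs one full bit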
Entropy as a measure of impurity
Conditional entropy
𝐻(𝑌|𝑋) = − Σ_𝑖 Σ_𝑗 𝑃(𝑋 = 𝑖, 𝑌 = 𝑗) log 𝑃(𝑌 = 𝑗|𝑋 = 𝑖)
18
𝐻(𝑌|𝑋) = Σ_𝑖 𝑃(𝑋 = 𝑖) [− Σ_𝑗 𝑃(𝑌 = 𝑗|𝑋 = 𝑖) log 𝑃(𝑌 = 𝑗|𝑋 = 𝑖)]
𝑃(𝑋 = 𝑖): probability of following the 𝑖-th branch for 𝑋
The bracketed term: entropy of 𝑌 for the samples sorted to the 𝑖-th branch
Mutual Information
𝐼(𝑋, 𝑌) = 𝐻(𝑋) − 𝐻(𝑋|𝑌) = 𝐻(𝑌) − 𝐻(𝑌|𝑋)
The expected reduction in entropy caused by partitioning
examples according to this attribute
𝐻(𝑌): entropy of 𝑌 before the split
𝐻(𝑌|𝑋): entropy after the split
19
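Both quantities translate directly into code. The sketch below assumes the joint distribution 𝑃(𝑋 = 𝑖, 𝑌 = 𝑗) is given as a nested dict, a toy representation chosen only for illustration:

    from math import log2

    def conditional_entropy(joint):
        """H(Y|X) from a dict {i: {j: P(X=i, Y=j)}}."""
        h = 0.0
        for i, row in joint.items():
            p_x = sum(row.values())                       # P(X = i)
            for p_xy in row.values():
                if p_xy > 0:
                    h -= p_xy * log2(p_xy / p_x)          # -P(X=i,Y=j) log P(Y=j|X=i)
        return h

    def entropy_y(joint):
        """H(Y) from the same joint distribution (marginalize over X)."""
        p_y = {}
        for row in joint.values():
            for j, p in row.items():
                p_y[j] = p_y.get(j, 0.0) + p
        return -sum(p * log2(p) for p in p_y.values() if p > 0)

    def mutual_information(joint):
        """I(X, Y) = H(Y) - H(Y|X)."""
        return entropy_y(joint) - conditional_entropy(joint)

    # Toy joint distribution (our numbers): X independent of Y gives I(X, Y) = 0.
    joint = {0: {"yes": 0.25, "no": 0.25}, 1: {"yes": 0.25, "no": 0.25}}
    print(mutual_information(joint))   # 0.0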
Information Gain
𝐺𝑎𝑖𝑛(𝑆, 𝐴) = 𝐼_𝑆(𝐴, 𝑌) = 𝐻_𝑆(𝑌) − 𝐻_𝑆(𝑌|𝐴)
𝑌: target variable
𝑆: samples
𝐴: variable used to sort samples
𝐺𝑎𝑖𝑛(𝑆, 𝐴) ≡ 𝐻_𝑆(𝑌) − Σ_{𝑣∈Values(𝐴)} (|𝑆_𝑣| / |𝑆|) 𝐻_{𝑆_𝑣}(𝑌)
20
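When computed from a training set, this weighted-entropy form is exactly what the information_gain helper in the ID3 sketch earlier evaluates; for example (toy counts of our own, not the slide's table):

    # Reusing information_gain from the ID3 sketch above (toy data).
    examples = [{"Wind": "Weak"}] * 6 + [{"Wind": "Strong"}] * 8
    labels   = ["Yes"] * 6 + ["Yes"] * 3 + ["No"] * 5
    print(information_gain(examples, labels, "Wind"))   # H_S(Y) minus weighted child entropies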
Information Gain: Example
21
How to find the best attribute classifier?
Information gain as our criterion for a good split
Choose the attribute that maximizes information gain at each node
Choose the 𝑗-th attribute: 𝑗 = argmax_𝑖 𝐼𝐺(𝑋𝑖) = argmax_𝑖 [𝐻(𝑌) − 𝐻(𝑌|𝑋𝑖)]
22
Example
23
Decision tree algorithm
24
The algorithm
either reaches homogeneous nodes
or runs out of attributes
Guaranteed to find a tree consistent with any conflict-free
training set
Conflict free training set: identical feature vectors always assigned the
same class
But it does not necessarily find the simplest such tree.
How does it partition the instance space?
25
Decision tree
Partition the instance space into axis-parallel regions, labeled with
class value
Source: Duda & Hart's book
26
Bias in decision tree induction
Information gain gives a bias toward trees with minimal depth.
Implements a search (preference) bias instead of a
language (restriction) bias.
Which Tree Should We Output?
ID3: heuristic search through space of DTs
Stops at smallest acceptable tree
Occam’s razor: prefer the
simplest hypothesis that
fits the data
27
Ockham (1285-1349) Principle of Parsimony:
“One should not increase, beyond what is necessary,
the number of entities required to explain anything.”
Why prefer short hypotheses?
(Occam’s Razor)
Fewer short hypotheses than long ones
a short hypothesis that fits the data is less likely to be a
statistical coincidence
highly probable that a sufficiently complex hypothesis will fit
the data
28
Over-fitting problem
ID3 usually perfectly classifies training data
It tries to memorize every training example
Decisions based on very little data may not reflect reliable trends
e.g., a node that "should" be pure but has a single exception
Noise in the training data: the tree erroneously fits the noise
With many (irrelevant) attributes, the algorithm will continue
to split nodes
leads to over-fitting!
29
Over-fitting problem: an example
Consider adding a (noisy) training example:
How does the tree change?
Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No
30
General definition of overfitting
Hypothesis ℎ ∈ 𝐻
Training error: 𝑒𝑟𝑟𝑜𝑟_𝑡𝑟𝑎𝑖𝑛(ℎ)
Expected error: 𝑒𝑟𝑟𝑜𝑟_𝑡𝑟𝑢𝑒(ℎ)
ℎ overfits the training data if there is an ℎ′ ∈ 𝐻 such that
𝑒𝑟𝑟𝑜𝑟_𝑡𝑟𝑎𝑖𝑛(ℎ) < 𝑒𝑟𝑟𝑜𝑟_𝑡𝑟𝑎𝑖𝑛(ℎ′) and
𝑒𝑟𝑟𝑜𝑟_𝑡𝑟𝑢𝑒(ℎ) > 𝑒𝑟𝑟𝑜𝑟_𝑡𝑟𝑢𝑒(ℎ′)
31
Over-fitting in decision tree learning
32
A question?
33
How can it be made smaller and simpler?
When should a node be declared a leaf?
If a leaf node is impure, how should the category label be assigned?
Pruning?
Avoiding overfitting
Stop growing when the data split is not statistically
significant.
Grow full tree and then prune it.
More successful in practice than stopping growth early.
How to select “best” tree:
Measure performance over separate validation set
MDL: minimize
𝑠𝑖𝑧𝑒(𝑡𝑟𝑒𝑒) + 𝑠𝑖𝑧𝑒(𝑚𝑖𝑠𝑐𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑐𝑎𝑡𝑖𝑜𝑛𝑠(𝑡𝑟𝑒𝑒))
34
Reduced-error pruning
Split data into train and validation set
Build tree using training set
Do until further pruning is harmful:
Evaluate impact on validation set when pruning sub-tree
rooted at each node
Temporarily remove sub-tree rooted at node
Replace it with a leaf labeled with the current majority class at that node
Measure and record error on validation set
Greedily remove the one that most improves validation set
accuracy (if any).
35
Produces the smallest version of the most accurate sub-tree.
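A sketch of this procedure under assumptions of our own: trees reuse the Node, Leaf, and predict sketches from earlier, each internal node is assumed to store its training-set majority class in a majority field, and the validation set is a list of (x, y) pairs.

    import copy

    def accuracy(tree, validation):
        """Fraction of (x, y) pairs in the validation set classified correctly."""
        return sum(predict(tree, x) == y for x, y in validation) / len(validation)

    def node_paths(tree, path=()):
        """Paths (sequences of branch values) leading to every internal node."""
        if isinstance(tree, Leaf):
            return []
        paths = [path]
        for value, subtree in tree.branches.items():
            paths += node_paths(subtree, path + (value,))
        return paths

    def prune_at(tree, path):
        """Copy of the tree with the node at `path` replaced by a leaf.
        Assumes each Node stores node.majority, its training majority class (our addition)."""
        tree = copy.deepcopy(tree)
        if not path:
            return Leaf(tree.majority)
        node = tree
        for value in path[:-1]:
            node = node.branches[value]
        node.branches[path[-1]] = Leaf(node.branches[path[-1]].majority)
        return tree

    def reduced_error_pruning(tree, validation):
        """Greedily apply the single prune that most improves validation accuracy."""
        while True:
            best_acc, best_tree = accuracy(tree, validation), None
            for path in node_paths(tree):
                candidate = prune_at(tree, path)
                if accuracy(candidate, validation) > best_acc:
                    best_acc, best_tree = accuracy(candidate, validation), candidate
            if best_tree is None:          # no prune improves validation accuracy: stop
                return tree
            tree = best_tree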
Rule post-pruning
Convert tree to equivalent set of rules
Prune each rule independently of others
Sort final rules into desired sequence for use
Perhaps most frequently used method (e.g., C4.5)
36
Continuous attributes
How to turn tests on continuous variables into Boolean tests?
Either use a threshold to turn them into binary tests, or discretize
It is possible to compute information gain for all possible
thresholds (there are a finite number of training samples)
Harder if we wish to assign more than two values (can be
done recursively)
37
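A sketch of the threshold search mentioned above: only finitely many thresholds matter (the midpoints between consecutive sorted values), so the information gain of each candidate split x ≤ t can be evaluated directly. The helper names and the numbers in the usage line are ours.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """H(Y) for a list of labels (repeated here so this snippet runs on its own)."""
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Best binary split 'x <= t' of a continuous attribute by information gain."""
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best = (0.0, None)
        for k in range(1, len(pairs)):
            if pairs[k - 1][0] == pairs[k][0]:
                continue                                   # no threshold between equal values
            t = (pairs[k - 1][0] + pairs[k][0]) / 2        # midpoint candidate
            left = [y for v, y in pairs[:k]]
            right = [y for v, y in pairs[k:]]
            gain = base - (len(left) / len(pairs)) * entropy(left) \
                        - (len(right) / len(pairs)) * entropy(right)
            if gain > best[0]:
                best = (gain, t)
        return best                                        # (gain, threshold)

    # Toy continuous attribute values and labels (not the slide's data).
    print(best_threshold([64, 65, 68, 69, 70, 71], ["Yes", "No", "Yes", "Yes", "Yes", "No"]))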
More issues on decision trees
Better splitting criteria
Information gain prefers features with many values.
Missing feature values
Features with costs
Misclassification costs
Incremental learning (ID4, ID5)
Regression trees
Ranking classifiers
Rich Caruana & Alexandru Niculescu-Mizil, An Empirical Comparison of Supervised Learning
Algorithms, ICML 2006
Top 8 are all based on various extensions of
decision trees
39
Decision tree advantages
40
Simple to understand and interpret
Requires little data preparation
Can handle both numerical and categorical data
Performs well with large data in a short time
Robust
Performs well even if its assumptions are somewhat violated