Machine Learning III: Decision Tree Induction (CSE 473)
© Daniel S. Weld
Machine Learning Outline
• Machine learning
  √ Function approximation
  √ Bias
• Supervised learning
  √ Classifiers & concept learning
  √ Version spaces (restriction bias)
  - Decision-tree induction (preference bias)
• Overfitting
• Ensembles of classifiers
Two Strategies for ML
• Restriction bias: use prior knowledge to specify a restricted hypothesis space.
  - Version space algorithm over conjunctions.
• Preference bias: use a broad hypothesis space, but impose an ordering on the hypotheses.
  - Decision trees.
Decision Trees
• Convenient representation
  - Developed with learning in mind
  - Deterministic
• Expressive
  - Equivalent to propositional DNF
  - Handles discrete and continuous parameters
• Simple learning algorithm
  - Handles noise well
  - Classified as follows:
    • Constructive (builds the DT by adding nodes)
    • Eager
    • Batch (but incremental versions exist)
Concept Learning
• E.g., learn the concept "edible mushroom"
  - Target function has two values: T or F
• Represent concepts as decision trees
• Use hill-climbing search through the space of decision trees
  - Start with a simple concept
  - Refine it into a complex concept as needed
Decision Tree Representation
Good day for tennis?

[Figure: decision tree]
  Outlook
  ├─ Sunny    → Humidity
  │    ├─ High   → No
  │    └─ Normal → Yes
  ├─ Overcast → Yes
  └─ Rain     → Wind
       ├─ Strong → No
       └─ Weak   → Yes

Leaves = classification; arcs = choice of value for the parent attribute.

A decision tree is equivalent to logic in disjunctive normal form:
GoodDay ≡ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
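To make the representation concrete, here is a minimal Python sketch (not from the slides; function names and lowercase value strings are my own) of the tree above as a classification procedure, next to its DNF reading:

```python
def classify(outlook, humidity, wind):
    """Walk the 'good day for tennis?' tree from the figure above."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    # outlook == "rain"
    return "yes" if wind == "weak" else "no"

def classify_dnf(outlook, humidity, wind):
    """Same concept as one DNF formula:
    (sunny AND normal) OR overcast OR (rain AND weak)."""
    return ((outlook == "sunny" and humidity == "normal")
            or outlook == "overcast"
            or (outlook == "rain" and wind == "weak"))
```

The two functions agree on every input, which is the sense in which the tree "is" a DNF formula.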
What is the Simplest Tree?
Day  Outlook  Temp  Humid  Wind  Play?
d1   s        h     h      w     n
d2   s        h     h      s     n
d3   o        h     h      w     y
d4   r        m     h      w     y
d5   r        c     n      w     y
d6   r        c     n      s     y
d7   o        c     n      s     y
d8   s        m     h      w     n
d9   s        c     n      w     y
d10  r        m     n      w     y
d11  s        m     n      s     y
d12  o        m     h      s     y
d13  o        h     n      w     y
d14  r        m     h      s     n

(s/o/r = sunny/overcast/rain; h/m/c = hot/mild/cool; h/n = high/normal humidity; w/s = weak/strong wind)
How good is it? The simplest tree predicts the majority class ("yes"):
[10+, 4−] means it is correct on 10 examples and incorrect on 4 examples.
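As a sketch, the table can be encoded as data; the tuple layout and single-letter codes below are just a hypothetical encoding for illustration:

```python
# The 14 training examples (outlook, temp, humid, wind, play) from the table.
examples = [
    ("s", "h", "h", "w", "n"),  # d1
    ("s", "h", "h", "s", "n"),  # d2
    ("o", "h", "h", "w", "y"),  # d3
    ("r", "m", "h", "w", "y"),  # d4
    ("r", "c", "n", "w", "y"),  # d5
    ("r", "c", "n", "s", "y"),  # d6
    ("o", "c", "n", "s", "y"),  # d7
    ("s", "m", "h", "w", "n"),  # d8
    ("s", "c", "n", "w", "y"),  # d9
    ("r", "m", "n", "w", "y"),  # d10
    ("s", "m", "n", "s", "y"),  # d11
    ("o", "m", "h", "s", "y"),  # d12
    ("o", "h", "n", "w", "y"),  # d13
    ("r", "m", "h", "s", "n"),  # d14
]

pos = sum(e[-1] == "y" for e in examples)
print(f"[{pos}+, {len(examples) - pos}-]")  # [10+, 4-]: the one-node tree
                                            # "always yes" is right on 10, wrong on 4
```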
To be decided:
• How to choose the best attribute?
  - Information gain
  - Entropy (disorder)
• When to stop growing the tree?
Entropy (disorder) is bad; homogeneity is good
• Let S be a set of examples
• Entropy(S) = −P log2(P) − N log2(N)
  where P is the proportion of positive examples, N is the proportion of negative examples, and 0 log2 0 ≡ 0
• Example: S has 10 positive and 4 negative examples:
  Entropy([10+, 4−]) = −(10/14) log2(10/14) − (4/14) log2(4/14) = 0.863
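A minimal sketch of this computation in Python (the function name and signature are my own):

```python
import math

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples,
    using the convention 0 * log2(0) == 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # skip the 0 log2 0 term
            result -= p * math.log2(p)
    return result

print(round(entropy(10, 4), 3))  # 0.863, matching the slide's example
```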
Information Gain
• Measure of the expected reduction in entropy resulting from splitting along an attribute

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where Entropy(S) = −P log2(P) − N log2(N)
Gain of Splitting on Wind

Day  Wind    Tennis?
d1   weak    n
d2   strong  n
d3   weak    yes
d4   weak    yes
d5   weak    yes
d6   strong  yes
d7   strong  yes
d8   weak    n
d9   weak    yes
d10  weak    yes
d11  strong  yes
d12  strong  yes
d13  weak    yes
d14  strong  n

Values(Wind) = {weak, strong}
S        = [10+, 4−]
S_weak   = [6+, 2−]
S_strong = [3+, 3−]

Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {weak, strong}} (|S_v| / |S|) · Entropy(S_v)
              = Entropy(S) − (8/14) · Entropy(S_weak) − (6/14) · Entropy(S_strong)
              = 0.863 − (8/14)(0.811) − (6/14)(1.00) = −0.029
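Plugging the slide's counts into the formula reproduces the number above; this short sketch re-derives it (the entropy helper mirrors the one sketched earlier):

```python
import math

def entropy(pos, neg):
    """Entropy of [pos+, neg-], with 0 * log2(0) taken as 0."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total)
                for c in (pos, neg) if c > 0)

# Slide's counts: S = [10+, 4-], S_weak = [6+, 2-], S_strong = [3+, 3-]
gain = entropy(10, 4) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain, 3))  # -0.029
```

A negative gain is only possible here because the slide's subset counts do not quite add up to S (the table itself gives S_strong = [4+, 2−]); with consistent counts, information gain is never negative.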
Evaluating Attributes

[Figure: candidate one-level splits on each of the four attributes: Outlook, Temp, Humid, and Wind]

Gain(S, Humid)   = 0.375
Gain(S, Outlook) = 0.401
Gain(S, Temp)    = 0.029
Gain(S, Wind)    = −0.029

(The true values are actually slightly different than this, but ignore this detail.)
Resulting Tree…

Good day for tennis?

  Outlook
  ├─ Sunny    → No  [2+, 3−]
  ├─ Overcast → Yes [4+]
  └─ Rain     → No  [2+, 3−]
Recurse!

Restricted to the Outlook = Sunny branch:

Day  Temp  Humid  Wind    Tennis?
d1   h     h      weak    n
d2   h     h      strong  n
d8   m     h      weak    n
d9   c     n      weak    yes
d11  m     n      strong  yes
One Step Later…

  Outlook
  ├─ Sunny    → Humidity
  │    ├─ High   → No  [3−]
  │    └─ Normal → Yes [2+]
  ├─ Overcast → Yes [4+]
  └─ Rain     → No  [2+, 3−]
Decision Tree Algorithm

BuildTree(TrainingData)
    Split(TrainingData)

Split(D)
    If all points in D are of the same class, then return
    For each attribute A:
        Evaluate splits on attribute A
    Use the best split to partition D into D1 and D2
    Split(D1)
    Split(D2)
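Here is a more concrete, runnable sketch of the same recursion in Python, choosing the split by information gain. It assumes categorical attributes stored in dicts; all names (`build_tree`, `rows`, etc.) are illustrative, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(rows, labels, attr):
    """Expected entropy reduction from splitting on `attr`."""
    total = len(labels)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a dict node {attr, children}."""
    if len(set(labels)) == 1:            # all points in D are of the same class
        return labels[0]
    if not attrs:                        # no attributes left: take a majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    node = {"attr": best, "children": {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attrs if a != best])
    return node

# Tiny usage example: the Sunny-branch table from the "Recurse!" slide.
rows = [{"temp": "h", "humid": "h", "wind": "weak"},
        {"temp": "h", "humid": "h", "wind": "strong"},
        {"temp": "m", "humid": "h", "wind": "weak"},
        {"temp": "c", "humid": "n", "wind": "weak"},
        {"temp": "m", "humid": "n", "wind": "strong"}]
labels = ["n", "n", "n", "yes", "yes"]
print(build_tree(rows, labels, ["temp", "humid", "wind"]))
# e.g. {'attr': 'humid', 'children': {'h': 'n', 'n': 'yes'}}
```

Unlike the two-way Split(D1), Split(D2) pseudocode above, this sketch branches on every value of the chosen attribute, which matches the multiway trees drawn in the slides.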
Missing Data 1
Day  Temp  Humid  Wind    Tennis?
d1   h     h      weak    n
d2   h     h      strong  n
d8   m     h      weak    n
d9   c     ?      weak    yes
d11  m     n      strong  yes

d9's Humid value is missing. Options:
• Don't use this instance for learning?
• Assign the attribute a value:
  - the most common value at the node, or
  - the most common value given the classification
Fractional Values
(Same five Sunny-branch examples as above, with d9's Humid missing.)
• Treat d9 as 75% h and 25% n, the proportions among the examples whose Humid is known
• Use these fractional counts in the gain calculations
• Further subdivide if other attributes are also missing
• The same approach classifies a test example with a missing attribute:
  the classification is the most probable one, summing over the leaves where the example got divided

Resulting fractional counts at the Humidity split:
  Humid = h: [0.75+, 3−]
  Humid = n: [1.25+, 0−]
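As a sketch (the example ids and weights follow the slide; the bookkeeping style is my own), distributing d9's unit weight 0.75/0.25 across Humid's values reproduces the fractional counts above:

```python
# Known-Humid examples at this node: d1, d2, d8 are (h, negative); d11 is (n, positive).
# Among them Humid is h 3 times out of 4, so d9 (positive, Humid missing)
# becomes 0.75 of an 'h' example and 0.25 of an 'n' example.
counts = {"h": [0.0, 0.0], "n": [0.0, 0.0]}   # value -> [pos_weight, neg_weight]

for humid, is_pos, weight in [
    ("h", False, 1.0),   # d1
    ("h", False, 1.0),   # d2
    ("h", False, 1.0),   # d8
    ("n", True, 1.0),    # d11
    ("h", True, 0.75),   # d9, fractional
    ("n", True, 0.25),   # d9, fractional
]:
    counts[humid][0 if is_pos else 1] += weight

print(counts["h"])  # [0.75, 3.0]  -> [0.75+, 3-]
print(counts["n"])  # [1.25, 0.0]  -> [1.25+, 0-]
```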
Machine Learning Outline
• Machine learning
• Supervised learning
• Overfitting
  - What is the problem?
  - Reduced-error pruning
• Ensembles of classifiers
Overfitting
[Figure: accuracy (y-axis, 0.6–0.9) vs. number of nodes in the decision tree (x-axis), with one curve for accuracy on training data and one for accuracy on test data]
Overfitting…
• A decision tree DT is overfit when there exists another tree DT′ such that DT has smaller error on the training examples, but DT has bigger error on the test examples
• Causes of overfitting:
  - Noisy data, or
  - Training set is too small