CSC 4510 – Machine Learning
Dr. Mary-Angela Papalaskari
Department of Computing Sciences
Villanova University
Course website: www.csc.villanova.edu/~map/4510/
Lecture 4: Decision Trees, continued
Tiny Example of Decision Tree Learning
• Instance attributes: <size, color, shape>
• C = {positive, negative}
• D:
| Example | Size  | Color | Shape    | Category |
|---------|-------|-------|----------|----------|
| 1       | small | red   | circle   | positive |
| 2       | large | red   | circle   | positive |
| 3       | small | red   | triangle | negative |
| 4       | large | blue  | circle   | negative |
[Figure: decision tree: color? red → shape? (circle → pos, square → neg, triangle → neg); blue → neg; green → pos]
Decision Trees
• Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category.
• Can represent arbitrary conjunction and disjunction. Can represent any classification function over discrete feature vectors.
• Can be rewritten as a set of rules, i.e. disjunctive normal form (DNF):
  – first tree: red ∧ circle → pos
  – second tree: red ∧ circle → A; blue → B; red ∧ square → B; green → C; red ∧ triangle → C
[Figure, first tree: color? red → shape? (circle → pos, square → neg, triangle → neg); blue → neg; green → pos]
[Figure, second tree: color? red → shape? (circle → A, square → B, triangle → C); blue → B; green → C]
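Concretely, such a tree is just nested feature tests. Below is a minimal sketch (the representation and all names are my own, not from the course) that encodes the first tree as nested Python dicts and checks it against the training set D from the previous slide:

```python
# The learned tree as nested structures: an internal node is a
# ("feature", {value: subtree}) pair, a leaf is a category string.
tree = ("color", {
    "red": ("shape", {"circle": "positive",
                      "square": "negative",
                      "triangle": "negative"}),
    "blue": "negative",
    "green": "positive",
})

def classify(node, instance):
    """Follow the branch matching the instance's value at each feature test."""
    while not isinstance(node, str):      # descend until a leaf label is reached
        feature, branches = node
        node = branches[instance[feature]]
    return node

# The four training examples D from the table above.
D = [
    ({"size": "small", "color": "red",  "shape": "circle"},   "positive"),
    ({"size": "large", "color": "red",  "shape": "circle"},   "positive"),
    ({"size": "small", "color": "red",  "shape": "triangle"}, "negative"),
    ({"size": "large", "color": "blue", "shape": "circle"},   "negative"),
]

for x, label in D:
    assert classify(tree, x) == label     # the tree is consistent with all of D
```

Reading the rule "red ∧ circle → pos" off this structure is just enumerating root-to-leaf paths, which is why the DNF rewrite above is always possible.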
Top-Down Decision Tree Creation
<big, red, circle>: +   <small, red, circle>: +   <small, red, square>: −   <big, blue, circle>: −
[Figure: root split on color (red, blue, green); the red branch receives <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −]
Top-Down Decision Tree Creation

<big, red, circle>: +   <small, red, circle>: +   <small, red, square>: −   <big, blue, circle>: −

[Figure: the completed tree: split on color; red → split on shape (circle → pos, holding <big, red, circle>: + and <small, red, circle>: +; square → neg, holding <small, red, square>: −; triangle → neg); blue → neg, holding <big, blue, circle>: −; green → pos]
First step in detail

Start:

| Value    | Examples                                 | Count | Probability |
|----------|------------------------------------------|-------|-------------|
| positive | <big, red, circle>, <small, red, circle> | 2     | .5          |
| negative | <small, red, square>, <big, blue, circle> | 2    | .5          |

Split on color (branches: red, blue, green):

Color = red:

| Value    | Examples                                 | Count | Probability |
|----------|------------------------------------------|-------|-------------|
| positive | <big, red, circle>, <small, red, circle> | 2     | .667        |
| negative | <small, red, square>                     | 1     | .333        |

Color = blue:

| Value    | Examples            | Count | Probability |
|----------|---------------------|-------|-------------|
| positive | none                | 0     | 0           |
| negative | <big, blue, circle> | 1     | 1           |

Color = green:

| Value    | Examples | Count | Probability |
|----------|----------|-------|-------------|
| positive | none     | 0     | 0           |
| negative | none     | 0     | 0           |
First step in detail (with entropies)

The same split, with each node's entropy:
• Entropy(start) = 1 (p1 = p0 = .5)
• Entropy(red) = −(2/3)·log2(2/3) − (1/3)·log2(1/3) ≈ 0.92
• Entropy(blue) = 0
• Entropy(green) = 0
First step in detail (computing the gain)

Combining the entropies above:

Gain(Color) = Entropy(start) − (Entropy(red)·(3/4) + Entropy(blue)·(1/4) + Entropy(green)·(0/4)) = 1 − 0.69 − 0 − 0 = 0.31
First step in detail – Split on Size?

Start: as above (2 positive, 2 negative; probability .5 each).

Split on size (branches: BIG, small):

Size = BIG:

| Value    | Examples            | Count | Probability |
|----------|---------------------|-------|-------------|
| positive | <big, red, circle>  | 1     | .5          |
| negative | <big, blue, circle> | 1     | .5          |

Size = small:

| Value    | Examples             | Count | Probability |
|----------|----------------------|-------|-------------|
| positive | <small, red, circle> | 1     | .5          |
| negative | <small, red, square> | 1     | .5          |
First step in detail – Split on Size (with entropies)

The same split, with each node's entropy:
• Entropy(start) = 1
• Entropy(BIG) = 1
• Entropy(small) = 1

(every node here is an even positive/negative mix)
First step in detail – Split on Size (computing the gain)

Gain(Size) = Entropy(start) − (Entropy(BIG)·(2/4) + Entropy(small)·(2/4)) = 1 − 1/2 − 1/2 = 0

So splitting on size gains nothing, while splitting on color gains 0.31 bits.
Top-Down Decision Tree Induction
• Build a tree top-down by divide and conquer (a sketch of the procedure follows below).

<big, red, circle>: +   <small, red, circle>: +   <small, red, square>: −   <big, blue, circle>: −

[Figure: root split on color; the red branch {<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −} is split on shape (circle → pos, square → neg, triangle → neg); blue → neg, holding <big, blue, circle>: −; green → pos]
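The bullet above is the whole algorithm. A minimal Python sketch follows (my own code, not course material; it breaks gain ties by feature order and sends empty branches to the parent node's majority class, so a leaf with no examples may differ from the tree drawn on the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(examples, features, values):
    """examples: list of (instance_dict, label) pairs; features: feature names
    still available for testing; values: feature name -> list of its values."""
    labels = [y for _, y in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:   # pure node, or no tests left
        return majority
    def weighted_entropy(f):                    # expected entropy after splitting on f
        total = 0.0
        for v in values[f]:
            sub = [y for x, y in examples if x[f] == v]
            if sub:
                total += len(sub) / len(examples) * entropy(sub)
        return total
    best = min(features, key=weighted_entropy)  # max gain = min weighted entropy
    rest = [f for f in features if f != best]
    branches = {}
    for v in values[best]:
        sub = [(x, y) for x, y in examples if x[best] == v]
        branches[v] = build_tree(sub, rest, values) if sub else majority
    return (best, branches)

values = {"size": ["big", "small"],
          "color": ["red", "blue", "green"],
          "shape": ["circle", "square", "triangle"]}
D = [({"size": "big",   "color": "red",  "shape": "circle"}, "+"),
     ({"size": "small", "color": "red",  "shape": "circle"}, "+"),
     ({"size": "small", "color": "red",  "shape": "square"}, "-"),
     ({"size": "big",   "color": "blue", "shape": "circle"}, "-")]
print(build_tree(D, list(values), values))      # splits on color, then shape
```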
Another view (AIspace)
Picking a Good Split Feature
• Goal is to have the resulting tree be as small as possible, per Occam's razor.
• Finding a minimal decision tree (in nodes, leaves, or depth) is an NP-hard optimization problem.
• The top-down divide-and-conquer method does a greedy search for a simple tree but is not guaranteed to find the smallest. General lesson in ML: "Greed is good."
• We want to pick a feature that creates subsets of examples that are relatively "pure" in a single class, so they are "closer" to being leaf nodes.
• There are a variety of heuristics for picking a good test; a popular one is based on information gain and originated with Quinlan's ID3 system (1979).
Entropy
• Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:

$$Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$$

where $p_1$ is the fraction of positive examples in S and $p_0$ is the fraction of negatives.
• If all examples are in one category, entropy is zero (we define $0 \log(0) = 0$).
• If examples are equally mixed ($p_1 = p_0 = 0.5$), entropy is a maximum of 1.
• Entropy can be viewed as the number of bits required on average to encode the class of an example in S where data compression (e.g. Huffman coding) is used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$
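A few lines of Python (a sketch of my own, not course code) compute this definition and reproduce the values used in the worked example earlier:

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution; 0*log(0) is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0       -- maximally mixed, like the start node
print(entropy([2/3, 1/3]))    # 0.918...  -- the color=red subset (~0.92)
print(entropy([1.0, 0.0]))    # -0.0, i.e. zero -- all examples in one category
```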
Entropy Plot for Binary Classification
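The figure itself did not survive the transcript; assuming it is the standard plot of binary entropy against the positive-class proportion, this short matplotlib sketch regenerates it:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-6, 1 - 1e-6, 500)              # avoid log2(0) at the endpoints
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)    # binary entropy, in bits

plt.plot(p, H)
plt.xlabel("p1 (fraction of positive examples)")
plt.ylabel("Entropy(S)")
plt.title("Entropy Plot for Binary Classification")
plt.show()
```

The curve is 0 at p1 = 0 and p1 = 1 and peaks at 1 when p1 = 0.5, matching the bullet points on the previous slide.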
Information Gain
• The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:

$$Gain(S, F) = Entropy(S) - \sum_{v \in Values(F)} \frac{|S_v|}{|S|} \, Entropy(S_v)$$

where $S_v$ is the subset of S having value v for feature F.
• The entropy of each resulting subset is weighted by its relative size.
• Example summary:
  – <big, red, circle>: +   <small, red, circle>: +
  – <small, red, square>: −   <big, blue, circle>: −

Split on size: start 2+, 2− (E = 1); big → 1+, 1− (E = 1); small → 1+, 1− (E = 1)
Gain(size) = 1 − (0.5·1 + 0.5·1) = 0

Split on color: start 2+, 2− (E = 1); red → 2+, 1− (E = 0.918); blue → 0+, 1− (E = 0)
Gain(color) = 1 − (0.75·0.918 + 0.25·0) = 0.311

Split on shape: start 2+, 2− (E = 1); circle → 2+, 1− (E = 0.918); square → 0+, 1− (E = 0)
Gain(shape) = 1 − (0.75·0.918 + 0.25·0) = 0.311
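Extending the entropy sketch above, a hypothetical information_gain helper (names are mine, not the course's) reproduces the three numbers in the summary:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    """Gain(S, F): entropy of S minus the size-weighted entropy of each S_v."""
    gain = entropy([y for _, y in examples])
    for v in {x[feature] for x, _ in examples}:
        sub = [y for x, y in examples if x[feature] == v]
        gain -= len(sub) / len(examples) * entropy(sub)
    return gain

D = [({"size": "big",   "color": "red",  "shape": "circle"}, "+"),
     ({"size": "small", "color": "red",  "shape": "circle"}, "+"),
     ({"size": "small", "color": "red",  "shape": "square"}, "-"),
     ({"size": "big",   "color": "blue", "shape": "circle"}, "-")]

for f in ("size", "color", "shape"):
    print(f, round(information_gain(D, f), 3))   # size 0.0, color 0.311, shape 0.311
```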
Bias in Decision-Tree Induction
• Information gain gives a bias for trees with minimal depth.

Recall:
• Bias
  – Any criterion other than consistency with the training data that is used to select a hypothesis.
History of Decision-Tree Research
• Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
• In the late 1970s, Quinlan developed ID3 with the information gain heuristic to learn expert systems from examples.
• Simultaneously, Breiman, Friedman, and colleagues developed CART (Classification and Regression Trees), similar to ID3.
• In the 1980s, a variety of improvements were introduced to handle noise, continuous features, missing features, and improved splitting criteria. Various expert-system development tools resulted.
• Quinlan's updated decision-tree package (C4.5) was released in 1993.
• Weka includes a Java version of C4.5 called J48.
Additional Decision Tree Issues
• Better splitting criteria
  – Information gain prefers features with many values.
• Continuous features
• Predicting a real-valued function (regression trees)
• Missing feature values
• Features with costs
• Misclassification costs
• Incremental learning
  – ID4
  – ID5
• Mining large databases that do not fit in main memory
Class Exercise
• Practice using decision tree learning on some of the sample datasets available in AIspace.
Some of the slides in this presentation are adapted from:
• Prof. Frank Klassner's ML class at Villanova
• the University of Manchester ML course: http://www.cs.manchester.ac.uk/ugt/COMP24111/
• the Stanford online ML course: http://www.ml-class.org/
Resources: Datasets
• UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
• UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
• Statlib: http://lib.stat.cmu.edu/
• Delve: http://www.cs.utoronto.ca/~delve/